On Tue, 23 Aug 2022 14:58:19 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 {
+	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(vcpu))
+		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
Ok, thanks for the details, Marc. I will adopt your code in the next revision :)
Note that there can be a slight difference between the old and new code: an (especially malicious) userspace application can simply ignore the DIRTY_RING_FULL vmexit and keep calling KVM_RUN with the new code.
Unlike the old code, the 2nd/3rd/... KVM_RUN will still run in the new code until the next dirty pfn is pushed to the ring, at which point it requests the ring-full exit again.
Each time around, the ring grows by one entry.
Eventually, IIUC, it can easily fill the ring completely and trigger the warning at the entry of kvm_dirty_ring_push():
	/* It should never get full */
	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
Hmmm, yes. Well spotted.
We did that because kvm_dirty_ring_push() was designed so that it cannot fail at all (just as dirty logging never fails in the old bitmap world). It must not fail because we cannot lose any dirty page, or migration could silently go wrong (consider a userspace exit due to ring full right as migration completes; there could be unsynced pages between src and dst).
So even though the old approach needs to read kvm->dirty_ring_size on every entry, which is a pity, it avoids the issue above.
I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. Something like:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+		    kvm_dirty_ring_soft_full(vcpu)) {
+			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective).
I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look...
Thanks,
M.