On Wed, 24 Aug 2022 17:21:50 +0100, Peter Xu peterx@redhat.com wrote:
On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
On Wed, 24 Aug 2022 00:19:04 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
Atomicity doesn't guarantee ordering, unfortunately.
Right, sorry to be misleading. The "atomicity" part I was trying to say the kernel will always see consistent update on the fields.
The ordering should also be guaranteed, because things must happen with below sequence:
(1) kernel publish dirty GFN data (slot, offset) (2) kernel publish dirty GFN flag (set to DIRTY) (3) user sees DIRTY, collects (slots, offset) (4) user sets it to RESET (5) kernel reads RESET
Maybe. Maybe not. The reset could well be sitting in the CPU write buffer for as long as it wants and not be seen by the kernel if the read occurs on another CPU. And that's the crucial bit: single-CPU is fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU on collection, and global on reset (this seems like a bad decision, but it is too late to fix this).
Regarding the last statement, that's something I had question too and discussed with Paolo, even though at that time it's not a outcome of discussing memory ordering issues.
IIUC the initial design was trying to avoid tlb flush flood when vcpu number is large (each RESET per ring even for one page will need all vcpus to flush, so O(N^2) flushing needed). With global RESET it's O(N). So it's kind of a trade-off, and indeed until now I'm not sure which one is better. E.g., with per-ring reset, we can have locality too in userspace, e.g. the vcpu thread might be able to recycle without holding global locks.
I don't get that. On x86, each CPU must perform the TLB invalidation (there is an IPI for that). So whether you do a per-CPU scan of the ring or a global scan is irrelevant: each entry you find in any of the rings must result in a global invalidation, since you've updated the PTE to make the page RO.
The same is true on ARM, except that the broadcast is done in HW instead of being tracked in SW.
Buy anyway, this is all moot. The API is what it is, and it isn't going to change any time soon. All we can do is add some clarifications to the API for the more relaxed architectures, and make sure the kernel behaves accordingly.
[...]
It may be safe, but it isn't what the userspace API promises.
The document says:
After processing one or more entries in the ring buffer, userspace calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that the kernel will reprotect those collected GFNs. Therefore, the ioctl must be called *before* reading the content of the dirty pages.
I'd say it's not an explicit promise, but I think I agree with you that at least it's unclear on the behavior.
This is the least problematic part of the documentation. The bit I literally choke on is this:
<quote> It's not necessary for userspace to harvest the all dirty GFNs at once. However it must collect the dirty GFNs in sequence, i.e., the userspace program cannot skip one dirty GFN to collect the one next to it. </quote>
This is the core of the issue. Without ordering rules, the consumer on the other side cannot observe the updates correctly, even if userspace follows the above to the letter. Of course, the kernel itself must do the right thing (I guess it amounts to the kernel doing a load-acquire, and userspace doing a store-release -- effectively emulating x86...).
Since we have a global recycle mechanism, most likely the app (e.g. current qemu impl) will use the same thread to collect/reset dirty GFNs, and trigger the RESET ioctl(). In that case it's safe, IIUC, because no cross-core ops.
QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
if (total) { ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS); assert(ret == total); }
I think the assert() should never trigger as mentioned above. But ideally maybe it should just be a loop until cleared gfns match total.
Right. If userspace calls the ioctl on every vcpu, then things should work correctly. It is only that the overhead is higher than what it should be if multiple vcpus perform a reset at the same time.
In other words, without further straightening of the API, this doesn't work as expected on relaxed memory architectures. So before this gets enabled on arm64, this whole ordering issue must be addressed.
How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the possibility of recycling partial of the pages, especially when collection and the ioctl() aren't done from the same thread?
I'd rather tell people about the ordering rules. That will come at zero cost on x86.
Any suggestions will be greatly welcomed.
I'll write a couple of patch when I get the time, most likely next week. Gavin will hopefully be able to take them as part of his series.
M.