This series enables ring-based dirty memory tracking for ARM64. The feature has been available and enabled on x86 for a while. It is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found in commit fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").

The generic part is comprehensive enough that not much work is needed to extend it to ARM64.
- PATCH[1]   enables the feature on ARM64
- PATCH[2-5] improve kvm/selftests/dirty_log_test
Testing
=======
- kvm/selftests/dirty_log_test
- Live migration by QEMU
- Host with 4KB or 64KB base page size
Gavin Shan (5):
  KVM: arm64: Enable ring-based dirty memory tracking
  KVM: selftests: Use host page size to map ring buffer in dirty_log_test
  KVM: selftests: Dirty host pages in dirty_log_test
  KVM: selftests: Clear dirty ring states between two modes in dirty_log_test
  KVM: selftests: Automate choosing dirty ring size in dirty_log_test
 Documentation/virt/kvm/api.rst               |   2 +-
 arch/arm64/include/uapi/asm/kvm.h            |   1 +
 arch/arm64/kvm/Kconfig                       |   1 +
 arch/arm64/kvm/arm.c                         |   8 ++
 tools/testing/selftests/kvm/dirty_log_test.c | 101 ++++++++++++++-----
 tools/testing/selftests/kvm/lib/kvm_util.c   |   2 +-
 6 files changed, 88 insertions(+), 27 deletions(-)
The ring-based dirty memory tracking has been available and enabled on x86 for a while. The feature is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found in commit fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").

This enables the ring-based dirty memory tracking on ARM64. It's notable that no extra reserved ring entries are needed on ARM64 because huge pages are always split into base pages when page dirty tracking is enabled.
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 Documentation/virt/kvm/api.rst    | 2 +-
 arch/arm64/include/uapi/asm/kvm.h | 1 +
 arch/arm64/kvm/Kconfig            | 1 +
 arch/arm64/kvm/arm.c              | 8 ++++++++
 4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..19fa1ac017ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
 8.29 KVM_CAP_DIRTY_LOG_RING
 ---------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Parameters: args[0] - size of the dirty log ring
 
 KVM is capable of tracking dirty memory using ring buffers that are
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 3bb134355874..7e04b0b8d2b2 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -43,6 +43,7 @@
 #define __KVM_HAVE_VCPU_EVENTS
 
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 815cc118c675..0309b2d0f2da 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -32,6 +32,7 @@ menuconfig KVM
 	select KVM_VFIO
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_IRQFD
+	select HAVE_KVM_DIRTY_RING
 	select HAVE_KVM_MSI
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..3de6b9b39db7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		if (!ret)
 			ret = 1;
 
+		/* Force vcpu exit if its dirty ring is soft-full */
+		if (unlikely(vcpu->kvm->dirty_ring_size &&
+			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			ret = 0;
+		}
+
 		if (ret > 0)
 			ret = check_vcpu_requests(vcpu);
On Fri, 19 Aug 2022 01:55:57 +0100, Gavin Shan gshan@redhat.com wrote:
The ring-based dirty memory tracking has been available and enabled on x86 for a while. The feature is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found from fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").
This enables the ring-based dirty memory tracking on ARM64. It's notable that no extra reserved ring entries are needed on ARM64 because the huge pages are always split into base pages when page dirty tracking is enabled.
Can you please elaborate on this? Adding a per-CPU ring of course results in extra memory allocation, so there must be a subtle x86-specific detail that I'm not aware of...
Signed-off-by: Gavin Shan gshan@redhat.com
Documentation/virt/kvm/api.rst | 2 +- arch/arm64/include/uapi/asm/kvm.h | 1 + arch/arm64/kvm/Kconfig | 1 + arch/arm64/kvm/arm.c | 8 ++++++++ 4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index abd7c32126ce..19fa1ac017ed 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. 8.29 KVM_CAP_DIRTY_LOG_RING
-:Architectures: x86 +:Architectures: x86, arm64 :Parameters: args[0] - size of the dirty log ring KVM is capable of tracking dirty memory using ring buffers that are diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 3bb134355874..7e04b0b8d2b2 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -43,6 +43,7 @@ #define __KVM_HAVE_VCPU_EVENTS #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
For context, the documentation says:
<quote> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] </quote>
What is the reason for picking this particular value?
#define KVM_REG_SIZE(id) \ (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 815cc118c675..0309b2d0f2da 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -32,6 +32,7 @@ menuconfig KVM select KVM_VFIO select HAVE_KVM_EVENTFD select HAVE_KVM_IRQFD
+ select HAVE_KVM_DIRTY_RING select HAVE_KVM_MSI select HAVE_KVM_IRQCHIP select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 986cee6fbc7f..3de6b9b39db7 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) if (!ret) ret = 1;
/* Force vcpu exit if its dirty ring is soft-full */
if (unlikely(vcpu->kvm->dirty_ring_size &&
kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
trace_kvm_dirty_ring_exit(vcpu);
ret = 0;
}
Why can't this be moved to kvm_vcpu_exit_request() instead? I would also very much like the check to be made a common helper with x86.
A seemingly better approach would be to make this a request on dirty log insertion, and avoid the whole "check the log size" on every run, which adds pointless overhead to unsuspecting users (aka everyone).
Thanks,
M.
Hi Marc,
On 8/19/22 6:00 PM, Marc Zyngier wrote:
On Fri, 19 Aug 2022 01:55:57 +0100, Gavin Shan gshan@redhat.com wrote:
The ring-based dirty memory tracking has been available and enabled on x86 for a while. The feature is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found from fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").
This enables the ring-based dirty memory tracking on ARM64. It's notable that no extra reserved ring entries are needed on ARM64 because the huge pages are always split into base pages when page dirty tracking is enabled.
Can you please elaborate on this? Adding a per-CPU ring of course results in extra memory allocation, so there must be a subtle x86-specific detail that I'm not aware of...
Sure. I guess it's helpful to explain how it works in next revision. Something like below:
This enables the ring-based dirty memory tracking on ARM64. The feature depends on CONFIG_HAVE_KVM_DIRTY_RING, and is detected and enabled by userspace through the KVM_CAP_DIRTY_LOG_RING capability. A ring buffer is created on every vcpu and each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is pushed to by the host when a page becomes dirty and is pulled from by userspace. A vcpu exit is forced when the ring buffer becomes full. The ring buffers on all vcpus can be reset by the KVM_RESET_DIRTY_RINGS ioctl.
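For illustration, the userspace side of that flow could look roughly like the minimal sketch below. The struct kvm_dirty_gfn layout, the KVM_DIRTY_GFN_F_DIRTY/KVM_DIRTY_GFN_F_RESET flags and the KVM_RESET_DIRTY_RINGS ioctl are the existing UAPI (from <linux/kvm.h>); the helper names and the surrounding VMM plumbing (how the ring was mmap'ed and sized) are made up for the example.

	/* Hypothetical VMM-side harvesting loop. 'gfns' points at the per-vcpu
	 * ring previously mmap'ed at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE on
	 * the vcpu fd; 'nents' is the ring size enabled via
	 * KVM_CAP_DIRTY_LOG_RING. */
	static void harvest_dirty_ring(struct kvm_dirty_gfn *gfns, uint32_t nents,
				       uint32_t *fetch_index)
	{
		for (;;) {
			struct kvm_dirty_gfn *e = &gfns[*fetch_index % nents];

			if (!(e->flags & KVM_DIRTY_GFN_F_DIRTY))
				break;			/* nothing more to collect */

			vmm_mark_page_dirty(e->slot, e->offset); /* VMM-specific bookkeeping */

			e->flags = KVM_DIRTY_GFN_F_RESET;	/* hand the entry back */
			(*fetch_index)++;
		}

		/* Once entries have been collected on the vcpus, a single
		 * ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0) lets KVM recycle them. */
	}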
Yes, I think so. Adding a per-CPU ring results in extra memory allocation. However, it avoids synchronization among multiple vcpus when dirty pages are produced on multiple vcpus. More discussion can be found in [1].
[1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA7... (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
Signed-off-by: Gavin Shan gshan@redhat.com
Documentation/virt/kvm/api.rst | 2 +- arch/arm64/include/uapi/asm/kvm.h | 1 + arch/arm64/kvm/Kconfig | 1 + arch/arm64/kvm/arm.c | 8 ++++++++ 4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index abd7c32126ce..19fa1ac017ed 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. 8.29 KVM_CAP_DIRTY_LOG_RING
-:Architectures: x86 +:Architectures: x86, arm64 :Parameters: args[0] - size of the dirty log ring KVM is capable of tracking dirty memory using ring buffers that are diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 3bb134355874..7e04b0b8d2b2 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -43,6 +43,7 @@ #define __KVM_HAVE_VCPU_EVENTS #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
For context, the documentation says:
<quote> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] </quote>
What is the reason for picking this particular value?
It's inherited from x86. I don't think it has to be this particular value. The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
How about having 2 for KVM_DIRTY_LOG_PAGE_OFFSET in the next revision? The virtual area is cheap, though I guess it's also nice to follow x86's pattern and have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
#define KVM_DIRTY_LOG_PAGE_OFFSET 2
#define KVM_REG_SIZE(id) \ (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 815cc118c675..0309b2d0f2da 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -32,6 +32,7 @@ menuconfig KVM select KVM_VFIO select HAVE_KVM_EVENTFD select HAVE_KVM_IRQFD
+ select HAVE_KVM_DIRTY_RING select HAVE_KVM_MSI select HAVE_KVM_IRQCHIP select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 986cee6fbc7f..3de6b9b39db7 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) if (!ret) ret = 1;
/* Force vcpu exit if its dirty ring is soft-full */
if (unlikely(vcpu->kvm->dirty_ring_size &&
kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
trace_kvm_dirty_ring_exit(vcpu);
ret = 0;
}
Why can't this be moved to kvm_vcpu_exit_request() instead? I would also very much like the check to be made a common helper with x86.
A seemingly better approach would be to make this a request on dirty log insertion, and avoid the whole "check the log size" on every run, which adds pointless overhead to unsuspecting users (aka everyone).
I thought of having the check in kvm_vcpu_exit_request(). The various exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the highest priority and ARM64 just follows. I don't think it really matters. I will improve it accordingly in the next revision:
- Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;
	struct kvm_dirty_ring *ring = &vcpu->dirty_ring;

	if (unlikely(kvm->dirty_ring_size &&
		     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
		trace_kvm_dirty_ring_exit(vcpu);
		return true;
	}

	return false;
}
- Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time. Otherwise, the dirty log in the ring buffer will be overwritten. I'm not sure if there is anything else I missed?
Thanks, Gavin
Hi, Gavin,
On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote:
For context, the documentation says:
<quote> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] </quote>
What is the reason for picking this particular value?
It's inherited from x86. I don't think it has to be this particular value. The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? The virtual area is cheap, I guess it's also nice to use x86's pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_DIRTY_LOG_PAGE_OFFSET 2
It was chosen not to be contiguous with the previously used offsets because it'll be the first vcpu region that can cover multiple and a dynamic number of pages. I wanted to leave 3-63 (x86 has already used offset 2) for small fields so they can stay contiguous, which looks a little bit cleaner.

Currently, how many pages it'll use depends on the size set by the user, though there's a maximum size limited by KVM_DIRTY_RING_MAX_ENTRIES, which corresponds to 1MB of memory.

So I think setting it to 2 is okay, as long as we keep the rest of the 1MB address space for the per-vcpu ring structure, so any new vcpu fields (even if only one page is needed) need to be placed after that maximum size of the ring.
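As a sanity check of that 1MB figure, assuming the current KVM_DIRTY_RING_MAX_ENTRIES value of 65536 and a 16-byte struct kvm_dirty_gfn (both may of course change in future kernels), the arithmetic is:

	#include <stdio.h>

	#define MAX_ENTRIES    65536	/* assumed KVM_DIRTY_RING_MAX_ENTRIES */
	#define GFN_ENTRY_SIZE 16	/* assumed sizeof(struct kvm_dirty_gfn) */

	int main(void)
	{
		unsigned long bytes = (unsigned long)MAX_ENTRIES * GFN_ENTRY_SIZE;

		/* 65536 * 16 = 1048576 bytes = 1MB, i.e. 256 pages of 4KB */
		printf("max ring: %lu bytes, %lu x 4KB pages\n", bytes, bytes / 4096);
		return 0;
	}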
Thanks,
Hi Peter,
On 8/23/22 4:55 AM, Peter Xu wrote:
On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote:
For context, the documentation says:
<quote> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] </quote>
What is the reason for picking this particular value?
It's inherited from x86. I don't think it has to be this particular value. The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? The virtual area is cheap, I guess it's also nice to use x86's pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_DIRTY_LOG_PAGE_OFFSET 2
It was chosen not to be continuous of previous used offset because it'll be the 1st vcpu region that can cover multiple & dynamic number of pages. I wanted to leave the 3-63 (x86 used offset 2 already) for small fields so they can be continuous, which looks a little bit cleaner.
Currently how many pages it'll use depends on the size set by the user, though there's a max size limited by KVM_DIRTY_RING_MAX_ENTRIES, which is a maximum of 1MB memory.
So I think setting it to 2 is okay, as long as we keep the rest 1MB address space for the per-vcpu ring structure, so any new vcpu fields (even if only 1 page will be needed) need to be after that maximum size of the ring.
Thanks for the details. I think I'd keep the layout since the virtual area is really cheap. So let's keep it as it is, if Marc doesn't object. In this way, a new area, which is usually one page in size, can go after KVM_COALESCED_MMIO_PAGE_OFFSET.
#define KVM_DIRTY_LOG_PAGE_OFFSET 64
Thanks, Gavin
Hi Gavin,
On Mon, 22 Aug 2022 02:58:20 +0100, Gavin Shan gshan@redhat.com wrote:
Hi Marc,
On 8/19/22 6:00 PM, Marc Zyngier wrote:
On Fri, 19 Aug 2022 01:55:57 +0100, Gavin Shan gshan@redhat.com wrote:
The ring-based dirty memory tracking has been available and enabled on x86 for a while. The feature is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found from fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").
This enables the ring-based dirty memory tracking on ARM64. It's notable that no extra reserved ring entries are needed on ARM64 because the huge pages are always split into base pages when page dirty tracking is enabled.
Can you please elaborate on this? Adding a per-CPU ring of course results in extra memory allocation, so there must be a subtle x86-specific detail that I'm not aware of...
Sure. I guess it's helpful to explain how it works in next revision. Something like below:
This enables the ring-based dirty memory tracking on ARM64. The feature depends on CONFIG_HAVE_KVM_DIRTY_RING, and is detected and enabled by userspace through the KVM_CAP_DIRTY_LOG_RING capability. A ring buffer is created on every vcpu and each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is pushed to by the host when a page becomes dirty and is pulled from by userspace. A vcpu exit is forced when the ring buffer becomes full. The ring buffers on all vcpus can be reset by the KVM_RESET_DIRTY_RINGS ioctl.

Yes, I think so. Adding a per-CPU ring results in extra memory allocation. However, it avoids synchronization among multiple vcpus when dirty pages are produced on multiple vcpus. More discussion can be found in [1].
Oh, I totally buy the relaxation of the synchronisation (though I doubt this will have any visible effect until we have something like Oliver's patches to allow parallel faulting).
But it is the "no extra reserved ring entries are needed on ARM64" argument that I don't get yet.
[1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA7... (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
Signed-off-by: Gavin Shan gshan@redhat.com
Documentation/virt/kvm/api.rst | 2 +- arch/arm64/include/uapi/asm/kvm.h | 1 + arch/arm64/kvm/Kconfig | 1 + arch/arm64/kvm/arm.c | 8 ++++++++ 4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index abd7c32126ce..19fa1ac017ed 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. 8.29 KVM_CAP_DIRTY_LOG_RING
-:Architectures: x86 +:Architectures: x86, arm64 :Parameters: args[0] - size of the dirty log ring KVM is capable of tracking dirty memory using ring buffers that are diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 3bb134355874..7e04b0b8d2b2 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -43,6 +43,7 @@ #define __KVM_HAVE_VCPU_EVENTS #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
For context, the documentation says:
<quote> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] </quote>
What is the reason for picking this particular value?
It's inherited from x86. I don't think it has to be this particular value. The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? The virtual area is cheap, I guess it's also nice to use x86's pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_DIRTY_LOG_PAGE_OFFSET 2
Given that this is just an offset in the vcpu "file", I don't think it matters that much. 64 definitely allows for some struct vcpu growth, and it doesn't hurt to be compatible with x86 (for once...).
#define KVM_REG_SIZE(id)
\ (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 815cc118c675..0309b2d0f2da 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -32,6 +32,7 @@ menuconfig KVM select KVM_VFIO select HAVE_KVM_EVENTFD select HAVE_KVM_IRQFD
+ select HAVE_KVM_DIRTY_RING select HAVE_KVM_MSI select HAVE_KVM_IRQCHIP select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 986cee6fbc7f..3de6b9b39db7 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) if (!ret) ret = 1;
/* Force vcpu exit if its dirty ring is soft-full */
if (unlikely(vcpu->kvm->dirty_ring_size &&
kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
trace_kvm_dirty_ring_exit(vcpu);
ret = 0;
}
Why can't this be moved to kvm_vcpu_exit_request() instead? I would also very much like the check to be made a common helper with x86.
A seemingly better approach would be to make this a request on dirty log insertion, and avoid the whole "check the log size" on every run, which adds pointless overhead to unsuspecting users (aka everyone).
I thought of having the check in kvm_vcpu_exit_request(). The various exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the highest priority and ARM64 just follows. I don't think it really matters. I will improve it accordingly in the next revision:
Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;
	struct kvm_dirty_ring *ring = &vcpu->dirty_ring;

	if (unlikely(kvm->dirty_ring_size &&
		     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
		trace_kvm_dirty_ring_exit(vcpu);
		return true;
	}

	return false;
}
Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time. Otherwise, the dirty log in the ring buffer will be overwritten. I'm not sure if there is anything else I missed?
I'm fine with the above, but what I really wanted is a request from the dirty-ring insertion, instead of a check in kvm_vcpu_exit_request(). Something like this (which obviously doesn't compile, but you'll get the idea):
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 {
+	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(vcpu))
+		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
Thanks,
M.
Hi Marc,
On 8/23/22 7:42 AM, Marc Zyngier wrote:
On Mon, 22 Aug 2022 02:58:20 +0100, Gavin Shan gshan@redhat.com wrote:
On 8/19/22 6:00 PM, Marc Zyngier wrote:
On Fri, 19 Aug 2022 01:55:57 +0100, Gavin Shan gshan@redhat.com wrote:
The ring-based dirty memory tracking has been available and enabled on x86 for a while. The feature is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found from fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").
This enables the ring-based dirty memory tracking on ARM64. It's notable that no extra reserved ring entries are needed on ARM64 because the huge pages are always split into base pages when page dirty tracking is enabled.
Can you please elaborate on this? Adding a per-CPU ring of course results in extra memory allocation, so there must be a subtle x86-specific detail that I'm not aware of...
Sure. I guess it's helpful to explain how it works in next revision. Something like below:
This enables the ring-based dirty memory tracking on ARM64. The feature depends on CONFIG_HAVE_KVM_DIRTY_RING, and is detected and enabled by userspace through the KVM_CAP_DIRTY_LOG_RING capability. A ring buffer is created on every vcpu and each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is pushed to by the host when a page becomes dirty and is pulled from by userspace. A vcpu exit is forced when the ring buffer becomes full. The ring buffers on all vcpus can be reset by the KVM_RESET_DIRTY_RINGS ioctl.

Yes, I think so. Adding a per-CPU ring results in extra memory allocation. However, it avoids synchronization among multiple vcpus when dirty pages are produced on multiple vcpus. More discussion can be found in [1].
Oh, I totally buy the relaxation of the synchronisation (though I doubt this will have any visible effect until we have something like Oliver's patches to allow parallel faulting).
But it is the "no extra reserved ring entries are needed on ARM64" argument that I don't get yet.
Ok. The extra reserved ring entries are x86 specific. When x86's PML (Page Modification Logging) hardware capability is enabled, the vcpu exits when the PML buffer, which holds 512 entries, becomes full. All the information in the PML buffer is pushed to the dirty ring buffer in one go. To avoid overrunning the dirty ring buffer, 512 entries are reserved.
=== include/linux/kvm_host.h
#define KVM_DIRTY_RING_RSVD_ENTRIES 64 // fixed and reserved ring entries
=== virt/kvm/dirty_ring.c
int __weak kvm_cpu_dirty_log_size(void) { return 0; }
u32 kvm_dirty_ring_get_rsvd_entries(void)
{
	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
}
=== arch/x86/kvm/mmu/mmu.c
int kvm_cpu_dirty_log_size(void)
{
	return kvm_x86_ops.cpu_dirty_log_size;	/* Set to 512 when PML is enabled */
}
kvm_cpu_dirty_log_size() isn't overridden by ARM64, meaning it returns zero on ARM64. On x86, it returns 512 when PML is enabled.
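For completeness, those reserved entries end up in the ring's soft limit at allocation time; roughly (a lightly edited copy of kvm_dirty_ring_alloc() from virt/kvm/dirty_ring.c as of this series; exact details may differ in later kernels):

	int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
	{
		ring->dirty_gfns = vzalloc(size * sizeof(struct kvm_dirty_gfn));
		if (!ring->dirty_gfns)
			return -ENOMEM;

		ring->size = size;
		/* Exit to userspace this many entries early, so a final burst
		 * (e.g. a full 512-entry PML buffer on x86) still fits. */
		ring->soft_limit = size - kvm_dirty_ring_get_rsvd_entries();
		ring->dirty_index = 0;
		ring->reset_index = 0;
		ring->index = index;

		return 0;
	}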
[1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA7... (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
Signed-off-by: Gavin Shan gshan@redhat.com
Documentation/virt/kvm/api.rst | 2 +- arch/arm64/include/uapi/asm/kvm.h | 1 + arch/arm64/kvm/Kconfig | 1 + arch/arm64/kvm/arm.c | 8 ++++++++ 4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index abd7c32126ce..19fa1ac017ed 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. 8.29 KVM_CAP_DIRTY_LOG_RING
-:Architectures: x86 +:Architectures: x86, arm64 :Parameters: args[0] - size of the dirty log ring KVM is capable of tracking dirty memory using ring buffers that are diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 3bb134355874..7e04b0b8d2b2 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -43,6 +43,7 @@ #define __KVM_HAVE_VCPU_EVENTS #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
For context, the documentation says:
<quote> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] </quote>
What is the reason for picking this particular value?
It's inherited from x86. I don't think it has to be this particular value. The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? The virtual area is cheap, I guess it's also nice to use x86's pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_DIRTY_LOG_PAGE_OFFSET 2
Given that this is just an offset in the vcpu "file", I don't think it matters that much. 64 definitely allows for some struct vcpu growth, and it doesn't hurt to be compatible with x86 (for once...).
Sure, thanks. I think it'd be better to have the same pattern as x86.
#define KVM_REG_SIZE(id)
\ (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 815cc118c675..0309b2d0f2da 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -32,6 +32,7 @@ menuconfig KVM select KVM_VFIO select HAVE_KVM_EVENTFD select HAVE_KVM_IRQFD
+ select HAVE_KVM_DIRTY_RING select HAVE_KVM_MSI select HAVE_KVM_IRQCHIP select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 986cee6fbc7f..3de6b9b39db7 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) if (!ret) ret = 1;
/* Force vcpu exit if its dirty ring is soft-full */
if (unlikely(vcpu->kvm->dirty_ring_size &&
kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
trace_kvm_dirty_ring_exit(vcpu);
ret = 0;
}
Why can't this be moved to kvm_vcpu_exit_request() instead? I would also very much like the check to be made a common helper with x86.
A seemingly better approach would be to make this a request on dirty log insertion, and avoid the whole "check the log size" on every run, which adds pointless overhead to unsuspecting users (aka everyone).
I thought of having the check in kvm_vcpu_exit_request(). The various exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the highest priority and ARM64 just follows. I don't think it really matters. I will improve it accordingly in the next revision:
Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;
	struct kvm_dirty_ring *ring = &vcpu->dirty_ring;

	if (unlikely(kvm->dirty_ring_size &&
		     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
		trace_kvm_dirty_ring_exit(vcpu);
		return true;
	}

	return false;
}
Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time. Otherwise, the dirty log in the ring buffer will be overwritten. I'm not sure if there is anything else I missed?
I'm fine with the above, but what I really wanted is a request from the dirty-ring insertion, instead of a check in kvm_vcpu_exit_request(). Something like this (which obviously doesn't compile, but you'll get the idea):
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 {
+	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(vcpu))
+		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
Ok, thanks for the details, Marc. I will adopt your code in next revision :)
Thanks, Gavin
On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 {
+	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(vcpu))
+		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
Ok, thanks for the details, Marc. I will adopt your code in next revision :)
Note that there can be a slight difference between the old and new code, in that an (especially malicious) userspace app can logically ignore the DIRTY_RING_FULL vmexit and keep kicking KVM_RUN with the new code.

Unlike the old code, the 2nd/3rd/... KVM_RUN will still run with the new code until the next dirty pfn is pushed to the ring, and then it'll request a ring-full exit again.

Each time it exits, the ring grows by one entry.

In the end, IIUC, it can easily hit ring full and trigger the warning at the entry of kvm_dirty_ring_push():

	/* It should never get full */
	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
We did that because kvm_dirty_ring_push() was designed not to be able to fail at all (e.g., in the old bitmap world we could never fail either). It cannot fail because we can't lose any dirty page, or migration could silently fail (consider when we do a user exit due to ring full and migration has just completed; there could be unsynced pages on src/dst).

So even though the old approach needs to read kvm->dirty_ring_size on every entry, which is a pity, it avoids the issue above.

Side note: for x86 the dirty ring check was put at the entrance not because it needs to be the highest priority - it should really be the same as checking kvm requests. It's just that it's the fastest way to fail properly, if needed, before loading the MMU, disabling preemption/irq, etc.
Thanks,
On Tue, 23 Aug 2022 14:58:19 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 {
+	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(vcpu))
+		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
Ok, thanks for the details, Marc. I will adopt your code in next revision :)
Note that there can be a slight difference between the old and new code, in that an (especially malicious) userspace app can logically ignore the DIRTY_RING_FULL vmexit and keep kicking KVM_RUN with the new code.

Unlike the old code, the 2nd/3rd/... KVM_RUN will still run with the new code until the next dirty pfn is pushed to the ring, and then it'll request a ring-full exit again.

Each time it exits, the ring grows by one entry.

In the end, IIUC, it can easily hit ring full and trigger the warning at the entry of kvm_dirty_ring_push():

	/* It should never get full */
	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
Hmmm, yes. Well spotted.
We did that because kvm_dirty_ring_push() was designed not to be able to fail at all (e.g., in the old bitmap world we could never fail either). It cannot fail because we can't lose any dirty page, or migration could silently fail (consider when we do a user exit due to ring full and migration has just completed; there could be unsynced pages on src/dst).

So even though the old approach needs to read kvm->dirty_ring_size on every entry, which is a pity, it avoids the issue above.
I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. Something like:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+		    kvm_dirty_ring_soft_full(vcpu)) {
+			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective).
I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look...
Thanks,
M.
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. Something like:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+		    kvm_dirty_ring_soft_full(vcpu)) {
+			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
Right, this seems to work. We can also use kvm_test_request() here.
However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective).
I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look...
I definitely don't think I understand 100% of the ordering aspects since they're complicated.. but my understanding is that the reset procedure doesn't need a memory barrier (unlike pushing, where we have an explicit wmb), because we assumed the userspace app is not hostile, so logically it should only modify the flags, which is a 32-bit field, with atomicity guaranteed.
IIRC we used to discuss similar questions on "what if the user is hostile and wants to hack the process by messing up with the ring", and our conclusion was as long as the process wouldn't mess up anything outside itself it should be okay. E.g. It should not be able to either cause the host to misfunction, or trigger kernel warnings in dmesg, etc..
Thanks,
On Tue, 23 Aug 2022 22:20:32 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. Something like:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+		    kvm_dirty_ring_soft_full(vcpu)) {
+			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
Right, this seems to work. We can also use kvm_test_request() here.
However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective).
I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look...
I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure didn't need memory barrier (unlike pushing, where we have explicit wmb), because we assumed the userapp is not hostile so logically it should only modify the flags which is a 32bit field, assuming atomicity guaranteed.
Atomicity doesn't guarantee ordering, unfortunately. Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state.
CPU0:
    write_flag(A, KVM_DIRTY_GFN_F_RESET)
    write_flag(B, KVM_DIRTY_GFN_F_RESET)
    write_flag(C, KVM_DIRTY_GFN_F_RESET)
    write_flag(D, KVM_DIRTY_GFN_F_RESET)
    [...]

CPU1:
    ioctl(KVM_RESET_DIRTY_RINGS)
Since CPU0's writes do not have any ordering, CPU1 can observe the writes in a sequence that has nothing to do with program order, and could for example observe that GFN A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each write was a store-release (which is the case on x86), it wouldn't be a problem, but nothing calls for it in the documentation.
Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own writes. But expecting this to work across CPUs without any barrier is wishful thinking.
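To make that concrete, one way to give the flags field the required semantics would be an acquire/release pairing across the user/kernel boundary. A sketch under that assumption (the kernel helper name follows the harvesting helper in virt/kvm/dirty_ring.c, the userspace side is purely illustrative, and none of this is in the code as posted):

	/* Userspace: publish RESET with release semantics, so the kernel cannot
	 * observe the flag before the entry has actually been collected. */
	static inline void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
	{
		__atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET, __ATOMIC_RELEASE);
	}

	/* Kernel: read the flags with acquire semantics before acting on the
	 * entry in the KVM_RESET_DIRTY_RINGS path. */
	static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
	{
		return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
	}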
IIRC we used to discuss similar questions on "what if the user is hostile and wants to hack the process by messing up with the ring", and our conclusion was as long as the process wouldn't mess up anything outside itself it should be okay. E.g. It should not be able to either cause the host to misfunction, or trigger kernel warnings in dmesg, etc..
I'm not even discussing safety here. I'm purely discussing the interactions between userspace and kernel based on the documentation and the existing kernel code.
Thanks,
M.
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
On Tue, 23 Aug 2022 22:20:32 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. Something like:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+		    kvm_dirty_ring_soft_full(vcpu)) {
+			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
Right, this seems to work. We can also use kvm_test_request() here.
However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective).
I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look...
I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure didn't need memory barrier (unlike pushing, where we have explicit wmb), because we assumed the userapp is not hostile so logically it should only modify the flags which is a 32bit field, assuming atomicity guaranteed.
Atomicity doesn't guarantee ordering, unfortunately.
Right, sorry to be misleading. By the "atomicity" part I was trying to say that the kernel will always see a consistent update of the fields.
The ordering should also be guaranteed, because things must happen in the sequence below:

(1) kernel publishes dirty GFN data (slot, offset)
(2) kernel publishes dirty GFN flag (set to DIRTY)
(3) user sees DIRTY, collects (slot, offset)
(4) user sets it to RESET
(5) kernel reads RESET

So the ordering of a single entry is guaranteed, in that when (5) happens it must be after (1+2) have stabilized.
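For reference, the kernel-side half of (1)/(2) is already ordered by a write barrier in the push path; roughly, paraphrasing kvm_dirty_ring_push() in virt/kvm/dirty_ring.c:

	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

	entry->slot = slot;			/* (1) publish the dirty GFN data */
	entry->offset = offset;
	/* Make sure the payload is visible before the DIRTY flag is. */
	smp_wmb();
	kvm_dirty_gfn_set_dirtied(entry);	/* (2) publish the DIRTY flag     */
	ring->dirty_index++;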
Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state.
CPU0:
    write_flag(A, KVM_DIRTY_GFN_F_RESET)
    write_flag(B, KVM_DIRTY_GFN_F_RESET)
    write_flag(C, KVM_DIRTY_GFN_F_RESET)
    write_flag(D, KVM_DIRTY_GFN_F_RESET)
    [...]

CPU1:
    ioctl(KVM_RESET_DIRTY_RINGS)
Since CPU0 writes do not have any ordering, CPU1 can observe the writes in a sequence that have nothing to do with program order, and could for example observe that GFN A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each was a store-release (which is the case on x86), it wouldn't be a problem, but nothing calls it in the documentation.
Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own writes. But expecting this to work across CPUs without any barrier is wishful thinking.
I see what you meant...
Firstly I'm actually curious whether that'll really happen if the gfns are collected in something like a for loop:
for (i = 0; i < N; i++)
    collect_dirty_gfn(ring, i);
Because all the gfns to be read will depend on the variable "i", IIUC no reordering should happen, but I'm not really sure, so this is more of a pure question.
Besides, the other thing to mention is that I think it is fine if the RESET ioctl doesn't recycle all the gfns that got set to the reset state. Taking the above example of GFNs A-D: if, when reaching the RESET ioctl, only A & D's flags are updated, the ioctl will recycle gfn A but stop at gfn B, assuming B-D are not reset. But IMHO it's okay, because it means we reset some of the gfns rather than all of them, and it's safe to do so. It means the next ring-full event can come earlier because we recycled less, but that's functionally safe to me.
IIRC we used to discuss similar questions on "what if the user is hostile and wants to hack the process by messing up with the ring", and our conclusion was as long as the process wouldn't mess up anything outside itself it should be okay. E.g. It should not be able to either cause the host to misfunction, or trigger kernel warnings in dmesg, etc..
I'm not even discussing safety here. I'm purely discussing the interactions between userspace and kernel based on the documentation and the existing kernel code.
I see. Please let me know if the above makes sense.
Thanks!
On Wed, 24 Aug 2022 00:19:04 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
On Tue, 23 Aug 2022 22:20:32 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. Something like:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+		    kvm_dirty_ring_soft_full(vcpu)) {
+			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
Right, this seems to work. We can also use kvm_test_request() here.
However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective).
I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look...
I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure didn't need memory barrier (unlike pushing, where we have explicit wmb), because we assumed the userapp is not hostile so logically it should only modify the flags which is a 32bit field, assuming atomicity guaranteed.
Atomicity doesn't guarantee ordering, unfortunately.
Right, sorry to be misleading. The "atomicity" part I was trying to say the kernel will always see consistent update on the fields.
The ordering should also be guaranteed, because things must happen with below sequence:
(1) kernel publish dirty GFN data (slot, offset)
(2) kernel publish dirty GFN flag (set to DIRTY)
(3) user sees DIRTY, collects (slots, offset)
(4) user sets it to RESET
(5) kernel reads RESET
Maybe. Maybe not. The reset could well be sitting in the CPU write buffer for as long as it wants and not be seen by the kernel if the read occurs on another CPU. And that's the crucial bit: single-CPU is fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU on collection, and global on reset (this seems like a bad decision, but it is too late to fix this).
So the ordering of a single entry is guaranteed, in that when (5) happens it must be after (1+2) have stabilized.
Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state.
CPU0:
    write_flag(A, KVM_DIRTY_GFN_F_RESET)
    write_flag(B, KVM_DIRTY_GFN_F_RESET)
    write_flag(C, KVM_DIRTY_GFN_F_RESET)
    write_flag(D, KVM_DIRTY_GFN_F_RESET)
    [...]

CPU1:
    ioctl(KVM_RESET_DIRTY_RINGS)
Since CPU0 writes do not have any ordering, CPU1 can observe the writes in a sequence that have nothing to do with program order, and could for example observe that GFN A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each was a store-release (which is the case on x86), it wouldn't be a problem, but nothing calls it in the documentation.
Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own writes. But expecting this to work across CPUs without any barrier is wishful thinking.
I see what you meant...
Firstly I'm actually curious whether that'll really happen if the gfns are collected in something like a for loop:
for (i = 0; i < N; i++)
    collect_dirty_gfn(ring, i);
Because all the gfns to be read will depend on the variable "i", IIUC no reordering should happen, but I'm not really sure, so this is more of a pure question.
'i' has no influence on the write ordering. Each write targets a different address, and there are no inter-write dependencies (this concept doesn't exist other than for writes to the same address), so they can be reordered at will.
If you want a proof of this, head to http://diy.inria.fr/www/ and run the MP.litmus test (which conveniently gives you a reduction of this problem) on both the x86 and AArch64 models. You will see that the reordering isn't allowed on x86, but definitely allowed on arm64.
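For readers without a litmus tool at hand, the message-passing shape in question boils down to the following (illustrative C, not code from the series; flag_A/flag_B stand for two flags fields in the ring, and WRITE_ONCE/READ_ONCE are used only to denote plain, unordered accesses):

	/* Thread 0: userspace resetting two entries in ring order */
	WRITE_ONCE(flag_A, KVM_DIRTY_GFN_F_RESET);
	WRITE_ONCE(flag_B, KVM_DIRTY_GFN_F_RESET);

	/* Thread 1: kernel walking the ring during KVM_RESET_DIRTY_RINGS */
	r0 = READ_ONCE(flag_B);
	r1 = READ_ONCE(flag_A);

	/*
	 * The outcome "r0 == RESET && r1 != RESET" is forbidden on x86 (TSO
	 * keeps both the stores and the loads in order) but is allowed on
	 * arm64 unless the stores are store-releases or the loads are
	 * load-acquires.
	 */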
Besides, the other thing to mention is that I think it is fine the RESET ioctl didn't recycle all the gfns got set to reset state. Taking above example of GFNs A-D, if when reaching the RESET ioctl only A & D's flags are updated, the ioctl will recycle gfn A but stop at gfn B assuming B-D are not reset. But IMHO it's okay because it means we reset partial of the gfns not all of them, and it's safe to do so. It means the next ring full event can come earlier because we recycled less, but that's functionally safe to me.
It may be safe, but it isn't what the userspace API promises. In other words, without further straightening of the API, this doesn't work as expected on relaxed memory architectures. So before this gets enabled on arm64, this whole ordering issue must be addressed.
Thanks,
M.
On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
On Wed, 24 Aug 2022 00:19:04 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
On Tue, 23 Aug 2022 22:20:32 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. Something like:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
			return kvm_vcpu_suspend(vcpu);

		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
		    kvm_dirty_ring_soft_full(vcpu)) {
			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
			trace_kvm_dirty_ring_exit(vcpu);
			return 0;
		}
	}

	return 1;
Right, this seems to work. We can also use kvm_test_request() here.
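For illustration, a kvm_test_request() variant of the hunk above might look like the sketch below (it reuses the request name and the helper signature from the diff; whether the request stays pending or is cleared once the ring drains is exactly the "sticky" behaviour being discussed):

	/* Sketch only: test the request without consuming it, and drop it
	 * once the ring is no longer soft-full. */
	if (kvm_test_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
		if (kvm_dirty_ring_soft_full(vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
			trace_kvm_dirty_ring_exit(vcpu);
			return 0;
		}
		kvm_clear_request(KVM_REQ_RING_SOFT_FULL, vcpu);
	}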
However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective).
I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look...
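One way to read the CAS suggestion, as a hedged sketch on the userspace side (gfn is assumed to point at the struct kvm_dirty_gfn entry being collected): a single compare-and-swap from DIRTY to RESET would combine the flag transition with full ordering.

	__u32 expected = KVM_DIRTY_GFN_F_DIRTY;

	/* Sketch: a sequentially-consistent CAS gives both atomicity and
	 * ordering; whether this is the right primitive is exactly what is
	 * being debated in this thread. */
	if (__atomic_compare_exchange_n(&gfn->flags, &expected,
					KVM_DIRTY_GFN_F_RESET, false,
					__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
		/* entry harvested and marked for reset */
	}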
I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure doesn't need a memory barrier (unlike pushing, where we have an explicit wmb), because we assume the userspace app is not hostile, so logically it should only modify the flags, which is a 32-bit field, assuming atomicity is guaranteed.
Atomicity doesn't guarantee ordering, unfortunately.
Right, sorry to be misleading. By "atomicity" I was trying to say that the kernel will always see a consistent update of the fields.

The ordering should also be guaranteed, because things must happen in the sequence below:

  (1) kernel publishes dirty GFN data (slot, offset)
  (2) kernel publishes dirty GFN flag (set to DIRTY)
  (3) user sees DIRTY, collects (slot, offset)
  (4) user sets it to RESET
  (5) kernel reads RESET
Maybe. Maybe not. The reset could well be sitting in the CPU write buffer for as long as it wants and not be seen by the kernel if the read occurs on another CPU. And that's the crucial bit: single-CPU is fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU on collection, and global on reset (this seems like a bad decision, but it is too late to fix this).
Regarding the last statement, that's something I had questions about too and discussed with Paolo, even though at that time it wasn't an outcome of discussing memory ordering issues.
IIUC the initial design was trying to avoid a TLB flush flood when the number of vcpus is large (a per-ring RESET, even for one page, needs all vcpus to flush, so O(N^2) flushes are needed; with a global RESET it's O(N)). So it's kind of a trade-off, and even now I'm not sure which one is better. E.g., with a per-ring reset we could also have locality in userspace, e.g. the vcpu thread might be able to recycle without holding global locks.
Regarding safety I hope I covered that below in previous reply.
So the ordering for a single entry is guaranteed, in that when (5) happens it must be after (1) and (2) have stabilized.
Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state.
CPU0:
    write_flag(A, KVM_DIRTY_GFN_F_RESET)
    write_flag(B, KVM_DIRTY_GFN_F_RESET)
    write_flag(C, KVM_DIRTY_GFN_F_RESET)
    write_flag(D, KVM_DIRTY_GFN_F_RESET)
    [...]

CPU1:
    ioctl(KVM_RESET_DIRTY_RINGS)
Since CPU0's writes do not have any ordering, CPU1 can observe the writes in a sequence that has nothing to do with program order, and could for example observe that GFNs A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each write was a store-release (which is effectively what you get on x86), it wouldn't be a problem, but nothing in the documentation calls for it.
Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observes its own writes. But expecting this to work across CPUs without any barrier is wishful thinking.
I see what you meant...
Firstly I'm actually curious whether that'll really happen if the gfns are collected in something like a for loop:
for (i = 0; i < N; i++)
        collect_dirty_gfn(ring, i);
Since all the gfns to be read will depend on the variable "i", IIUC no reordering should happen, but I'm not really sure, so this is more of a pure question.
'i' has no influence on the write ordering. Each write targets a different address, and there are no inter-write dependencies (this concept doesn't exist other than for writes to the same address), so they can be reordered at will.
If you want a proof of this, head to http://diy.inria.fr/www/ and run the MP.litmus test (which conveniently gives you a reduction of this problem) on both the x86 and AArch64 models. You will see that the reordering isn't allowed on x86, but definitely allowed on arm64.
Besides, the other thing to mention is that I think it is fine if the RESET ioctl doesn't recycle all the gfns that were set to the reset state. Taking the above example of GFNs A-D: if, when reaching the RESET ioctl, only A & D's flags are visible as updated, the ioctl will recycle gfn A but stop at gfn B, assuming B-D are not reset. IMHO that's okay, because it means we reset only part of the gfns rather than all of them, and that's safe to do. The next ring-full event may come earlier because we recycled less, but that's functionally safe to me.
It may be safe, but it isn't what the userspace API promises.
The document says:
After processing one or more entries in the ring buffer, userspace calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that the kernel will reprotect those collected GFNs. Therefore, the ioctl must be called *before* reading the content of the dirty pages.
I'd say it's not an explicit promise, but I think I agree with you that at least it's unclear on the behavior.
Since we have a global recycle mechanism, most likely the app (e.g. current qemu impl) will use the same thread to collect/reset dirty GFNs, and trigger the RESET ioctl(). In that case it's safe, IIUC, because no cross-core ops.
QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
if (total) {
    ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
    assert(ret == total);
}
I think the assert() should never trigger as mentioned above. But ideally maybe it should just be a loop until cleared gfns match total.
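For illustration, the loop Peter describes could look like the hypothetical variant below of the QEMU snippet quoted above (s, ret and total come from that snippet; this is not current QEMU code):

	if (total) {
		long cleared = 0;

		/* Retry until the kernel reports having recycled everything
		 * that was harvested, instead of asserting on the first call. */
		do {
			ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
			assert(ret >= 0);
			cleared += ret;
		} while (cleared < total);
	}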
In other words, without further straightening of the API, this doesn't work as expected on relaxed memory architectures. So before this gets enabled on arm64, this whole ordering issue must be addressed.
How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the possibility of recycling only some of the pages, especially when collection and the ioctl() aren't done from the same thread?
Any suggestions will be greatly welcomed.
Thanks,
On Wed, 24 Aug 2022 17:21:50 +0100, Peter Xu peterx@redhat.com wrote:
On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
On Wed, 24 Aug 2022 00:19:04 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
Atomicity doesn't guarantee ordering, unfortunately.
Right, sorry to be misleading. By "atomicity" I was trying to say that the kernel will always see a consistent update of the fields.

The ordering should also be guaranteed, because things must happen in the sequence below:

  (1) kernel publishes dirty GFN data (slot, offset)
  (2) kernel publishes dirty GFN flag (set to DIRTY)
  (3) user sees DIRTY, collects (slot, offset)
  (4) user sets it to RESET
  (5) kernel reads RESET
Maybe. Maybe not. The reset could well be sitting in the CPU write buffer for as long as it wants and not be seen by the kernel if the read occurs on another CPU. And that's the crucial bit: single-CPU is fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU on collection, and global on reset (this seems like a bad decision, but it is too late to fix this).
Regarding the last statement, that's something I had questions about too and discussed with Paolo, even though at that time it wasn't an outcome of discussing memory ordering issues.
IIUC the initial design was trying to avoid a TLB flush flood when the number of vcpus is large (a per-ring RESET, even for one page, needs all vcpus to flush, so O(N^2) flushes are needed; with a global RESET it's O(N)). So it's kind of a trade-off, and even now I'm not sure which one is better. E.g., with a per-ring reset we could also have locality in userspace, e.g. the vcpu thread might be able to recycle without holding global locks.
I don't get that. On x86, each CPU must perform the TLB invalidation (there is an IPI for that). So whether you do a per-CPU scan of the ring or a global scan is irrelevant: each entry you find in any of the rings must result in a global invalidation, since you've updated the PTE to make the page RO.
The same is true on ARM, except that the broadcast is done in HW instead of being tracked in SW.
But anyway, this is all moot. The API is what it is, and it isn't going to change any time soon. All we can do is add some clarifications to the API for the more relaxed architectures, and make sure the kernel behaves accordingly.
[...]
It may be safe, but it isn't what the userspace API promises.
The document says:
After processing one or more entries in the ring buffer, userspace calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that the kernel will reprotect those collected GFNs. Therefore, the ioctl must be called *before* reading the content of the dirty pages.
I'd say it's not an explicit promise, but I think I agree with you that at least it's unclear on the behavior.
This is the least problematic part of the documentation. The bit I literally choke on is this:
<quote> It's not necessary for userspace to harvest the all dirty GFNs at once. However it must collect the dirty GFNs in sequence, i.e., the userspace program cannot skip one dirty GFN to collect the one next to it. </quote>
This is the core of the issue. Without ordering rules, the consumer on the other side cannot observe the updates correctly, even if userspace follows the above to the letter. Of course, the kernel itself must do the right thing (I guess it amounts to the kernel doing a load-acquire, and userspace doing a store-release -- effectively emulating x86...).
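For illustration, the pairing Marc describes could look like the sketch below on the userspace side, using the struct kvm_dirty_gfn layout from the KVM UAPI and GCC built-ins; the kernel-side counterpart (an smp_load_acquire() on the flags) appears as a concrete patch later in the thread.

#include <linux/kvm.h>	/* struct kvm_dirty_gfn, KVM_DIRTY_GFN_F_RESET */

/* Userspace: publish the RESET flag with release semantics, so that the
 * reads already done on this entry cannot be reordered after the flag
 * update that the kernel will observe. */
static void reset_gfn(struct kvm_dirty_gfn *gfn)
{
	__atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET, __ATOMIC_RELEASE);
}

The kernel would then read the flags with acquire semantics before trusting the rest of the entry, which is what the later dirty_ring.c patch does.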
Since we have a global recycle mechanism, most likely the app (e.g. current qemu impl) will use the same thread to collect/reset dirty GFNs, and trigger the RESET ioctl(). In that case it's safe, IIUC, because no cross-core ops.
QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
if (total) {
    ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
    assert(ret == total);
}
I think the assert() should never trigger as mentioned above. But ideally maybe it should just be a loop until cleared gfns match total.
Right. If userspace calls the ioctl on every vcpu, then things should work correctly. It is only that the overhead is higher than what it should be if multiple vcpus perform a reset at the same time.
In other words, without further straightening of the API, this doesn't work as expected on relaxed memory architectures. So before this gets enabled on arm64, this whole ordering issue must be addressed.
How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the possibility of recycling only some of the pages, especially when collection and the ioctl() aren't done from the same thread?
I'd rather tell people about the ordering rules. That will come at zero cost on x86.
Any suggestions will be greatly welcomed.
I'll write a couple of patches when I get the time, most likely next week. Gavin will hopefully be able to take them as part of his series.
M.
Hi Marc,
On 8/25/22 6:57 AM, Marc Zyngier wrote:
On Wed, 24 Aug 2022 17:21:50 +0100, Peter Xu peterx@redhat.com wrote:
On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
On Wed, 24 Aug 2022 00:19:04 +0100, Peter Xu peterx@redhat.com wrote:
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
Atomicity doesn't guarantee ordering, unfortunately.
Right, sorry to be misleading. By "atomicity" I was trying to say that the kernel will always see a consistent update of the fields.

The ordering should also be guaranteed, because things must happen in the sequence below:

  (1) kernel publishes dirty GFN data (slot, offset)
  (2) kernel publishes dirty GFN flag (set to DIRTY)
  (3) user sees DIRTY, collects (slot, offset)
  (4) user sets it to RESET
  (5) kernel reads RESET
Maybe. Maybe not. The reset could well be sitting in the CPU write buffer for as long as it wants and not be seen by the kernel if the read occurs on another CPU. And that's the crucial bit: single-CPU is fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU on collection, and global on reset (this seems like a bad decision, but it is too late to fix this).
Regarding the last statement, that's something I had questions about too and discussed with Paolo, even though at that time it wasn't an outcome of discussing memory ordering issues.
IIUC the initial design was trying to avoid a TLB flush flood when the number of vcpus is large (a per-ring RESET, even for one page, needs all vcpus to flush, so O(N^2) flushes are needed; with a global RESET it's O(N)). So it's kind of a trade-off, and even now I'm not sure which one is better. E.g., with a per-ring reset we could also have locality in userspace, e.g. the vcpu thread might be able to recycle without holding global locks.
I don't get that. On x86, each CPU must perform the TLB invalidation (there is an IPI for that). So whether you do a per-CPU scan of the ring or a global scan is irrelevant: each entry you find in any of the rings must result in a global invalidation, since you've updated the PTE to make the page RO.
The same is true on ARM, except that the broadcast is done in HW instead of being tracked in SW.
But anyway, this is all moot. The API is what it is, and it isn't going to change any time soon. All we can do is add some clarifications to the API for the more relaxed architectures, and make sure the kernel behaves accordingly.
[...]
It may be safe, but it isn't what the userspace API promises.
The document says:
After processing one or more entries in the ring buffer, userspace calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that the kernel will reprotect those collected GFNs. Therefore, the ioctl must be called *before* reading the content of the dirty pages.
I'd say it's not an explicit promise, but I think I agree with you that at least it's unclear on the behavior.
This is the least problematic part of the documentation. The bit I literally choke on is this:
<quote> It's not necessary for userspace to harvest the all dirty GFNs at once. However it must collect the dirty GFNs in sequence, i.e., the userspace program cannot skip one dirty GFN to collect the one next to it. </quote>
This is the core of the issue. Without ordering rules, the consumer on the other side cannot observe the updates correctly, even if userspace follows the above to the letter. Of course, the kernel itself must do the right thing (I guess it amounts to the kernel doing a load-acquire, and userspace doing a store-release -- effectively emulating x86...).
Since we have a global recycle mechanism, most likely the app (e.g. current qemu impl) will use the same thread to collect/reset dirty GFNs, and trigger the RESET ioctl(). In that case it's safe, IIUC, because no cross-core ops.
QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
if (total) {
    ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
    assert(ret == total);
}
I think the assert() should never trigger as mentioned above. But ideally maybe it should just be a loop until cleared gfns match total.
Right. If userspace calls the ioctl on every vcpu, then things should work correctly. It is only that the overhead is higher than what it should be if multiple vcpus perform a reset at the same time.
In other words, without further straightening of the API, this doesn't work as expected on relaxed memory architectures. So before this gets enabled on arm64, this whole ordering issue must be addressed.
How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the possibility of recycling only some of the pages, especially when collection and the ioctl() aren't done from the same thread?
I'd rather tell people about the ordering rules. That will come at zero cost on x86.
Any suggestions will be greatly welcomed.
I'll write a couple of patches when I get the time, most likely next week. Gavin will hopefully be able to take them as part of his series.
Thanks, Marc. Please let me know where I can check out the patches when they're ready. I can include the patches into this series in next revision :)
Thanks, Gavin
On 8/24/22 00:47, Marc Zyngier wrote:
I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure doesn't need a memory barrier (unlike pushing, where we have an explicit wmb), because we assume the userspace app is not hostile, so logically it should only modify the flags, which is a 32-bit field, assuming atomicity is guaranteed.
Atomicity doesn't guarantee ordering, unfortunately. Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state.
CPU0:
    write_flag(A, KVM_DIRTY_GFN_F_RESET)
    write_flag(B, KVM_DIRTY_GFN_F_RESET)
    write_flag(C, KVM_DIRTY_GFN_F_RESET)
    write_flag(D, KVM_DIRTY_GFN_F_RESET)
    [...]

CPU1:
    ioctl(KVM_RESET_DIRTY_RINGS)
Since CPU0's writes do not have any ordering, CPU1 can observe the writes in a sequence that has nothing to do with program order, and could for example observe that GFNs A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each write was a store-release (which is effectively what you get on x86), it wouldn't be a problem, but nothing in the documentation calls for it.
Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observes its own writes. But expecting this to work across CPUs without any barrier is wishful thinking.
Agreed, but that's a problem for userspace to solve. If userspace wants to reset the fields in different CPUs, it has to synchronize with its own invoking of the ioctl.
That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done after (in the memory-ordering sense) its last write_flag(D, KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no guarantee that the write_flag will have any effect.
The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was because it takes kvm->slots_lock so the execution would be serialized anyway. Turning slots_lock into an rwsem would be even worse because it also takes kvm->mmu_lock (since slots_lock is a mutex, at least two concurrent invocations won't clash with each other on the mmu_lock).
Paolo
On Fri, 26 Aug 2022 11:50:24 +0100, Paolo Bonzini pbonzini@redhat.com wrote:
On 8/24/22 00:47, Marc Zyngier wrote:
I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure doesn't need a memory barrier (unlike pushing, where we have an explicit wmb), because we assume the userspace app is not hostile, so logically it should only modify the flags, which is a 32-bit field, assuming atomicity is guaranteed.
Atomicity doesn't guarantee ordering, unfortunately. Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state.
CPU0:
    write_flag(A, KVM_DIRTY_GFN_F_RESET)
    write_flag(B, KVM_DIRTY_GFN_F_RESET)
    write_flag(C, KVM_DIRTY_GFN_F_RESET)
    write_flag(D, KVM_DIRTY_GFN_F_RESET)
    [...]

CPU1:
    ioctl(KVM_RESET_DIRTY_RINGS)
Since CPU0's writes do not have any ordering, CPU1 can observe the writes in a sequence that has nothing to do with program order, and could for example observe that GFNs A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each write was a store-release (which is effectively what you get on x86), it wouldn't be a problem, but nothing in the documentation calls for it.
Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observes its own writes. But expecting this to work across CPUs without any barrier is wishful thinking.
Agreed, but that's a problem for userspace to solve. If userspace wants to reset the fields in different CPUs, it has to synchronize with its own invoking of the ioctl.
userspace has no choice. It cannot order on its own the reads that the kernel will do to *other* rings.
That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done after (in the memory-ordering sense) its last write_flag(D, KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no guarantee that the write_flag will have any effect.
The problem isn't on CPU0. The problem is that CPU1 does observe inconsistent data on arm64, and I don't think this difference in behaviour is acceptable. Nothing documents this, and there is a baked-in assumption that there is strong ordering between writes as well as between writes and reads.
The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was because it takes kvm->slots_lock so the execution would be serialized anyway. Turning slots_lock into an rwsem would be even worse because it also takes kvm->mmu_lock (since slots_lock is a mutex, at least two concurrent invocations won't clash with each other on the mmu_lock).
Whatever the reason, the behaviour should be identical on all architectures. As it is, it only really works on x86, and I contend this is a bug that needs fixing.
Thankfully, this can be done at zero cost for x86, and at that of a set of load-acquires on other architectures.
M.
On 8/26/22 17:49, Marc Zyngier wrote:
Agreed, but that's a problem for userspace to solve. If userspace wants to reset the fields in different CPUs, it has to synchronize with its own invoking of the ioctl.
userspace has no choice. It cannot order on its own the reads that the kernel will do to *other* rings.
Those reads will never see KVM_DIRTY_GFN_F_RESET in the flags however, if userspace has never interacted with the ring. So there will be exactly one read on those rings, and there's nothing to reorder.
If that's too tricky and you want to add a load-acquire I have no objection though. It also helps avoid read-read reordering between one entry's flags and the next one's, so it's a good idea to have it anyway.
The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was because it takes kvm->slots_lock so the execution would be serialized anyway. Turning slots_lock into an rwsem would be even worse because it also takes kvm->mmu_lock (since slots_lock is a mutex, at least two concurrent invocations won't clash with each other on the mmu_lock).
Whatever the reason, the behaviour should be identical on all architectures. As it is, it only really works on x86, and I contend this is a bug that needs fixing.
Thankfully, this can be done at zero cost for x86, and at that of a set of load-acquires on other architectures.
Yes, the global-ness of the API is orthogonal to the memory ordering issue. I just wanted to explain why a per-vCPU API probably isn't going to work great.
Paolo
On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
Hi Gavin,
On Mon, 22 Aug 2022 02:58:20 +0100, Gavin Shan gshan@redhat.com wrote:
Hi Marc,
On 8/19/22 6:00 PM, Marc Zyngier wrote:
On Fri, 19 Aug 2022 01:55:57 +0100, Gavin Shan gshan@redhat.com wrote:
The ring-based dirty memory tracking has been available and enabled on x86 for a while. The feature is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found from fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").
This enables the ring-based dirty memory tracking on ARM64. It's notable that no extra reserved ring entries are needed on ARM64 because the huge pages are always split into base pages when page dirty tracking is enabled.
Can you please elaborate on this? Adding a per-CPU ring of course results in extra memory allocation, so there must be a subtle x86-specific detail that I'm not aware of...
Sure. I guess it's helpful to explain how it works in the next revision. Something like below:

This enables the ring-based dirty memory tracking on ARM64. The feature is gated by CONFIG_HAVE_KVM_DIRTY_RING and detected and enabled by userspace through the KVM_CAP_DIRTY_LOG_RING capability. A ring buffer is created on every vcpu and each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is pushed to by the host when a page becomes dirty and pulled from by userspace. A vcpu exit is forced when the ring buffer becomes full. The ring buffers on all vcpus can be reset by the ioctl command KVM_RESET_DIRTY_RINGS.
Yes, I think so. Adding a per-CPU ring results in extra memory allocation. However, it avoids synchronization among multiple vcpus when dirty pages are generated on multiple vcpus. More discussion can be found at [1].
Oh, I totally buy the relaxation of the synchronisation (though I doubt this will have any visible effect until we have something like Oliver's patches to allow parallel faulting).
Heh, yeah I need to get that out the door. I'll also note that Gavin's changes are still relevant without that series, as we do write unprotect in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging").
-- Thanks, Oliver
On Tue, 23 Aug 2022 15:44:42 +0100, Oliver Upton oliver.upton@linux.dev wrote:
On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
Hi Gavin,
On Mon, 22 Aug 2022 02:58:20 +0100, Gavin Shan gshan@redhat.com wrote:
Hi Marc,
On 8/19/22 6:00 PM, Marc Zyngier wrote:
On Fri, 19 Aug 2022 01:55:57 +0100, Gavin Shan gshan@redhat.com wrote:
The ring-based dirty memory tracking has been available and enabled on x86 for a while. The feature is beneficial when the number of dirty pages is small in a checkpointing system or live migration scenario. More details can be found from fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").
This enables the ring-based dirty memory tracking on ARM64. It's notable that no extra reserved ring entries are needed on ARM64 because the huge pages are always split into base pages when page dirty tracking is enabled.
Can you please elaborate on this? Adding a per-CPU ring of course results in extra memory allocation, so there must be a subtle x86-specific detail that I'm not aware of...
Sure. I guess it's helpful to explain how it works in the next revision. Something like below:

This enables the ring-based dirty memory tracking on ARM64. The feature is gated by CONFIG_HAVE_KVM_DIRTY_RING and detected and enabled by userspace through the KVM_CAP_DIRTY_LOG_RING capability. A ring buffer is created on every vcpu and each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is pushed to by the host when a page becomes dirty and pulled from by userspace. A vcpu exit is forced when the ring buffer becomes full. The ring buffers on all vcpus can be reset by the ioctl command KVM_RESET_DIRTY_RINGS.
Yes, I think so. Adding a per-CPU ring results in extra memory allocation. However, it avoids synchronization among multiple vcpus when dirty pages are generated on multiple vcpus. More discussion can be found at [1].
Oh, I totally buy the relaxation of the synchronisation (though I doubt this will have any visible effect until we have something like Oliver's patches to allow parallel faulting).
Heh, yeah I need to get that out the door. I'll also note that Gavin's changes are still relevant without that series, as we do write unprotect in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging").
Ah, true. Now if only someone could explain how the whole producer-consumer thing works without a trace of a barrier, that'd be great...
Thanks,
M.
On 8/23/22 22:35, Marc Zyngier wrote:
Heh, yeah I need to get that out the door. I'll also note that Gavin's changes are still relevant without that series, as we do write unprotect in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging").
Ah, true. Now if only someone could explain how the whole producer-consumer thing works without a trace of a barrier, that'd be great...
Do you mean this?
void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
	struct kvm_dirty_gfn *entry;

	/* It should never get full */
	WARN_ON_ONCE(kvm_dirty_ring_full(ring));

	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

	entry->slot = slot;
	entry->offset = offset;
	/*
	 * Make sure the data is filled in before we publish this to
	 * the userspace program.  There's no paired kernel-side reader.
	 */
	smp_wmb();
	kvm_dirty_gfn_set_dirtied(entry);
	ring->dirty_index++;
	trace_kvm_dirty_ring_push(ring, slot, offset);
}
The matching smp_rmb() is in userspace.
Paolo
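For illustration, the userspace half of that wmb/rmb pairing might look like the sketch below (GCC built-ins; an acquire load is used in place of a plain load plus read barrier, which Paolo also suggests later in the thread). The struct and flag names come from the KVM UAPI; the helper itself is hypothetical:

#include <stdbool.h>
#include <linux/kvm.h>	/* struct kvm_dirty_gfn, KVM_DIRTY_GFN_F_DIRTY */

/* Consumer: only read slot/offset once the entry is seen as dirty, and make
 * sure the flag read is ordered before the data reads. */
static bool collect_gfn(struct kvm_dirty_gfn *gfn, __u32 *slot, __u64 *offset)
{
	if (!(__atomic_load_n(&gfn->flags, __ATOMIC_ACQUIRE) &
	      KVM_DIRTY_GFN_F_DIRTY))
		return false;

	*slot = gfn->slot;
	*offset = gfn->offset;
	return true;
}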
On Fri, 26 Aug 2022 11:58:08 +0100, Paolo Bonzini pbonzini@redhat.com wrote:
On 8/23/22 22:35, Marc Zyngier wrote:
Heh, yeah I need to get that out the door. I'll also note that Gavin's changes are still relevant without that series, as we do write unprotect in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging").
Ah, true. Now if only someone could explain how the whole producer-consumer thing works without a trace of a barrier, that'd be great...
Do you mean this?
void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
Of course not. I mean this:
static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
{
	unsigned long i;
	struct kvm_vcpu *vcpu;
	int cleared = 0;

	if (!kvm->dirty_ring_size)
		return -EINVAL;

	mutex_lock(&kvm->slots_lock);

	kvm_for_each_vcpu(i, vcpu, kvm)
		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
	[...]
}
and this
int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
{
	u32 cur_slot, next_slot;
	u64 cur_offset, next_offset;
	unsigned long mask;
	int count = 0;
	struct kvm_dirty_gfn *entry;
	bool first_round = true;

	/* This is only needed to make compilers happy */
	cur_slot = cur_offset = mask = 0;

	while (true) {
		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];

		if (!kvm_dirty_gfn_harvested(entry))
			break;
		[...]
}
which provides no ordering whatsoever when a ring is updated from one CPU and reset from another.
M.
On Fri, Aug 26, 2022 at 04:28:51PM +0100, Marc Zyngier wrote:
On Fri, 26 Aug 2022 11:58:08 +0100, Paolo Bonzini pbonzini@redhat.com wrote:
On 8/23/22 22:35, Marc Zyngier wrote:
Heh, yeah I need to get that out the door. I'll also note that Gavin's changes are still relevant without that series, as we do write unprotect in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging").
Ah, true. Now if only someone could explain how the whole producer-consumer thing works without a trace of a barrier, that'd be great...
Do you mean this?
void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
Of course not. I mean this:
static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
{
	unsigned long i;
	struct kvm_vcpu *vcpu;
	int cleared = 0;

	if (!kvm->dirty_ring_size)
		return -EINVAL;

	mutex_lock(&kvm->slots_lock);

	kvm_for_each_vcpu(i, vcpu, kvm)
		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
	[...]
}
and this
int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
{
	u32 cur_slot, next_slot;
	u64 cur_offset, next_offset;
	unsigned long mask;
	int count = 0;
	struct kvm_dirty_gfn *entry;
	bool first_round = true;

	/* This is only needed to make compilers happy */
	cur_slot = cur_offset = mask = 0;

	while (true) {
		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];

		if (!kvm_dirty_gfn_harvested(entry))
			break;
		[...]
}
which provides no ordering whatsoever when a ring is updated from one CPU and reset from another.
Marc,
I thought we wouldn't hit this as long as we properly take care of the other orderings of (a) gfn push and (b) gfn collect, but on second thought I think it's indeed logically possible that, with a reversed ordering here, we can read some garbage gfn before (a) happens but also read the valid flag after (b).
It seems we must always have all the barriers correctly applied. If that's correct, do you perhaps mean something like this, to just add the last missing barrier?
===8<===
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..ea620bfb012d 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)

 static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
 {
-	return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
 }

 int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
===8<===
Thanks,
On 8/30/22 16:42, Peter Xu wrote:
Marc,
I thought we wouldn't hit this as long as we properly take care of the other orderings of (a) gfn push and (b) gfn collect, but on second thought I think it's indeed logically possible that, with a reversed ordering here, we can read some garbage gfn before (a) happens but also read the valid flag after (b).
It seems we must always have all the barriers correctly applied. If that's correct, do you perhaps mean something like this, to just add the last missing barrier?
Okay, so I thought about it some more and it's quite tricky.
Strictly speaking, the synchronization is just between userspace and kernel. The fact that the actual producer of dirty pages is in another CPU is a red herring, because reset only cares about harvested pages.
In other words, the dirty page ring is essentially two ring buffers in one and we only care about the "harvested ring", not the "produced ring".
On the other hand, it may happen that userspace has set more RESET flags while the ioctl is ongoing:
CPU0                              CPU1                      CPU2
                                                            fill gfn0
                                                            store-rel flags for gfn0
                                                            fill gfn1
                                                            store-rel flags for gfn1
load-acq flags for gfn0
set RESET for gfn0
load-acq flags for gfn1
set RESET for gfn1
do ioctl! ----------->            ioctl(RESET_RINGS)
                                                            fill gfn2
                                                            store-rel flags for gfn2
load-acq flags for gfn2
set RESET for gfn2
                                  process gfn0
                                  process gfn1
                                  process gfn2
do ioctl!
etc.
The three load-acquire in CPU0 synchronize with the three store-release in CPU2, but CPU0 and CPU1 are only synchronized up to gfn1 and CPU1 may miss gfn2's fields other than flags.
The kernel must be able to cope with invalid values of the fields, and userspace will invoke the ioctl once more. However, once the RESET flag is cleared on gfn2, it is lost forever, therefore in the above scenario CPU1 must read the correct value of gfn2's fields.
Therefore RESET must be set with a store-release, that will synchronize with a load-acquire in CPU1 as you suggested.
Paolo
In vcpu_map_dirty_ring(), the guest's page size is used to figure out the offset in the virtual area. It works fine when host and guest have the same page size. However, it fails when the page sizes on host and guest are different, as the error messages below indicate. Actually, the offset should be figured out according to the host's page size; otherwise, the virtual area associated with the ring buffer can't be identified by the host.
# ./dirty_log_test -M dirty-ring -m 7
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
guest physical test memory offset: 0xffbffc0000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
==== Test Assertion Failure ====
  lib/kvm_util.c:1477: addr == MAP_FAILED
  pid=9000 tid=9000 errno=0 - Success
     1 0x0000000000405f5b: vcpu_map_dirty_ring at kvm_util.c:1477
     2 0x0000000000402ebb: dirty_ring_collect_dirty_pages at dirty_log_test.c:349
     3 0x00000000004029b3: log_mode_collect_dirty_pages at dirty_log_test.c:478
     4  (inlined by) run_test at dirty_log_test.c:778
     5  (inlined by) run_test at dirty_log_test.c:691
     6 0x0000000000403a57: for_each_guest_mode at guest_modes.c:105
     7 0x0000000000401ccf: main at dirty_log_test.c:921
     8 0x0000ffffb06ec79b: ?? ??:0
     9 0x0000ffffb06ec86b: ?? ??:0
    10 0x0000000000401def: _start at ??:?
  Dirty ring mapped private
Fix the issue by using host's page size to map the ring buffer.
Signed-off-by: Gavin Shan gshan@redhat.com --- tools/testing/selftests/kvm/lib/kvm_util.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 9889fe0d8919..4e823cbe6b48 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1464,7 +1464,7 @@ struct kvm_reg_list *vcpu_get_reg_list(struct kvm_vcpu *vcpu)

 void *vcpu_map_dirty_ring(struct kvm_vcpu *vcpu)
 {
-	uint32_t page_size = vcpu->vm->page_size;
+	uint32_t page_size = getpagesize();
	uint32_t size = vcpu->vm->dirty_ring_size;

	TEST_ASSERT(size > 0, "Should enable dirty ring first");
It's assumed that 1024 host pages, instead of guest pages, are dirtied in each iteration in guest_code(). The current implementation misses the case of mismatched page sizes on host and guest. For example, ARM64 could have a 64KB page size in the guest, but a 4KB page size in the host. In that case, (TEST_PAGES_PER_LOOP / 16) host pages, instead of TEST_PAGES_PER_LOOP, are dirtied in every iteration.
Fix the issue by touching all sub-pages when we have mismatched page sizes in host and guest.
Signed-off-by: Gavin Shan gshan@redhat.com --- tools/testing/selftests/kvm/dirty_log_test.c | 50 +++++++++++++++----- 1 file changed, 39 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c index 9c883c94d478..50b02186ce12 100644 --- a/tools/testing/selftests/kvm/dirty_log_test.c +++ b/tools/testing/selftests/kvm/dirty_log_test.c @@ -70,6 +70,7 @@ * that may change. */ static uint64_t host_page_size; +static uint64_t host_num_pages; static uint64_t guest_page_size; static uint64_t guest_num_pages; static uint64_t random_array[TEST_PAGES_PER_LOOP]; @@ -94,8 +95,23 @@ static uint64_t guest_test_virt_mem = DEFAULT_GUEST_TEST_MEM; */ static void guest_code(void) { + uint64_t num_pages, page_size, sub_page_size; uint64_t addr; - int i; + int pages_per_loop, i, j; + + /* + * The page sizes on host and VM could be different. We need + * to perform writing on all sub-pages. + */ + if (host_page_size >= guest_page_size) { + num_pages = host_num_pages; + page_size = host_page_size; + sub_page_size = host_page_size; + } else { + num_pages = guest_num_pages; + page_size = guest_page_size; + sub_page_size = host_page_size; + }
/* * On s390x, all pages of a 1M segment are initially marked as dirty @@ -103,18 +119,29 @@ static void guest_code(void) * To compensate this specialty in this test, we need to touch all * pages during the first iteration. */ - for (i = 0; i < guest_num_pages; i++) { - addr = guest_test_virt_mem + i * guest_page_size; - *(uint64_t *)addr = READ_ONCE(iteration); + for (i = 0; i < num_pages; i++) { + addr = guest_test_virt_mem + i * page_size; + addr = align_down(addr, page_size); + + for (j = 0; j < page_size / sub_page_size; j++) { + *(uint64_t *)(addr + j * sub_page_size) = + READ_ONCE(iteration); + } }
+ pages_per_loop = (TEST_PAGES_PER_LOOP * sub_page_size) / page_size; + while (true) { - for (i = 0; i < TEST_PAGES_PER_LOOP; i++) { + for (i = 0; i < pages_per_loop; i++) { addr = guest_test_virt_mem; - addr += (READ_ONCE(random_array[i]) % guest_num_pages) - * guest_page_size; - addr = align_down(addr, host_page_size); - *(uint64_t *)addr = READ_ONCE(iteration); + addr += (READ_ONCE(random_array[i]) % num_pages) + * page_size; + addr = align_down(addr, page_size); + + for (j = 0; j < page_size / sub_page_size; j++) { + *(uint64_t *)(addr + j * sub_page_size) = + READ_ONCE(iteration); + } }
/* Tell the host that we need more random numbers */ @@ -713,14 +740,14 @@ static void run_test(enum vm_guest_mode mode, void *arg) 2ul << (DIRTY_MEM_BITS - PAGE_SHIFT_4K), guest_code);
guest_page_size = vm->page_size; + host_page_size = getpagesize(); + /* * A little more than 1G of guest page sized pages. Cover the * case where the size is not aligned to 64 pages. */ guest_num_pages = (1ul << (DIRTY_MEM_BITS - vm->page_shift)) + 3; guest_num_pages = vm_adjust_num_guest_pages(mode, guest_num_pages); - - host_page_size = getpagesize(); host_num_pages = vm_num_host_pages(mode, guest_num_pages);
if (!p->phys_offset) { @@ -760,6 +787,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) sync_global_to_guest(vm, host_page_size); sync_global_to_guest(vm, guest_page_size); sync_global_to_guest(vm, guest_test_virt_mem); + sync_global_to_guest(vm, host_num_pages); sync_global_to_guest(vm, guest_num_pages);
/* Start the iterations */
On Fri, Aug 19, 2022 at 08:55:59AM +0800, Gavin Shan wrote:
It's assumed that 1024 host pages, instead of guest pages, are dirtied in each iteration in guest_code(). The current implementation misses the case of mismatched page sizes on host and guest. For example, ARM64 could have a 64KB page size in the guest, but a 4KB page size in the host. In that case, (TEST_PAGES_PER_LOOP / 16) host pages, instead of TEST_PAGES_PER_LOOP, are dirtied in every iteration.
Fix the issue by touching all sub-pages when we have mismatched page sizes in host and guest.
I'll let the dirty-log test authors decide what's best to do for this test, but I'd think we should let the guest continue dirtying its pages without knowledge of the host pages. Then, adjust the host test code to assert all sub-pages, other than the ones it expects the guest to have written, remain untouched.
Thanks, drew
Hi Drew,
On 8/19/22 3:28 PM, Andrew Jones wrote:
On Fri, Aug 19, 2022 at 08:55:59AM +0800, Gavin Shan wrote:
It's assumed that 1024 host pages, instead of guest pages, are dirtied in each iteration in guest_code(). The current implementation misses the case of mismatched page sizes on host and guest. For example, ARM64 could have a 64KB page size in the guest, but a 4KB page size in the host. In that case, (TEST_PAGES_PER_LOOP / 16) host pages, instead of TEST_PAGES_PER_LOOP, are dirtied in every iteration.
Fix the issue by touching all sub-pages when we have mismatched page sizes in host and guest.
I'll let the dirty-log test authors decide what's best to do for this test, but I'd think we should let the guest continue dirtying its pages without knowledge of the host pages. Then, adjust the host test code to assert all sub-pages, other than the ones it expects the guest to have written, remain untouched.
I don't think what is stated in the changelog is correct. The current implementation already has the logic to handle mismatched page sizes in vm_dirty_log_verify(), where 'step' is used for it, its value being fetched from vm_num_host_pages(mode, 1). Please ignore this patch for now, as explained below.

The issue I have is that 'dirty_log_test' hangs when I have a 4KB host page size and a 64KB guest page size. It seems the vcpu doesn't exit due to a full ring buffer or a kick. I will investigate further to figure out the root cause.
# ./dirty_log_test -M dirty-ring -m 7
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
guest physical test memory offset: 0xffbffc0000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
Iteration 1 collected 1903 pages
<no more output>
'dirty_log_test' works well when both host and guest have a 4KB page size.
# ./dirty_log_test -M dirty-ring -m 5
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
guest physical test memory offset: 0xffbfffc000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
  :
Dirtied 1006592 pages
Total bits checked: dirty (1020487), clear (7106070), track_next (974104)
Thanks, Gavin
Hi Drew,
On 8/22/22 4:29 PM, Gavin Shan wrote:
On 8/19/22 3:28 PM, Andrew Jones wrote:
On Fri, Aug 19, 2022 at 08:55:59AM +0800, Gavin Shan wrote:
It's assumed that 1024 host pages, instead of guest pages, are dirtied in each iteration in guest_code(). The current implementation misses the case of mismatched page sizes on host and guest. For example, ARM64 could have a 64KB page size in the guest, but a 4KB page size in the host. In that case, (TEST_PAGES_PER_LOOP / 16) host pages, instead of TEST_PAGES_PER_LOOP, are dirtied in every iteration.
Fix the issue by touching all sub-pages when we have mismatched page sizes in host and guest.
I'll let the dirty-log test authors decide what's best to do for this test, but I'd think we should let the guest continue dirtying its pages without knowledge of the host pages. Then, adjust the host test code to assert all sub-pages, other than the ones it expects the guest to have written, remain untouched.
I don't think what is stated in the changelog is correct. The current implementation already has the logic to handle mismatched page sizes in vm_dirty_log_verify(), where 'step' is used for it, its value being fetched from vm_num_host_pages(mode, 1). Please ignore this patch for now, as explained below.

The issue I have is that 'dirty_log_test' hangs when I have a 4KB host page size and a 64KB guest page size. It seems the vcpu doesn't exit due to a full ring buffer or a kick. I will investigate further to figure out the root cause.
[...]
Please ignore this PATCH[3/5]. I think this should be fixed by selecting the correct dirty ring count, and the fix will be folded into PATCH[5/5] in the next revision.

In dirty_log_test, we have 1GB of memory for the guest to write and dirty. When we have mismatched page sizes on host and guest, which is either 4KB-host/64KB-guest or 64KB-host/4KB-guest (apart from the 16KB case), 16384 host pages are dirtied in each iteration. The default dirty ring count is 65536, so the vcpu never exits due to the full-dirty-ring-buffer state. This leaves the guest code running while the dirty log isn't collected by the main thread.
#define TEST_DIRTY_RING_COUNT 65536
dirty_pages_per_iteration = (0x40000000 / 0x10000) = 0x4000 = 16384
Thanks, Gavin
There are two pieces of state which need to be cleared before the next mode is executed. Otherwise, we hit failures like the following messages indicate.

- The variable 'dirty_ring_vcpu_ring_full', shared by the main and vcpu threads. It indicates whether the vcpu exited due to a full ring buffer. The value can be carried over from the previous mode (VM_MODE_P40V48_4K) to the current one (VM_MODE_P40V48_64K) when VM_MODE_P40V48_16K isn't supported.

- The current ring buffer index needs to be reset before the next mode (VM_MODE_P40V48_64K) is executed. Otherwise, the stale value is carried over from the previous mode (VM_MODE_P40V48_4K).
# ./dirty_log_test -M dirty-ring
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
guest physical test memory offset: 0xffbfffc000
  :
Dirtied 995328 pages
Total bits checked: dirty (1012434), clear (7114123), track_next (966700)
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
guest physical test memory offset: 0xffbffc0000
vcpu stops because vcpu is kicked out...
vcpu continues now.
Notifying vcpu to continue
Iteration 1 collected 0 pages
vcpu stops because dirty ring is full...
vcpu continues now.
vcpu stops because dirty ring is full...
vcpu continues now.
vcpu stops because dirty ring is full...
==== Test Assertion Failure ====
  dirty_log_test.c:369: cleared == count
  pid=10541 tid=10541 errno=22 - Invalid argument
     1 0x0000000000403087: dirty_ring_collect_dirty_pages at dirty_log_test.c:369
     2 0x0000000000402a0b: log_mode_collect_dirty_pages at dirty_log_test.c:492
     3  (inlined by) run_test at dirty_log_test.c:795
     4  (inlined by) run_test at dirty_log_test.c:705
     5 0x0000000000403a37: for_each_guest_mode at guest_modes.c:100
     6 0x0000000000401ccf: main at dirty_log_test.c:938
     7 0x0000ffff9ecd279b: ?? ??:0
     8 0x0000ffff9ecd286b: ?? ??:0
     9 0x0000000000401def: _start at ??:?
  Reset dirty pages (0) mismatch with collected (35566)
Fix the issues by clearing 'dirty_ring_vcpu_ring_full' and the ring buffer index before a new mode is executed.
Signed-off-by: Gavin Shan gshan@redhat.com --- tools/testing/selftests/kvm/dirty_log_test.c | 27 ++++++++++++-------- 1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c index 50b02186ce12..450e97d10de7 100644 --- a/tools/testing/selftests/kvm/dirty_log_test.c +++ b/tools/testing/selftests/kvm/dirty_log_test.c @@ -252,13 +252,15 @@ static void clear_log_create_vm_done(struct kvm_vm *vm) }
static void dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, - void *bitmap, uint32_t num_pages) + void *bitmap, uint32_t num_pages, + uint32_t *unused) { kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap); }
static void clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, - void *bitmap, uint32_t num_pages) + void *bitmap, uint32_t num_pages, + uint32_t *unused) { kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap); kvm_vm_clear_dirty_log(vcpu->vm, slot, bitmap, 0, num_pages); @@ -354,10 +356,9 @@ static void dirty_ring_continue_vcpu(void) }
static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, - void *bitmap, uint32_t num_pages) + void *bitmap, uint32_t num_pages, + uint32_t *ring_buf_idx) { - /* We only have one vcpu */ - static uint32_t fetch_index = 0; uint32_t count = 0, cleared; bool continued_vcpu = false;
@@ -374,7 +375,8 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
/* Only have one vcpu */ count = dirty_ring_collect_one(vcpu_map_dirty_ring(vcpu), - slot, bitmap, num_pages, &fetch_index); + slot, bitmap, num_pages, + ring_buf_idx);
cleared = kvm_vm_reset_dirty_ring(vcpu->vm);
@@ -431,7 +433,8 @@ struct log_mode { void (*create_vm_done)(struct kvm_vm *vm); /* Hook to collect the dirty pages into the bitmap provided */ void (*collect_dirty_pages) (struct kvm_vcpu *vcpu, int slot, - void *bitmap, uint32_t num_pages); + void *bitmap, uint32_t num_pages, + uint32_t *ring_buf_idx); /* Hook to call when after each vcpu run */ void (*after_vcpu_run)(struct kvm_vcpu *vcpu, int ret, int err); void (*before_vcpu_join) (void); @@ -496,13 +499,14 @@ static void log_mode_create_vm_done(struct kvm_vm *vm) }
static void log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, - void *bitmap, uint32_t num_pages) + void *bitmap, uint32_t num_pages, + uint32_t *ring_buf_idx) { struct log_mode *mode = &log_modes[host_log_mode];
TEST_ASSERT(mode->collect_dirty_pages != NULL, "collect_dirty_pages() is required for any log mode!"); - mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages); + mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages, ring_buf_idx); }
static void log_mode_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err) @@ -721,6 +725,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) struct kvm_vcpu *vcpu; struct kvm_vm *vm; unsigned long *bmap; + uint32_t ring_buf_idx = 0;
if (!log_mode_supported()) { print_skip("Log mode '%s' not supported", @@ -797,6 +802,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) host_dirty_count = 0; host_clear_count = 0; host_track_next_count = 0; + WRITE_ONCE(dirty_ring_vcpu_ring_full, false);
pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu);
@@ -804,7 +810,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) /* Give the vcpu thread some time to dirty some pages */ usleep(p->interval * 1000); log_mode_collect_dirty_pages(vcpu, TEST_MEM_SLOT_INDEX, - bmap, host_num_pages); + bmap, host_num_pages, + &ring_buf_idx);
/* * See vcpu_sync_stop_requested definition for details on why
In the dirty ring case, we rely on a vcpu exit due to the full dirty ring state. On an ARM64 system, there are 4096 host pages when the host page size is 64KB. In this case, the vcpu never exits due to the full dirty ring state. The vcpu keeps overwriting the same set of pages, but the dirty page information isn't collected by the main thread. This leads to an infinite loop, as the following log shows.
# ./dirty_log_test -M dirty-ring -c 65536 -m 5
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
guest physical test memory offset: 0xffbffe0000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
Iteration 1 collected 576 pages
<No more output afterwards>
Fix the issue by automatically choosing the best dirty ring size, to ensure a vcpu exit due to the full dirty ring state. The option '-c' now provides a hint for the ring size instead of its exact value.
Signed-off-by: Gavin Shan gshan@redhat.com --- tools/testing/selftests/kvm/dirty_log_test.c | 24 ++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c index 450e97d10de7..ad31b6e3fe6a 100644 --- a/tools/testing/selftests/kvm/dirty_log_test.c +++ b/tools/testing/selftests/kvm/dirty_log_test.c @@ -23,6 +23,9 @@ #include "guest_modes.h" #include "processor.h"
+#define DIRTY_MEM_BITS 30 /* 1G */ +#define PAGE_SHIFT_4K 12 + /* The memory slot index to track dirty pages */ #define TEST_MEM_SLOT_INDEX 1
@@ -298,6 +301,22 @@ static bool dirty_ring_supported(void)
static void dirty_ring_create_vm_done(struct kvm_vm *vm) { + uint64_t pages; + uint32_t limit; + + /* + * We rely on VM_EXIT due to full dirty ring state. Adjust + * the ring buffer size to ensure we're able to reach the + * full dirty ring state. + */ + pages = (1ul << (DIRTY_MEM_BITS - vm->page_shift)) + 3; + pages = vm_adjust_num_guest_pages(vm->mode, pages); + pages = vm_num_host_pages(vm->mode, pages); + + limit = 1 << (31 - __builtin_clz(pages)); + test_dirty_ring_count = 1 << (31 - __builtin_clz(test_dirty_ring_count)); + test_dirty_ring_count = min(limit, test_dirty_ring_count); + /* * Switch to dirty ring mode after VM creation but before any * of the vcpu creation. @@ -710,9 +729,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, struct kvm_vcpu **vcpu, return vm; }
-#define DIRTY_MEM_BITS 30 /* 1G */ -#define PAGE_SHIFT_4K 12 - struct test_params { unsigned long iterations; unsigned long interval; @@ -856,7 +872,7 @@ static void help(char *name) printf("usage: %s [-h] [-i iterations] [-I interval] " "[-p offset] [-m mode]\n", name); puts(""); - printf(" -c: specify dirty ring size, in number of entries\n"); + printf(" -c: hint to dirty ring size, in number of entries\n"); printf(" (only useful for dirty-ring test; default: %"PRIu32")\n", TEST_DIRTY_RING_COUNT); printf(" -i: specify iteration counts (default: %"PRIu64")\n",
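To make the clamping in dirty_ring_create_vm_done() above concrete, a worked example (assuming 64KB pages on both host and guest, so vm->page_shift is 16; the page count may be nudged up slightly by vm_adjust_num_guest_pages()):

    pages = (1ul << (30 - 16)) + 3                    ~= 16387 host pages of test memory
    limit = 1 << (31 - __builtin_clz(16387))           = 16384
    test_dirty_ring_count = min(16384, 65536)          = 16384 entries

Since at most roughly 16K host pages can be pushed into the ring between two collections, the default 65536-entry ring could never fill; clamped to 16384 entries, the ring can reach the full state and force the KVM_EXIT_DIRTY_RING_FULL exit.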