On 19 April 2024 19:40:06 BST, David Woodhouse dwmw2@infradead.org wrote:
On 19 April 2024 18:13:16 BST, "Chen, Zide" zide.chen@intel.com wrote:
I'm wondering what's the underlying theory by which we can definitely achieve ±1ns accuracy. I tested it on a Sapphire Rapids @2100MHz TSC frequency, and I can see delta_corrected=2 in ~2% of cases.
Hm. Thanks for testing!
So the KVM clock is based on the guest TSC. Given the delta between the guest TSC value T and a reference TSC value R, the KVM clock is expressed as a(T-R)+r, where little r is the value of the KVM clock when the guest TSC was R, and a is the rate at which the KVM clock advances per guest TSC tick.
When setting the clock with KVM_SET_CLOCK_GUEST, we change the values of R and r to a new point in time. Call the new ones Q and q respectively.
But we calculate precisely (within 1ns at least) what the KVM clock would have been with the *old* formula, and adjust our new offset (q) so that at our new reference TSC value Q, the formulae give exactly the same result.
And because the *rates* are the same, they should continue to give the same results, ±1ns.
Or such *was* my theory, at least.
Would be interesting to see it disproven with actual numbers for the old+new pvclock structs, so I can understand where the logic goes wrong.
Were you using frequency scaling?
Oh, also please could you test the updated version I posted yesterday, from https://git.infradead.org/?p=users/dwmw2/linux.git%3Ba=shortlog%3Bh=refs/hea...
On 4/19/2024 11:43 AM, David Woodhouse wrote:
I failed to check out your branch; instead I downloaded the patch series from: https://lore.kernel.org/linux-kselftest/FABCFBD0-4B76-4662-9F7B-7E1A856BBBB6...
However, the selftest hangs:
[Apr19 16:15] kselftest: Running tests in kvm
[Apr19 16:16] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ +0.000628] rcu: 78-...0: (1 GPs behind) idle=3c8c/1/0x4000000000000000 softirq=5908/5913 fqs=14025
[ +0.000468] rcu: (detected by 104, t=60003 jiffies, g=60073, q=3100 ncpus=128)
[ +0.000389] Sending NMI from CPU 104 to CPUs 78:
[ +0.000360] NMI backtrace for cpu 78
[ +0.000004] CPU: 78 PID: 33515 Comm: pvclock_test Tainted: G O 6.9.0-rc1zide-l0+ #194
[ +0.000003] Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01 08/18/2023
[ +0.000002] RIP: 0010:pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000079] Code: ea 83 e1 40 48 0f 45 c2 31 d2 48 3d 00 94 35 77 76 0e 48 d1 e8 83 ea 01 48 3d 00 94 35 77 77 f2 48 3d 00 ca 9a 3b 89 c1 77 0d <01> c9 83 c2 01 81 f9 00 ca 9a 3b 76 f3 88 93 8c 95 00 00 31 c0 ba
[ +0.000002] RSP: 0018:ff368a58cfe07e30 EFLAGS: 00000087
[ +0.000002] RAX: 0000000000000000 RBX: ff368a58e0ccd000 RCX: 0000000000000000
[ +0.000001] RDX: 000000005ca49a49 RSI: 00000000000029aa RDI: 0000019ee77a1c00
[ +0.000002] RBP: ff368a58cfe07e50 R08: 0000000000000001 R09: 0000000000000000
[ +0.000000] R10: ff26383d853ab400 R11: 0000000000000002 R12: 0000000000000000
[ +0.000001] R13: ff368a58e0cd6400 R14: 0000000000000293 R15: ff368a58e0cd69f0
[ +0.000001] FS: 00007f6946473740(0000) GS:ff26384c7fb80000(0000) knlGS:0000000000000000
[ +0.000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000001] CR2: 00007f69463bd445 CR3: 000000016f466006 CR4: 0000000000f71ef0
[ +0.000001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ +0.000000] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ +0.000001] PKRU: 55555554
[ +0.000001] Call Trace:
[ +0.000004] <NMI>
[ +0.000003] ? nmi_cpu_backtrace+0x87/0xf0
[ +0.000008] ? nmi_cpu_backtrace_handler+0x11/0x20
[ +0.000005] ? nmi_handle+0x5f/0x170
[ +0.000005] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000045] ? default_do_nmi+0x79/0x1a0
[ +0.000004] ? exc_nmi+0xf0/0x130
[ +0.000001] ? end_repeat_nmi+0xf/0x53
[ +0.000006] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000041] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000040] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000039] </NMI>
[ +0.000000] <TASK>
[ +0.000001] ? preempt_count_add+0x73/0xa0
[ +0.000004] kvm_arch_init_vm+0xf1/0x1e0 [kvm]
[ +0.000049] kvm_create_vm+0x370/0x650 [kvm]
[ +0.000036] kvm_dev_ioctl+0x88/0x180 [kvm]
[ +0.000034] __x64_sys_ioctl+0x8e/0xd0
[ +0.000007] do_syscall_64+0x5b/0x120
[ +0.000003] entry_SYSCALL_64_after_hwframe+0x6c/0x74
[ +0.000003] RIP: 0033:0x7f694631a94f
[ +0.000002] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[ +0.000001] RSP: 002b:00007ffca91b2e50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000002] RAX: ffffffffffffffda RBX: 0000000000434480 RCX: 00007f694631a94f
[ +0.000001] RDX: 0000000000000000 RSI: 000000000000ae01 RDI: 0000000000000005
[ +0.000000] RBP: 0000000000000009 R08: 000000000041b198 R09: 000000000041bfbf
[ +0.000001] R10: 00007f69463d8882 R11: 0000000000000246 R12: 0000000000434480
[ +0.000000] R13: 000000000041e0f0 R14: 0000000000001000 R15: 0000000000000207
[ +0.000002] </TASK>
On 20 April 2024 00:54:05 BST, "Chen, Zide" zide.chen@intel.com wrote:
Odd. It locks up in kvm_arch_init_vm(). Maybe when I get back to my desk something will be obvious. But please could I have your .config?
If you're able to bisect and see which patch causes that, it would also be much appreciated. Thanks!
On Fri, 2024-04-19 at 16:54 -0700, Chen, Zide wrote:
However, the selftest hangs:
Ah, kvm_get_time_scale() doesn't much like being asked to scale to zero.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a07b60351894..45fb99986cf9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3046,7 +3046,8 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	 * Copy from the field protected solely by ka->tsc_write_lock,
 	 * to the field protected by the ka->pvclock_sc seqlock.
 	 */
-	ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
+	ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio ? :
+		kvm_caps.default_tsc_scaling_ratio;
 
 	/*
 	 * Calculate the scaling factors precisely the same way
On 4/20/2024 9:03 AM, David Woodhouse wrote:
	/*
	 * Calculate the scaling factors precisely the same way
	 * that kvm_guest_time_update() does.
	 */
	last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000, ka->last_tsc_scaling_ratio);
Should be ka->master_tsc_scaling_ratio?
If I restored the KVM_REQ_GLOBAL_CLOCK_UPDATE request from kvm_arch_vcpu_load(), the selftest works for me. I ran the test for 1000+ iterations, with and without TSC scaling, and the TEST_ASSERT(delta_corrected <= ±1) never got hit. This is awesome!
However, without KVM_REQ_GLOBAL_CLOCK_UPDATE, it still fails on creating a VM. Maybe the init sequence still needs some rework.
BUG: unable to handle page fault for address: 005b29e3f221ccf0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 86 PID: 4118 Comm: pvclock_test Tainted
Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01 08/18/2023
RIP: 0010:start_creating+0x80/0x190
Code: ce ad 48 c7 c6 70 a1 ce ad 48 c7 c7 80 1c 9b ab e8 b5 10 d5 ff 4c 63 e0 45 85 e4 0f 85 cd 00 00 00 48 85 db 0f 84 b5 00 00 00 <48> 8b 43 30 48 8d b8 b8 >
RSP: 0018:ff786eaacf3cfdd0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 005b29e3f221ccc0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffadcea170 RDI: 0000000000000000
RBP: ffffffffc06ac8cf R08: ffffffffa6ea0fe0 R09: ffffffffc06a5940
R10: ff786eaacf3cfe30 R11: 00000013a7b5feaa R12: 0000000000000000
R13: 0000000000000124 R14: ff786eaacfa11000 R15: 00000000000081a4
FS: 00007f0837c89740(0000) GS:ff4f44b6bfd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0
CR2: 005b29e3f221ccf0 CR3: 000000014bdf8002 CR4: 0000000000f73ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die+0x24/0x70
 ? page_fault_oops+0x81/0x150
 ? do_user_addr_fault+0x64/0x6c0
 ? exc_page_fault+0x8a/0x1a0
 ? asm_exc_page_fault+0x26/0x30
 ? start_creating+0x80/0x190
 __debugfs_create_file+0x43/0x1f0
 kvm_create_vm_debugfs+0x28b/0x2d0 [kvm]
 kvm_create_vm+0x457/0x650 [kvm]
 kvm_dev_ioctl+0x88/0x180 [kvm]
 __x64_sys_ioctl+0x8e/0xd0
 do_syscall_64+0x5b/0x120
 entry_SYSCALL_64_after_hwframe+0x71/0x79
RIP: 0033:0x7f0837b1a94f
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff >
RSP: 002b:00007ffe01be3fc0 EFLAGS: 00000246 ORIG_RAX
RAX: ffffffffffffffda RBX: 0000000000434480 RCX: 00007f0837b1a94f
RDX: 0000000000000000 RSI: 000000000000ae01 RDI: 0000000000000005
RBP: 0000000000000009 R08: 000000000041b1a0 R09: 000000000041bfcf
R10: 00007f0837bd8882 R11: 0000000000000246 R12: 0000000000434480
R13: 000000000041e0f0 R14: 0000000000001000 R15: 0000000000000207
 </TASK>
Modules linked in: kvm_intel(O) kvm(O) [last unloaded: kvm(O)]
CR2: 005b29e3f221ccf0
On Mon, 2024-04-22 at 15:02 -0700, Chen, Zide wrote:
Should be ka->master_tsc_scaling_ratio?
Oops, yes. I'll actually do some proper testing on a host with TSC scaling this week. Thanks.
If I restored the KVM_REQ_GLOBAL_CLOCK_UPDATE request from kvm_arch_vcpu_load(), the selftest works for me. I ran the test for 1000+ iterations, with and without TSC scaling, and the TEST_ASSERT(delta_corrected <= ±1) never got hit. This is awesome!
However, without KVM_REQ_GLOBAL_CLOCK_UPDATE, it still fails on creating a VM. Maybe the init sequence still needs some rework.
That one confuses me. The crash is actually in debugfs, as it's registering the per-vm or per-vcpu stats. I can't imagine *how* that's occurring. Or see why the availability of TSC scaling would cause it to show up for you and not me. Can I have your .config please?
First thought would be that there's some change in the KVM structures and you have some stale object files using the old struct, but then I realise I forgot to actually *remove* the now-unused kvmclock_update_work from x86's struct kvm_arch anyway.
I'll try to reproduce, as I think I want to *know* what's going on here, even if I am going to drop that patch as mentioned in https://lore.kernel.org/kvm/a6723ac9e0169839cb33e8022a47c2de213866ac.camel@i...
Are you able to load that vmlinux in gdb and do

  (gdb) list *start_creating+0x80
  (gdb) list *kvm_create_vm_debugfs+0x28b
Thanks again.
On 4/23/2024 12:49 AM, David Woodhouse wrote:
My apologies, it turns out KVM_REQ_GLOBAL_CLOCK_UPDATE is not needed: today I can't reproduce the issue after removing it. Yesterday I thought it might be missing something related to the pfncache.
To be clear, with the above-mentioned change to kvm_scale_tsc(master_tsc_scaling_ratio), the selftest runs reliably regardless of whether TSC scaling is enabled.
On 23 April 2024 18:59:21 BST, "Chen, Zide" zide.chen@intel.com wrote:
Thanks. That version is now in my git tree and I have tested it myself on Skylake. Then I got distracted by reverse-engineering kvm_get_time_scale() so I could actually add some comments to it.
I'm still going to have to put the clock updates back though, for the non-masterclock case.
While I'm ripping all this up I guess I ought to rename it to "reference clock" too?
On Mon, 2024-04-22 at 15:02 -0700, Chen, Zide wrote:
the selftest works for me. I ran the test for 1000+ iterations, with and without TSC scaling, and the TEST_ASSERT(delta_corrected <= ±1) never got hit. This is awesome!
I think that with further care we can get even better than that.
Let's look at where that ±1ns tolerance comes from.
Consider a 3GHz TSC. That gives us three ticks per nanosecond. Each TSC value can be seen as (3n), (3n+1) or (3n+2) for a given nanosecond n.
If we take a new reference point at a (3n+2) TSC value and calculate the KVM clock from that, we *know* we're going to round down and lose two-thirds of a nanosecond.
So then we set the new KVM clock parameters to use that new reference point, and that's why we have to allow a disruption of up to a single nanosecond. In fact, I don't think it's ±1 ns, is it? It'll only ever be in the same direction (rounding down)?
But if we're careful about *which* TSC value we use as the reference point, we can reduce that error.
The TSC value we use should be *around* the current time, but what if we were to evaluate maybe the previous 100 TSC values. Pass *each* of them through the conversion to nanoseconds and use the one that comes *closest* to a precise nanosecond (nnnnnnnn.000).
It's even fairly easy to calculate those, because of the way the KVM clock ABI has us multiply and then shift right by 32 bits. We just need to look at those low 32 bits (the fractional nanosecond) *before* shifting them out of existence. Something like...
	uint64_t tsc_candidate, tsc_candidate_last, best_tsc;
	uint32_t frac_ns_min = 0xffffffff;
	uint64_t frac_ns;

	best_tsc = tsc_candidate = rdtsc();
	tsc_candidate_last = tsc_candidate - 100;

	while (tsc_candidate-- > tsc_candidate_last) {
		uint64_t guest_tsc = kvm_scale_tsc(tsc_candidate, ...);

		/* Shift *after* multiplication, not before as
		 * pvclock_scale_cycles() does, so the fractional
		 * nanosecond survives in the low 32 bits. */
		frac_ns = guest_tsc * hvclock->tsc_to_system_mul;
		if (hvclock->tsc_shift < 0)
			frac_ns >>= -hvclock->tsc_shift;
		else
			frac_ns <<= hvclock->tsc_shift;

		if ((uint32_t)frac_ns <= frac_ns_min) {
			frac_ns_min = (uint32_t)frac_ns;
			best_tsc = tsc_candidate;
		}
	}
	printk("Best TSC to use for reference point is %llu\n", best_tsc);
And then you calculate your CLOCK_MONOTONIC_RAW and guest KVM clock from *that* host TSC value, and thus minimise the discrepancies due to rounding down?
Aside from the fact that I literally just typed that into an email and it's barely even thought through, let alone tested... I'm just not sure it's even worth the runtime cost, for that ±1 ns in a rare case.
A slop of ±1ns is probably sufficient because over the past few years we've already shifted the definition of the KVM clock to *not* be NTP-corrected, and we leave guests to do fine-grained synchronization through other means anyway.
But I see talk of people offering a PPS signal to *hundreds* of guests on the same host simultaneously, just for them all to use it to calibrate the same underlying oscillator. Which is a little bit insane.
We *should* have a way for the host to do that once and then expose the precise time to its guests, in a much saner way than the KVM clock does. I'll look at adding something along those lines to this series too, which can be driven from the host's adjtimex() adjustments (which KVM already consumes), and fed into each guest's timekeeping as a PTP/PPS device or something.
linux-kselftest-mirror@lists.linaro.org