On Mon, 2024-04-22 at 17:54 +0200, Paolo Bonzini wrote:
On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse dwmw2@infradead.org wrote:
... especially considering that you did use a 64-bit integer here (though---please use u64 not uint64_t; and BTW if you want to add a patch to change kvm_get_time_scale() to u64, please do.
Meh, I'm used to programming in C. Yes, I *am* old enough to have been doing this since the last decade of the 1900s, but it *has* been a long time since 1999, and my fingers have learned :)
Oh, I am on the same page (working on both QEMU and Linux, adapting my muscle memory to the context sucks) but u64/s64 is the preferred spelling and I have been asked to use them before.
Ever heard of Jury Nullification... ? :)
Heh, looks like it was you who made it uint64_t, in 2016. In a commit (3ae13faac) which said "Prepare for improving the precision in the next patch"... which never came, AFAICT?
Yes, it was posted as https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redh... but not committed.
Ah, thanks. So that isn't about arithmetic precision, but about dealing with the mess we had where the KVM clock was afflicted by NTP skew.
We live in a much saner world now where it's simply based on the guest TSC (or, in pathological cases, the host's CLOCK_MONOTONIC_RAW.
As an aside, we discovered later that the patch you list as "Fixes" fixed another tricky bug: before, kvmclock could jump if the TSC is set within the 250 ppm tolerance that does not activate TSC scaling. This is possible after a first live migration, and then the second live migration used the guest TSC frequency *that userspace desired* instead of the *actual* TSC frequency.
Given our saner world where the KVM clock now *isn't* adjusted for NTP skew, that "bug" was probably better left unfixed. In fact, I may give some thought to reverting commit 78db6a503796 completely.
Perhaps we should call that "definition E". I think we're up to five now? But no, let's not add historical ones to the taxonomy :)
I believe Jack's KVM_SET_CLOCK_GUEST fixes the fundamental issue there of the clock jumping on migration. It's just a special case of the breakage in KVM_REQ_MASTERCLOCK_UPDATE, where the KVM clock has happily been running as a direct function of the guest TSC, and when we yank it back to some other definition when we create a new masterclock reference point.
With KVM_SET_CLOCK_GUEST, the KVM clock is restored as a function of the guest TSC, rather than based on realtime/CLOCK_MONOTONIC_RAW/etc.
So even though a new masterclock reference is taken in the new kernel (and the scaling factors in the pvclock may be slightly different, as we discussed in the comment you responded to), the ka->kvmclock_offset is adjusted so that the value of the KVM clock at *that* moment when the new reference is taken, is precisely what it would have been under the old kvmclock regime for the contemporary guest TSC value.
Before:
this_tsc_khz = __this_cpu_read(cpu_tsc_khz); if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) { tgt_tsc_khz = vcpu->virtual_tsc_khz; kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz, &vcpu->hv_clock.tsc_shift, &vcpu->hv_clock.tsc_to_system_mul); vcpu->hw_tsc_khz = this_tsc_khz;
After:
tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz);
// tgt_tsc_khz unchanged because TSC scaling was not enabled tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz, &vcpu->hv_clock.tsc_shift, &vcpu->hv_clock.tsc_to_system_mul); vcpu->hw_tsc_khz = tgt_tsc_khz;
So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in the second case it uses __this_cpu_read(cpu_tsc_khz).
This then caused a mismatch between the actual guest frequency and what is used by kvm_guest_time_update, which only becomes visible when migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK returns what _should have been_ the same value read by the guest, but it's wrong.
Hm? Until I fixed it in this series, KVM_GET_CLOCK didn't even *try* scaling via the guest's TSC frequency, did it? It just converted from the *host* TSC to nanoseconds (since master_kernel_now) directly. Which was even *more* broken :)