Hi!
This is the second version of the work to make TSC migration more accurate, as defined by Paolo at: https://www.spinics.net/lists/kvm/msg225525.html
I omitted most of the semi-offtopic points related to TSC that I raised in the previous RFC; we can continue that discussion there.
I do want to raise another thing that I almost forgot.
On AMD systems, the Linux kernel will mark the guest TSC as unstable unless invtsc is set, which it is on recent AMD hardware.
Take a look at 'unsynchronized_tsc()' to verify this.
This is another thing that IMHO should be fixed at least when running under KVM.
Note that I forgot to mention that X86_FEATURE_TSC_RELIABLE also short-circuits this code, thus giving another reason to enable it under KVM.
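For reference, a rough paraphrase of the behavior described above (an illustration of the checks being discussed, not the verbatim kernel code; see unsynchronized_tsc() in arch/x86/kernel/tsc.c for the real logic, which has more cases):

    int unsynchronized_tsc_sketch(void)
    {
            if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_unstable)
                    return 1;       /* no TSC, or already marked unstable */

            if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE))
                    return 0;       /* the short-circuit mentioned above */

            /*
             * Intel systems are normally all synchronized; non-Intel SMP
             * systems (e.g. AMD without invtsc) are assumed unsynchronized.
             */
            if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
                num_possible_cpus() > 1)
                    return 1;

            return 0;
    }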
Changes from V1:
- added KVM_TSC_STATE_TIMESTAMP_VALID instead of testing ns == 0
- allow diff < 0, because it is still better than capping it to 0
- updated tsc_msrs_test unit test to cover this feature
- refactoring
Patches to enable this feature in qemu are in the process of being sent to the qemu-devel mailing list.
Best regards,
    Maxim Levitsky
Maxim Levitsky (3):
  KVM: x86: implement KVM_{GET|SET}_TSC_STATE
  KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
  kvm/selftests: update tsc_msrs_test to cover KVM_X86_QUIRK_TSC_HOST_ACCESS
 Documentation/virt/kvm/api.rst             | 65 +++++++++++++
 arch/x86/include/uapi/asm/kvm.h            |  1 +
 arch/x86/kvm/x86.c                         | 92 ++++++++++++++++++-
 include/uapi/linux/kvm.h                   | 15 +++
 .../selftests/kvm/x86_64/tsc_msrs_test.c   | 79 ++++++++++++++--
 5 files changed, 237 insertions(+), 15 deletions(-)
These two new ioctls allow user space to capture and restore the guest's TSC state more precisely.
Both ioctls are meant to be used to accurately migrate guest TSC even when there is a significant downtime during the migration.
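For illustration, a minimal sketch of how a VMM could drive the two ioctls across a migration (error handling omitted; vcpu_fd is assumed to be an open KVM vCPU file descriptor, and struct kvm_tsc_state comes from this series):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    struct kvm_tsc_state state;

    /* Source host, after pausing the vCPU: snapshot TSC + CLOCK_REALTIME. */
    ioctl(vcpu_fd, KVM_GET_TSC_STATE, &state);

    /* ... transfer 'state' to the destination with the rest of the vCPU ... */

    /*
     * Destination host, before resuming: KVM_GET_TSC_STATE always sets
     * KVM_TSC_STATE_TIMESTAMP_VALID, so KVM advances state.tsc by the
     * CLOCK_REALTIME delta and the guest TSC keeps ticking across downtime.
     */
    ioctl(vcpu_fd, KVM_SET_TSC_STATE, &state);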
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 15 +++++++
 3 files changed, 153 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229f..ebecfe4b414ce 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
 experience inconsistent filtering behavior on MSR accesses.
+4.127 KVM_GET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+::
+
+  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
+  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+  struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+  };
+
+flags values for ``struct kvm_tsc_state``:
+
+``KVM_TSC_STATE_TIMESTAMP_VALID``
+
+  ``nsec`` contains nanoseconds from the unix epoch.
+  Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
+
+``KVM_TSC_STATE_TSC_ADJUST_VALID``
+
+  ``tsc_adjust`` contains a valid IA32_TSC_ADJUST value
+
+
+This ioctl allows user space to read the guest's IA32_TSC, IA32_TSC_ADJUST,
+and the current value of the host's CLOCK_REALTIME clock in nanoseconds
+since the unix epoch.
+
+
+4.128 KVM_SET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+This ioctl allows user space to reconstruct the guest's IA32_TSC and
+IA32_TSC_ADJUST values from state obtained in the past by KVM_GET_TSC_STATE
+on the same vCPU.
+
+If KVM_TSC_STATE_TIMESTAMP_VALID is set in flags, KVM will adjust the guest
+TSC value by the time that passed between the moment the CLOCK_REALTIME
+timestamp was saved in the struct and the current value of CLOCK_REALTIME,
+and set the guest's TSC to the new value.
+
+Otherwise KVM will set the guest TSC to the exact value given in the struct.
+
+If KVM_TSC_STATE_TSC_ADJUST_VALID is set, and the guest supports
+IA32_TSC_ADJUST, then its value will be set to the given value from the
+struct.
+
+It is assumed that either both ioctls will be run on the same machine,
+or that source and destination machines have synchronized clocks.
+
 5. The kvm_run structure
 ========================
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..9b8a2fe3a2398 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
+
+
+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+	struct timespec64 ts;
+
+	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+		*walltime_ns = timespec64_to_ns(&ts);
+		return;
+	}
+
+	*host_tsc = rdtsc();
+	*walltime_ns = ktime_get_real_ns();
+}
+
 #endif
 /*
@@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+	case KVM_CAP_PRECISE_TSC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
 		break;
+#ifdef CONFIG_X86_64
+	case KVM_GET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		u64 host_tsc;
+
+		struct kvm_tsc_state tsc_state = {
+			.flags = KVM_TSC_STATE_TIMESTAMP_VALID
+		};
+
+		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+		}
+
+		r = -EFAULT;
+		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+		u64 host_tsc, wall_nsec;
+
+		u64 new_guest_tsc, new_guest_tsc_offset;
+
+		r = -EFAULT;
+		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+			goto out;
+
+		kvm_get_walltime(&wall_nsec, &host_tsc);
+		new_guest_tsc = tsc_state.tsc;
+
+		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
+			s64 diff = wall_nsec - tsc_state.nsec;
+			if (diff >= 0)
+				new_guest_tsc += nsec_to_cycles(vcpu, diff);
+			else
+				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
+		}
+
+		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
+
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..bf4c38fd58291 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1169,6 +1170,16 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
+
+#define KVM_TSC_STATE_TIMESTAMP_VALID 1
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};
+
 /* For KVM_CAP_SW_TLB */
 #define KVM_MMU_FSL_BOOKE_NOHV	0
@@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS	_IO(KVMIO, 0xc7)
+/* Available with KVM_CAP_PRECISE_TSC */
+#define KVM_SET_TSC_STATE	_IOW(KVMIO, 0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE	_IOR(KVMIO, 0xc9, struct kvm_tsc_state)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
- case KVM_SET_TSC_STATE: {
struct kvm_tsc_state __user *user_tsc_state = argp;
struct kvm_tsc_state tsc_state;
u64 host_tsc, wall_nsec;
u64 new_guest_tsc, new_guest_tsc_offset;
r = -EFAULT;
if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
goto out;
kvm_get_walltime(&wall_nsec, &host_tsc);
new_guest_tsc = tsc_state.tsc;
if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
s64 diff = wall_nsec - tsc_state.nsec;
if (diff >= 0)
new_guest_tsc += nsec_to_cycles(vcpu, diff);
else
new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
}
new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause and that needs to be fixed.
Thanks,
tglx
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
- case KVM_SET_TSC_STATE: {
struct kvm_tsc_state __user *user_tsc_state = argp;
struct kvm_tsc_state tsc_state;
u64 host_tsc, wall_nsec;
u64 new_guest_tsc, new_guest_tsc_offset;
r = -EFAULT;
if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
goto out;
kvm_get_walltime(&wall_nsec, &host_tsc);
new_guest_tsc = tsc_state.tsc;
if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
s64 diff = wall_nsec - tsc_state.nsec;
if (diff >= 0)
new_guest_tsc += nsec_to_cycles(vcpu, diff);
else
new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
}
new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause and that needs to be fixed.
I don't disagree with you. As far as I know the main reasons that kvm tracks TSC per guest are
1. cases when host tsc is not stable (hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and revert to old way of doing things).
2. (theoretical) ability of the guest to introduce a per-core tsc offset by either using TSC_ADJUST (for which I recently got an idea to stop advertising this feature to the guest), or writing TSC directly, which is allowed by Intel's PRM:
"The RDMSR and WRMSR instructions read and write the time-stamp counter, treating the time-stamp counter as an ordinary MSR (address 10H). In the Pentium 4, Intel Xeon, and P6 family processors, all 64-bits of the time-stamp counter are read using RDMSR (just as with RDTSC). When WRMSR is used to write the time-stamp counter on processors before family [0FH], models [03H, 04H]: only the low-order 32-bits of the time-stamp counter can be written (the high-order 32 bits are cleared to 0). For family [0FH], models [03H, 04H, 06H]; for family [06H]], model [0EH, 0FH]; for family [06H]], DisplayModel [17H, 1AH, 1CH, 1DH]: all 64 bits are writable."
But other than that, I don't mind making the TSC offset a global per-VM thing. Paolo, what do you think about this?
Best regards,
    Maxim Levitsky
Thanks,
tglx
Maxim Levitsky mlevitsk@redhat.com writes:
But other than that, I don't mind making the TSC offset a global per-VM thing. Paolo, what do you think about this?
Not Paolo here, but personally I'd very much prefer we go this route. Unsynchronized TSCs are, unfortunately, still a thing: I was observing it on an AMD EPYC server just a couple of years ago (cured with a firmware update). We try to catch such situations in KVM instead of blowing up, but I believe this may still result in subtle bugs. Maybe we would be better off killing all VMs in case TSC ever gets unsynced (by default).
Another thing to add to this bucket is kvmclock, which is currently per-cpu. If we forbid TSC to un-synchronize (he-he), there is no point in doing that. We can as well use e.g. the Hyper-V TSC page method, which is per-VM. Creating another PV clock in KVM may be a hard sell, as all modern x86 CPUs support TSC scaling (in addition to TSC offsetting, which has been there for a long time), and when it's there we don't really need a PV clock to make migration possible.
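For context, a sketch of the per-VM Hyper-V TSC page read mentioned here (layout and formula per the Hyper-V TLFS; an illustrative user-space rendering, not the in-kernel implementation):

    #include <stdint.h>
    #include <x86intrin.h>

    struct hv_ref_tsc_page {
            volatile uint32_t tsc_sequence; /* 0 means: fall back to MSR read */
            uint32_t reserved;
            volatile uint64_t tsc_scale;    /* 64.64 fixed-point multiplier */
            volatile int64_t tsc_offset;
    };

    static uint64_t hv_read_ref_time_100ns(const struct hv_ref_tsc_page *p)
    {
            uint32_t seq;
            uint64_t tsc, time;

            do {
                    seq = p->tsc_sequence;
                    tsc = __rdtsc();
                    /* time = ((tsc * scale) >> 64) + offset, in 100ns units */
                    time = (uint64_t)(((unsigned __int128)tsc * p->tsc_scale) >> 64)
                           + (uint64_t)p->tsc_offset;
            } while (seq == 0 || seq != p->tsc_sequence);

            return time;
    }

Because the page is shared by all vCPUs of a VM, there is exactly one (scale, offset) pair to update on migration, which is the per-VM property being argued for here.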
On Mon, Dec 07 2020 at 14:16, Vitaly Kuznetsov wrote:
Maxim Levitsky mlevitsk@redhat.com writes:
But other than that, I don't mind making the TSC offset a global per-VM thing. Paolo, what do you think about this?
Not Paolo here, but personally I'd very much prefer we go this route. Unsynchronized TSCs are, unfortunately, still a thing: I was observing it on an AMD EPYC server just a couple of years ago (cured with a firmware update).
Right, this still happens occasionally, but for quite some time this has been 100% firmware silliness and not a fundamental property of the hardware anymore. Interestingly enough, the number of reports on Intel based systems of such wreckage (as observable via TSC_ADJUST) has gone down after we added support for it and yelled prominently. I wish AMD would have that as well.
We try to catch such situations in KVM instead of blowing up, but I believe this may still result in subtle bugs. Maybe we would be better off killing all VMs in case TSC ever gets unsynced (by default).
I just ran a guest on an old machine with unsynchronized TSCs and was able to observe clock monotonic going backwards between two threads pinned on two vCPUs, which _is_ bad. Getting unsynced clocks reliably under control is extremely hard.
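A user-space smoke test of the kind described here (a hypothetical illustration, not the actual test used; build with -pthread and adjust the CPU numbers to land on the two vCPUs):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static _Atomic uint64_t last_ns;

    static void *worker(void *arg)
    {
            long cpu = (long)arg;
            cpu_set_t set;
            struct timespec ts;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

            for (;;) {
                    /*
                     * Load the published value before reading the clock: the
                     * publisher read its clock before storing, so our clock
                     * read happens after theirs in real time and must not be
                     * smaller on a sane CLOCK_MONOTONIC.
                     */
                    uint64_t prev = atomic_load(&last_ns);

                    clock_gettime(CLOCK_MONOTONIC, &ts);
                    uint64_t now = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;

                    if (now < prev)
                            fprintf(stderr, "cpu%ld: went backwards by %llu ns\n",
                                    cpu, (unsigned long long)(prev - now));

                    atomic_store(&last_ns, now); /* racy overwrite is fine here */
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, worker, (void *)0L);
            pthread_create(&b, NULL, worker, (void *)1L);
            pthread_join(a, NULL);
            return 0;
    }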
Another thing to add to this bucket is kvmclock, which is currently per-cpu. If we forbid TSC to un-synchronize (he-he), there is no point in doing that. We can as well use e.g. the Hyper-V TSC page method, which is per-VM. Creating another PV clock in KVM may be a hard sell, as all modern x86 CPUs support TSC scaling (in addition to TSC offsetting, which has been there for a long time), and when it's there we don't really need a PV clock to make migration possible.
That should be the long term goal.
Thanks,
tglx
On Mon, Dec 07, 2020 at 06:41:41PM +0100, Thomas Gleixner wrote:
Right, this still happens occasionally, but for quite some time this has been 100% firmware silliness and not a fundamental property of the hardware anymore.
Ever since Nehalem (2008) TSC is synchronized on <= 2 sockets, and any observed deviation is firmware fail. I don't remember exactly when 4 socket and up got reliable.
(there's the physical hotplug case, but let's not make this complicated)
AMD has had Constant TSC since Barcelona which is roughly the same timeframe IIRC.
So basically every TSC fail in the last decade is due to firmware being shit.
On 07/12/20 18:41, Thomas Gleixner wrote:
Right, this still happens occasionally, but for quite some time this has been 100% firmware silliness and not a fundamental property of the hardware anymore.
It's still a fundamental property of old hardware. Last time I tried to kill support for processors earlier than Core 2, I had to revert it. That's older than Nehalem.
We try to catch such situations in KVM instead of blowing up, but I believe this may still result in subtle bugs. Maybe we would be better off killing all VMs in case TSC ever gets unsynced (by default).
I just ran a guest on an old machine with unsynchronized TSCs and was able to observe clock monotonic going backwards between two threads pinned on two vCPUs, which _is_ bad. Getting unsynced clocks reliably under control is extremely hard.
Using kvmclock? (Half serious: perhaps a good reason to have per-vCPU offsets is to be able to test what happens with unsynchronized TSCs...).
Paolo
On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
On 07/12/20 18:41, Thomas Gleixner wrote:
Right, this still happens occasionally, but for quite some time this has been 100% firmware silliness and not a fundamental property of the hardware anymore.
It's still a fundamental property of old hardware. Last time I tried to kill support for processors earlier than Core 2, I had to revert it. That's older than Nehalem.
Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.
On 10/12/20 13:14, Peter Zijlstra wrote:
On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
On 07/12/20 18:41, Thomas Gleixner wrote:
Right, this still happens occasionally, but for quite some time this has been 100% firmware silliness and not a fundamental property of the hardware anymore.
It's still a fundamental property of old hardware. Last time I tried to kill support for processors earlier than Core 2, I had to revert it. That's older than Nehalem.
Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.
On Core2, KVM guests pass TSC through kvmclock in order to get something usable and not incredibly slow.
Paolo
On Thu, Dec 10, 2020 at 01:22:02PM +0100, Paolo Bonzini wrote:
On 10/12/20 13:14, Peter Zijlstra wrote:
On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
On 07/12/20 18:41, Thomas Gleixner wrote:
Right, this still happens occasionally, but for quite some time this has been 100% firmware silliness and not a fundamental property of the hardware anymore.
It's still a fundamental property of old hardware. Last time I tried to kill support for processors earlier than Core 2, I had to revert it. That's older than Nehalem.
Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.
On Core2, KVM guests pass TSC through kvmclock in order to get something usable and not incredibly slow.
Which is incredibly wrong.
On Thu, Dec 10 2020 at 14:01, Peter Zijlstra wrote:
On Thu, Dec 10, 2020 at 01:22:02PM +0100, Paolo Bonzini wrote:
On 10/12/20 13:14, Peter Zijlstra wrote:
On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
On 07/12/20 18:41, Thomas Gleixner wrote:
Right, this still happens occasionally, but for quite some time this has been 100% firmware silliness and not a fundamental property of the hardware anymore.
It's still a fundamental property of old hardware. Last time I tried to kill support for processors earlier than Core 2, I had to revert it. That's older than Nehalem.
Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.
On Core2, KVM guests pass TSC through kvmclock in order to get something usable and not incredibly slow.
Which is incredibly wrong.
Core2 is really not something which should prevent making all of this correct and robust. That'd be not only wrong, that'd be outright insane.
Thanks,
tglx
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause and that needs to be fixed.
I don't disagree with you. As far as I know the main reasons that kvm tracks TSC per guest are
- cases when host tsc is not stable
(hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and revert to old way of doing things).
That's a trainwreck to begin with and I really would just not support it for anything new which aims to be more precise and correct. TSC has become pretty reliable over the years.
- (theoretical) ability of the guest to introduce a per-core tsc offset
by either using TSC_ADJUST (for which I recently got an idea to stop advertising this feature to the guest), or writing TSC directly, which is allowed by Intel's PRM:
For anything halfway modern, the write to TSC is reflected in TSC_ADJUST, which means you get the precise offset.
The general principle still applies from a system POV.
TSC base (systemwide view) - The sane case
TSC CPU = TSC base + TSC_ADJUST
The guest TSC base is a per guest constant offset to the host TSC.
TSC guest base = TSC host base + guest base offset
If the guest wants this different per vCPU by writing to the MSR or to TSC_ADJUST, then you can still have a per-vCPU offset in TSC_ADJUST which is the offset to the TSC base of the guest.
TSC guest CPU = TSC guest base + CPU TSC_ADJUST
==>
TSC guest CPU = TSC host base + guest base offset + CPU TSC_ADJUST
The normal and sane case is just TSC_ADJUST == 0.
It's very cleanly decomposable.
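In code form, the decomposition reads roughly like this (names are illustrative):

    /* per physical CPU; the sane case is TSC_ADJUST == 0 */
    u64 tsc_cpu(u64 tsc_base, u64 tsc_adjust)
    {
            return tsc_base + tsc_adjust;
    }

    /* per vCPU: one per-guest constant plus a per-vCPU TSC_ADJUST */
    u64 tsc_guest_cpu(u64 tsc_host_base, u64 guest_base_offset, u64 cpu_tsc_adjust)
    {
            u64 tsc_guest_base = tsc_host_base + guest_base_offset;

            return tsc_guest_base + cpu_tsc_adjust;
    }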
Thanks,
tglx
On Dec 7, 2020, at 8:38 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause and that needs to be fixed.
I don't disagree with you. As far as I know the main reasons that kvm tracks TSC per guest are
- cases when host tsc is not stable
(hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and revert to old way of doing things).
That's a trainwreck to begin with and I really would just not support it for anything new which aims to be more precise and correct. TSC has become pretty reliable over the years.
- (theoretical) ability of the guest to introduce a per-core tsc offset
by either using TSC_ADJUST (for which I recently got an idea to stop advertising this feature to the guest), or writing TSC directly, which is allowed by Intel's PRM:
For anything halfway modern, the write to TSC is reflected in TSC_ADJUST, which means you get the precise offset.
The general principle still applies from a system POV.
TSC base (systemwide view) - The sane case
TSC CPU = TSC base + TSC_ADJUST
The guest TSC base is a per guest constant offset to the host TSC.
TSC guest base = TSC host base + guest base offset
If the guest wants this different per vCPU by writing to the MSR or to TSC_ADJUST, then you can still have a per-vCPU offset in TSC_ADJUST which is the offset to the TSC base of the guest.
How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 8:38 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause and that needs to be fixed.
I don't disagree with you. As far as I know the main reasons that kvm tracks TSC per guest are
- cases when host tsc is not stable
(hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and revert to old way of doing things).
That's a trainwreck to begin with and I really would just not support it for anything new which aims to be more precise and correct. TSC has become pretty reliable over the years.
- (theoretical) ability of the guest to introduce a per-core tsc offset
by either using TSC_ADJUST (for which I recently got an idea to stop advertising this feature to the guest), or writing TSC directly, which is allowed by Intel's PRM:
For anything halfway modern, the write to TSC is reflected in TSC_ADJUST, which means you get the precise offset.
The general principle still applies from a system POV.
TSC base (systemwide view) - The sane case
TSC CPU = TSC base + TSC_ADJUST
The guest TSC base is a per guest constant offset to the host TSC.
TSC guest base = TSC host base + guest base offset
If the guest wants this different per vCPU by writing to the MSR or to TSC_ADJUST, then you can still have a per-vCPU offset in TSC_ADJUST which is the offset to the TSC base of the guest.
How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
This is one of the things I had in mind recently.
Even better, we can stop advertising TSC_ADJUST in CPUID to the guest and forbid it from writing it at all.
And if the guest insists and writes to the TSC itself, then indeed let it keep both pieces (or invoke some fallback code).
I have nothing against this solution.
Best regards,
    Maxim Levitsky
On Dec 7, 2020, at 9:00 AM, Maxim Levitsky mlevitsk@redhat.com wrote:
On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 8:38 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause and that needs to be fixed.
I don't disagree with you. As far as I know the main reasons that kvm tracks TSC per guest are
- cases when host tsc is not stable
(hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and revert to old way of doing things).
That's a trainwreck to begin with and I really would just not support it for anything new which aims to be more precise and correct. TSC has become pretty reliable over the years.
- (theoretical) ability of the guest to introduce a per-core tsc offset
by either using TSC_ADJUST (for which I recently got an idea to stop advertising this feature to the guest), or writing TSC directly, which is allowed by Intel's PRM:
For anything halfway modern, the write to TSC is reflected in TSC_ADJUST, which means you get the precise offset.
The general principle still applies from a system POV.
TSC base (systemwide view) - The sane case
TSC CPU = TSC base + TSC_ADJUST
The guest TSC base is a per guest constant offset to the host TSC.
TSC guest base = TSC host base + guest base offset
If the guest wants this different per vCPU by writing to the MSR or to TSC_ADJUST, then you can still have a per-vCPU offset in TSC_ADJUST which is the offset to the TSC base of the guest.
How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
This is one of the things I had in mind recently.
Even better, we can stop advertising TSC_ADJUST in CPUID to the guest and forbid it from writing it at all.
Seems reasonable to me.
It also seems okay for some MSRs to stop working after the guest enabled new PV timekeeping.
I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.
Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds in a race free way :). But I suppose that just exposing TAI and letting the guest deal with the TAI - UTC offset itself would get the job done just fine.
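To make the TAI variant concrete: since the 2017 leap second, TAI - UTC = 37 s, so a guest handed TAI plus the current offset derives UTC with one subtraction (a sketch; the helper name is illustrative):

    #include <stdint.h>

    /* tai_utc_offset_s is 37 as of the 2017 leap second; it only changes
     * when a new leap second is scheduled. */
    static inline int64_t utc_ns_from_tai(int64_t tai_ns, int32_t tai_utc_offset_s)
    {
            return tai_ns - (int64_t)tai_utc_offset_s * 1000000000LL;
    }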
On Mon, Dec 07, 2020 at 10:04:45AM -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 9:00 AM, Maxim Levitsky mlevitsk@redhat.com wrote:
On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 8:38 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause and that needs to be fixed.
I don't disagree with you. As far as I know the main reasons that kvm tracks TSC per guest are
- cases when host tsc is not stable
(hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and revert to old way of doing things).
That's a trainwreck to begin with and I really would just not support it for anything new which aims to be more precise and correct. TSC has become pretty reliable over the years.
- (theoretical) ability of the guest to introduce a per-core tsc offset
by either using TSC_ADJUST (for which I recently got an idea to stop advertising this feature to the guest), or writing TSC directly, which is allowed by Intel's PRM:
For anything halfway modern, the write to TSC is reflected in TSC_ADJUST, which means you get the precise offset.
The general principle still applies from a system POV.
TSC base (systemwide view) - The sane case
TSC CPU = TSC base + TSC_ADJUST
The guest TSC base is a per guest constant offset to the host TSC.
TSC guest base = TSC host base + guest base offset
If the guest wants this different per vCPU by writing to the MSR or to TSC_ADJUST, then you can still have a per-vCPU offset in TSC_ADJUST which is the offset to the TSC base of the guest.
How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
This is one of the things I had in mind recently.
Even better, we can stop advertising TSC_ADJUST in CPUID to the guest and forbid it from writing it at all.
Seems reasonable to me.
It also seems okay for some MSRs to stop working after the guest enabled new PV timekeeping.
I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.
Hi Andy,
Any reason why drivers/ptp/ptp_kvm.c does not work for you?
Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds in a race free way :). But I suppose that just exposing TAI and letting the guest deal with the TAI - UTC offset itself would get the job done just fine.
On Tue, Dec 8, 2020 at 6:23 AM Marcelo Tosatti mtosatti@redhat.com wrote:
On Mon, Dec 07, 2020 at 10:04:45AM -0800, Andy Lutomirski wrote:
I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.
Hi Andy,
Any reason why drivers/ptp/ptp_kvm.c does not work for you?
It looks like it tries to accomplish the right goal, but in a rather roundabout way. The host knows how to convert from TSC to CLOCK_REALTIME, and ptp_kvm.c exposes this data to the guest. But, rather than just making the guest use the same CLOCK_REALTIME data as the host, ptp_kvm.c seems to expose information to usermode that a user daemon could use to attempt (with some degree of error?) to make the guest kernel track CLOCK_REALTIME. This seems inefficient and dubiously accurate.
My feature request is for this to be fully automatic and completely coherent. I would like for a host user program and a guest user program to be able to share memory, run concurrently, and use the shared memory to exchange CLOCK_REALTIME values without ever observing the clock going backwards. This ought to be doable. Ideally the result should even be usable for Spanner-style synchronization assuming the host clock is good enough. Also, this whole thing should work without needing to periodically wake the guest to remain synchronized. If the guest sleeps for two minutes (full nohz-idle, no guest activity at all), the host makes a small REALTIME frequency adjustment, and then the guest runs user code that reads CLOCK_REALTIME, the guest clock should still be fully synchronized with the host. I don't think that ptp_kvm.c-style synchronization can do this.
tglx etc, I think that doing this really really nicely might involve promoting something like the current vDSO data structures to ABI -- a straightforward-ish implementation would be for the KVM host to export its vvar clock data to the guest and for the guest to use it, possibly with an offset applied. The offset could work a lot like timens works today.
--Andy
On Tue, Dec 08 2020 at 09:43, Andy Lutomirski wrote:
On Tue, Dec 8, 2020 at 6:23 AM Marcelo Tosatti mtosatti@redhat.com wrote:
It looks like it tries to accomplish the right goal, but in a rather roundabout way. The host knows how to convert from TSC to CLOCK_REALTIME, and ptp_kvm.c exposes this data to the guest. But, rather than just making the guest use the same CLOCK_REALTIME data as the host, ptp_kvm.c seems to expose information to usermode that a user daemon could use to attempt (with some degree of error?) to make the guest kernel track CLOCK_REALTIME. This seems inefficient and dubiously accurate.
My feature request is for this to be fully automatic and completely coherent. I would like for a host user program and a guest user program to be able to share memory, run concurrently, and use the shared memory to exchange CLOCK_REALTIME values without ever observing the clock going backwards. This ought to be doable. Ideally the result should even be usable for Spanner-style synchronization assuming the host clock is good enough. Also, this whole thing should work without needing to periodically wake the guest to remain synchronized. If the guest sleeps for two minutes (full nohz-idle, no guest activity at all), the host makes a small REALTIME frequency adjustment, and then the guest runs user code that reads CLOCK_REALTIME, the guest clock should still be fully synchronized with the host. I don't think that ptp_kvm.c-style synchronization can do this.
One issue here is that guests might want to run their own NTP/PTP. One reason to do that is that some people prefer the leap second smearing NTP servers.
tglx etc, I think that doing this really really nicely might involve promoting something like the current vDSO data structures to ABI -- a straightforward-ish implementation would be for the KVM host to export its vvar clock data to the guest and for the guest to use it, possibly with an offset applied. The offset could work a lot like timens works today.
Works nicely if the guest TSC is not scaled. But that means that on migration the raw TSC usage in the guest is borked because the new host might have a different TSC frequency.
If you use TSC scaling then the conversion needs to take TSC scaling into account which needs some thought. And the guest would need to read the host conversion from 'vdso data' and the scaling from the next page (per guest) and then still has to support timens. Doable but adds extra overhead on every time read operation.
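To make that composition concrete, a rough sketch (offsets, overflow handling and the actual KVM scaling format are omitted; all names are illustrative):

    /* KVM programs guest_tsc = (host_tsc * ratio) >> frac_bits; a guest
     * reader holding the host's (mult, shift) pair must first undo that. */
    static uint64_t guest_tsc_to_ns(uint64_t guest_tsc,
                                    uint64_t inv_ratio, unsigned int inv_frac_bits,
                                    uint64_t host_mult, unsigned int host_shift)
    {
            uint64_t host_tsc;

            /* undo the per-guest scaling to recover host cycles ... */
            host_tsc = (uint64_t)(((unsigned __int128)guest_tsc * inv_ratio)
                                  >> inv_frac_bits);

            /* ... then apply the host's NTP-adjusted conversion to ns */
            return (uint64_t)(((unsigned __int128)host_tsc * host_mult)
                              >> host_shift);
    }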
If you want to avoid that you are back to the point where you need to chase all guest data when the host NTP/PTP adjusts the host side. Chasing and updating all this stuff in the tick was the reason why I was fighting the idea of clock realtime in namespaces.
Thanks,
tglx
On Dec 8, 2020, at 11:25 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Tue, Dec 08 2020 at 09:43, Andy Lutomirski wrote:
On Tue, Dec 8, 2020 at 6:23 AM Marcelo Tosatti mtosatti@redhat.com wrote:
It looks like it tries to accomplish the right goal, but in a rather roundabout way. The host knows how to convert from TSC to CLOCK_REALTIME, and ptp_kvm.c exposes this data to the guest. But, rather than just making the guest use the same CLOCK_REALTIME data as the host, ptp_kvm.c seems to expose information to usermode that a user daemon could use to attempt (with some degree of error?) to make the guest kernel track CLOCK_REALTIME. This seems inefficient and dubiously accurate.
My feature request is for this to be fully automatic and completely coherent. I would like for a host user program and a guest user program to be able to share memory, run concurrently, and use the shared memory to exchange CLOCK_REALTIME values without ever observing the clock going backwards. This ought to be doable. Ideally the result should even be usable for Spanner-style synchronization assuming the host clock is good enough. Also, this whole thing should work without needing to periodically wake the guest to remain synchronized. If the guest sleeps for two minutes (full nohz-idle, no guest activity at all), the host makes a small REALTIME frequency adjustment, and then the guest runs user code that reads CLOCK_REALTIME, the guest clock should still be fully synchronized with the host. I don't think that ptp_kvm.c-style synchronization can do this.
One issue here is that guests might want to run their own NTP/PTP. One reason to do that is that some people prefer the leap second smearing NTP servers.
I would hope that using this part would be optional on the guest’s part. Guests should be able to use just the CLOCK_MONOTONIC_RAW part or fancier stuff at their option.
(Hmm, it would, in principle, be possible for a guest to use the host’s TAI but still smear leap seconds. Even without virt, smearing could be a per-timens option.)
tglx etc, I think that doing this really really nicely might involve promoting something like the current vDSO data structures to ABI -- a straightforward-ish implementation would be for the KVM host to export its vvar clock data to the guest and for the guest to use it, possibly with an offset applied. The offset could work a lot like timens works today.
Works nicely if the guest TSC is not scaled. But that means that on migration the raw TSC usage in the guest is borked because the new host might have a different TSC frequency.
If you use TSC scaling then the conversion needs to take TSC scaling into account which needs some thought. And the guest would need to read the host conversion from 'vdso data' and the scaling from the next page (per guest) and then still has to support timens. Doable but adds extra overhead on every time read operation.
Is the issue that scaling would result in a different guest vs host frequency? Perhaps we could limit each physical machine to exactly two modes: unscaled (use TSC ticks, convert in software) and scaled to nanoseconds (CLOCK_MONOTONIC_RAW is RDTSC + possible offset). Then the host could produce its data structures in exactly those two formats and export them as appropriate.
If you want to avoid that you are back to the point where you need to chase all guest data when the host NTP/PTP adjusts the host side. Chasing and updating all this stuff in the tick was the reason why I was fighting the idea of clock realtime in namespaces.
I think that, if we can arrange for a small, bounded number of pages generated by the host, then this problem isn’t so bad.
Hmm, leap second smearing is just a different linear mapping. I’m not sure how leap second smearing should interact with timens, but it seems to me that the host should be able to produce four data pages (scaled vs unscaled and smeared vs unsmeared) and one per-guest/timens offset page (where the offset applies to MONOTONIC and MONOTONIC_RAW only) and cover all bases. Or do people actually want to offset their TAI and/or REALTIME, and what would that even mean if the offset crosses a leap second?
(I haven’t thought about the interaction of any of this with ART.)
On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote:
On Dec 8, 2020, at 11:25 AM, Thomas Gleixner tglx@linutronix.de wrote: One issue here is that guests might want to run their own NTP/PTP. One reason to do that is that some people prefer the leap second smearing NTP servers.
I would hope that using this part would be optional on the guest’s part. Guests should be able to use just the CLOCK_MONOTONIC_RAW part or fancier stuff at their option.
(Hmm, it would, in principle, be possible for a guest to use the host’s TAI but still smear leap seconds. Even without virt, smearing could be a per-timens option.)
No. Don't even think about it. Read the thread:
https://lore.kernel.org/r/20201030110229.43f0773b@jawa
all the way through the end and then come up with a real proposal which solves all of the issues mentioned there.
I might be missing the obvious, but before you make proposals about timekeeping out of thin air, please do your homework. You would not ask these questions otherwise.
If it would be that simple we wouldn't be discussing this in the first place.
Sorry for being blunt, but this has been discussed to death already.
It can be solved on the theory level, but it's not practical.
You _cannot_ make leap second smearing or different notions of clock realtime an isolated problem. We have
  CLOCK_MONOTONIC
  CLOCK_BOOTTIME
  CLOCK_REALTIME
  CLOCK_TAI
They share one fundamental property:
All frequency adjustments done by NTP/PTP/PPS or whatever affect _ALL_ of them in the exactly same way.
I.e. leap second smearing whether you like it or not is affecting all of them and there is nothing we can do about that. Why?
1) Because it's wrong to begin with. It creates a separate universe of CLOCK_REALTIME and therefore of CLOCK_TAI because they are strictly coupled by definition.
2) It's user space ABI. adjtimex() can make the kernel do random crap. So if you extend that to time namespaces (which is pretty much the same as a guest time space) then you have to solve the following problems:
A) The tick-based gradual adjustment of adjtimex() to avoid time jumping around has to be done for every time universe which has different parameters than the host.
Arguably this can be solved by having a seqcount-based magic hack which forces the first time namespace consumer into updating the per-time-namespace notion of time instead of doing it from the host tick, but that requires fully synchronized nested sequence counts, and if you extend this to nested virt it creates an exponential problem.
B) What to do about enforced time jumps on the host (settimeofday, adjtimex)?
C) Once you have solved A and B, explain how timers (nanosleep, interval timers, ...), which are all user space ABI and have the fundamental guarantee to not expire early, can be handled in a sane and scalable way.
Once you have a coherent answer for all of the above I'm happy to step down and hand over. In that case I'm more than happy not to have to deal with the inevitable regression reports.
tglx etc, I think that doing this really really nicely might involve promoting something like the current vDSO data structures to ABI -- a straightforward-ish implementation would be for the KVM host to export its vvar clock data to the guest and for the guest to use it, possibly with an offset applied. The offset could work a lot like timens works today.
Works nicely if the guest TSC is not scaled. But that means that on migration the raw TSC usage in the guest is borked because the new host might have a different TSC frequency.
If you use TSC scaling then the conversion needs to take TSC scaling into account which needs some thought. And the guest would need to read the host conversion from 'vdso data' and the scaling from the next page (per guest) and then still has to support timens. Doable but adds extra overhead on every time read operation.
Is the issue that scaling would result in a different guest vs host frequency? Perhaps we could limit each physical machine to exactly two modes: unscaled (use TSC ticks, convert in software) and scaled to nanoseconds (CLOCK_MONOTONIC_RAW is RDTSC + possible offset). Then the host could produce its data structures in exactly those two formats and export them as appropriate.
The latter - nanoseconds scaling - is the only reasonable solution but then _ALL_ involved hosts must agree on that.
If you want to avoid that you are back to the point where you need to chase all guest data when the host NTP/PTP adjusts the host side. Chasing and updating all this stuff in the tick was the reason why I was fighting the idea of clock realtime in namespaces.
I think that, if we can arrange for a small, bounded number of pages generated by the host, then this problem isn’t so bad.
Wishful thinking unless we have a very strict contract about:
- scaling to 1 GHz, aka nanoseconds
- all guests agree with the host defined management of clock REALTIME and TAI
Hmm, leap second smearing is just a different linear mapping. I’m not sure how leap second smearing should interact with timens, but it seems to me that the host should be able to produce four data pages (scaled vs unscaled and smeared vs unsmeared) and one per-guest/timens offset page (where the offset applies to MONOTONIC and MONOTONIC_RAW only) and cover all bases. Or do people actually want to offset their TAI and/or REALTIME, and what would that even mean if the offset crosses a leap second?
See above.
And yes people want to run different time universes where TAI != TAI.
(I haven’t thought about the interaction of any of this with ART.)
Obviously not :) And I'm pretty sure that nobody else did, but ART is the least of my worries so far.
The current VM migration approach is: Get it done no matter what, we deal with the fallout later. Which means endless tinkering because you can't fix essential design fails after the fact.
Thanks,
tglx
On Tue, Dec 8, 2020 at 4:19 PM Thomas Gleixner tglx@linutronix.de wrote:
On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote:
On Dec 8, 2020, at 11:25 AM, Thomas Gleixner tglx@linutronix.de wrote: One issue here is that guests might want to run their own NTP/PTP. One reason to do that is that some people prefer the leap second smearing NTP servers.
I would hope that using this part would be optional on the guest’s part. Guests should be able to use just the CLOCK_MONOTONIC_RAW part or fancier stuff at their option.
(Hmm, it would, in principle, be possible for a guest to use the host’s TAI but still smear leap seconds. Even without virt, smearing could be a per-timens option.)
No. Don't even think about it. Read the thread:
https://lore.kernel.org/r/20201030110229.43f0773b@jawa
all the way through the end and then come up with a real proposal which solves all of the issues mentioned there.
You're misunderstanding me, which is entirely reasonable, since my description was crap. In particular, what I meant by smearing is not at all what's done today. Let me try again. The thing below is my proposal, not necessarily a description of exactly what happens now.
(I read most of that thread, and I read most of this thread, and I've hacked on the time code, cursed at the KVM code, modified the KVM code, cursed at the KVM code some more, etc. None of which is to say that I have a full understanding of every possible timekeeping nuance, but I'm pretty sure I can at least pretend to understand some of it.)
We have some time source that we can read (e.g. TSC). Let's call it read_time(). It returns an integer (64-bits would be nice, but we'll take what we can get). From the output of read_time(), Linux user programs, and the kernel itself (and guests perhaps, see below) would like to produce various outputs. Each of them is protected by a seqlock that I'll omit in the descriptions below. The operations below are all protected by a seqlock retry loop. Also, when I say * below, I mean the usual calculation with a multiplication and a shift.
All of these are only valid if t_start <= read_time() <= t_end, and they all assume that read_time() hasn't wrapped and gotten into that interval again. There is nothing at all we can do in software if we wrap like this. t_end isn't necessarily something we compute explicitly --- it might just be the case that, if read_time() > t_end, our arithmetic overflows and we return garbage. But t_end might be a real thing on architectures where vdso_cycles_ok() actually does something (sigh, x86).
CLOCK_MONOTONIC_RAW: not affected by NTP, adjtimex, etc.

  return mult[monotonic_raw] * (read_time() - t_start) + offset[monotonic_raw];
CLOCK_MONOTONIC: This is never affected by leap-second smearing. If userspace tries to smear it in the new mode, userspace gets to keep all the pieces.

  return mult[monotonic] * (read_time() - t_start) + offset[monotonic];
CLOCK_TAI: This is not smeared.

  return mult[tai] * (read_time() - t_start) + offset[tai];
CLOCK_SANE_REALTIME: This is not smeared either.

  return mult[sane_realtime] * (read_time() - t_start) + offset[sane_realtime];
And yes, we require that mult[monotonic] == mult[tai] == mult[sane_realtime].
CLOCK_SMEARED_REALTIME: This is a leap-second-smeared variant of CLOCK_SANE_REALTIME.

  return mult[smeared_realtime] * (read_time() - t_start) + offset[smeared_realtime];
CLOCK_REALTIME: maps to CLOCK_SANE_REALTIME or CLOCK_SMEARED_REALTIME depending on user preference. Doing this without an extra branch somewhere might take a bit of thought.
If t > t_end, then we fall back to a syscall if we're in user mode and we fall back to hypercall or we just spin if we're in the kernel. But see below.
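Putting the definitions above together, the reader would look roughly like this (a sketch under the stated assumptions; the seqlock helpers and the per-clock arrays are illustrative, not an existing API):

    uint64_t read_clock(int clk)
    {
            unsigned int seq;
            uint64_t t, ns;

            do {
                    seq = read_seqbegin(&clock_seqlock);
                    t = read_time();
                    if (t < t_start || t > t_end)
                            return clock_fallback(clk); /* syscall/hypercall */
                    ns = mult[clk] * (t - t_start) + offset[clk];
            } while (read_seqretry(&clock_seqlock, seq));

            return ns;
    }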
As far as I can tell, if the kernel were to do something utterly asinine like adding some arbitrary value to TSC_ADJUST on all CPUs, the kernel could do so correctly by taking the seqlock, making the change, updating everything, and releasing the seqlock. This would be nuts, but it's more or less the same thing that happens when a VM migrates. So I think a VM could migrate a guest without any particular magic, except that there's a potential race if the old and new systems happen to have close enough seqlock values that the guest might start reading on the old host, finish on the new host, see the same seqlock value, and end up with utter garbage. One way to mitigate this would be, in paravirt mode, to have an extra per-guest page that contains a count of how many times the guest has migrated.
Timens would work a lot like it does today, but the mechanism that tells the vdso code to use timens might need tweaking.
I could easily be missing something that prevents this from working, but I'm not seeing any fundamental problems.
If we want to get fancy, we can make a change that I've contemplated for a while -- we could make t_end explicit and have two copies of all these data structures. The reader would use one copy if t < t_change and a different copy if t >= t_change. This would allow NTP-like code in usermode to schedule a frequency shift to start at a specific time. With some care, it would also allow the timekeeping code to update the data structures without causing clock_gettime() to block while the timekeeping code is running on a different CPU.
One other thing that might be worth noting: there's another thread about "vmgenid". It's plausible that it's worth considering stopping the guest or perhaps interrupting all vCPUs to allow it to take some careful actions on migration for reasons that have nothing to do with timekeeping.
Andy,
On Tue, Dec 08 2020 at 20:08, Andy Lutomirski wrote:
On Tue, Dec 8, 2020 at 4:19 PM Thomas Gleixner tglx@linutronix.de wrote:
On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote: all the way through the end and then come up with a real proposal which solves all of the issues mentioned there.
You're misunderstanding me, which is entirely reasonable, since my description was crap. In particular, what I meant by smearing is not at all what's done today. Let me try again. The thing below is my proposal, not necessarily a description of exactly what happens now.
Fair enough. /me rewinds grump and starts over.
All of these are only valid if t_start <= read_time() <= t_end, and they all assume that read_time() hasn't wrapped and gotten into that interval again. There is nothing at all we can do in software if we wrap like this. t_end isn't necessarily something we compute explicitly --- it might just be the case that, if read_time() > t_end, our arithmetic overflows and we return garbage. But t_end might be a real thing on architectures where vdso_cycles_ok() actually does something (sigh, x86).
If t > t_end, then we fall back to a syscall if we're in user mode and we fall back to hypercall or we just spin if we're in the kernel. But see below.
Yes, we could do that.
CLOCK_SMEARED_REALTIME: This is a leap-second-smeared variant of CLOCK_SANE_REALTIME.

  return mult[smeared_realtime] * (read_time() - t_start) + offset[smeared_realtime];
CLOCK_REALTIME: maps to CLOCK_SANE_REALTIME or CLOCK_SMEARED_REALTIME depending on user preference. Doing this without an extra branch somewhere might take a bit of thought.
Plus adding support for clock_*(CLOCK_DISTORTED_TIME) and making that work correctly. Plus dealing with all the other interesting problems vs. file time stamps and whatever. Time is all over the place and not just in clock_gettime(CLOCK*).
But what's more problematic is the basic requirement that time all over the place has to be consistent.
On machines which use DISTORTED_REALTIME everything _IS_ consistent within the distorted universe they created. It's still inconsistent vs. the outside, but that's unsolvable and not our problem.
TLDR: Do not even think about opening pandoras box.
As far as I can tell, if the kernel were to do something utterly asinine like adding some arbitrary value to TSC_ADJUST on all CPUs, the kernel could do so correctly by taking the seqlock, making the change, updating everything, and releasing the seqlock.
Plus a few other things, but yes that's similar to the scheme I outlined. Using stop_machine() ensures that _all_ possible ways to wreck things are covered.
This would be nuts, but it's more or less the same thing that happens when a VM migrates. So I think a VM could migrate a guest without any particular magic, except that there's a potential race if the old and new systems happen to have close enough seqlock values that the guest might start reading on the old host, finish on the new host, see the same seqlock value, and end up with utter garbage. One way to mitigate this would be, in paravirt mode, to have an extra per-guest page that contains a count of how many times the guest has migrated.
Timens would work a lot like it does today, but the mechanism that tells the vdso code to use timens might need tweaking.
I could easily be missing something that prevents this from working, but I'm not seeing any fundamental problems.
It can be made work.
If we want to get fancy, we can make a change that I've contemplated for a while -- we could make t_end explicit and have two copies of all these data structures. The reader would use one copy if t < t_change and a different copy if t >= t_change.
See below.
This would allow NTP-like code in usermode to schedule a frequency shift to start at a specific time.
That's an orthogonal problem and can be done without changing the reader side.
With some care, it would also allow the timekeeping code to update the data structures without causing clock_gettime() to block while the timekeeping code is running on a different CPU.
It still has to block, i.e. retry, because the data set becomes invalid when t_end is reached. So the whole thing would be:
  do {
          seq = read_seqcount_latch();
          data = select_data(seq);
          delta = read_clocksource() - data->base;
          if (delta >= data->max_delta)
                  continue;
          ....
  } while (read_seqcount_latch_retry());
TBH, I like the idea for exactly one reason: It increases robustness.
For X86 we already have the comparison for dealing with TSC < base which would be covered by
if (delta >= data->max_delta) continue;
automatically. Non-x86 gains this extra conditional, but I think it's worth paying that price.
It won't solve the VM migration problem on its own, though. You still have to be careful about the inner workings of everything related to timekeeping itself.
One other thing that might be worth noting: there's another thread about "vmgenid". It's plausible that it's worth stopping the guest, or perhaps interrupting all vCPUs, to let the guest take some careful actions on migration for reasons that have nothing to do with timekeeping.
How surprising. Who could have thought about that?
OMG, virtualization seems to have gone off into a virtual reality long ago.
Thanks,
tglx
On Dec 9, 2020, at 2:14 AM, Thomas Gleixner tglx@linutronix.de wrote:
But what's more problematic is the basic requirement that time all over the place has to be consistent.
On machines which use DISTORTED_REALTIME everything _IS_ consistent within the distorted universe they created. It's still inconsistent vs. the outside, but that's unsolvable and none of our problems.
TLDR: Do not even think about opening Pandora's box.
This could be a per-machine/per-vm setting that nonetheless uses the same underlying implementation. There are, sadly, lots of people using smeared time, and there will probably be VM hosts that simultaneously have both styles of guest. Supporting full PV time for both would be nice. Obviously this gets a bit screwy if they are using a paravirt fs, but it’s also a problem with NFS, etc. So maybe the nasty corners could be narrow enough to just say “don’t do that”.
If we want to get fancy, we can make a change that I've contemplated for a while -- we could make the changeover time t_change explicit and have two copies of all these data structures. The reader would use one copy if t < t_change and the other copy if t >= t_change.
See below.
This would allow NTP-like code in usermode to schedule a frequency shift to start at a specific time.
That's an orthogonal problem and can be done without changing the reader side.
Really? Right now, the switch happens whenever the kernel takes the seqlock, which it doesn’t have exact control over. But I meant something a little different:
With some care, it would also allow the timekeeping code to update the data structures without causing clock_gettime() to block while the timekeeping code is running on a different CPU.
It still has to block, i.e. retry, because the data set becomes invalid when t_end is reached. So the whole thing would be:
do {
	seq = read_seqcount_latch();
	data = select_data(seq);
	delta = read_clocksource() - data->base;
	if (delta >= data->max_delta)
		continue;
	....
} while (read_seqcount_latch_retry());
TBH, I like the idea for exactly one reason: It increases robustness.
I like the max_delta for robustness, too.
What do you have in mind for select_data()? Is the idea that select_data() returns something like &data[seq & 1]?
But I actually meant something a little bit different: you would use delta >= data->max_delta as an indication that you need to look at the other copy. Perhaps the lower two bits of the seqcount would work like:
00: both copies are valid, but start with the first copy.
10: only the first copy is valid.
01: both copies are valid, but start with the second copy.
11: only the second copy is valid.
You'd read it like this (with all the bugs that I surely haven't fixed):
do {
	seq = read_seqcount();
	data = &data_array[seq & 1];
	clock = read_clocksource();
	delta = clock - data->base;
	if (delta >= data->max_delta) {
		if (seq & 2)
			continue;
		data = &data_array[(seq + 1) & 1]; // <-- the other copy
		delta = clock - data->base;
		if (delta >= data->max_delta)
			continue;
	}
	...;
} while (seq == read_seqcount());
This has two main benefits. First, it allows the timekeeping code to run concurrently with readers, which is nice for tail latency -- readers would only ever need to spin if the timekeeper falls behind, intentionally disables both copies, or somehow manages to run one iteration for each reader attempt and livelocks the reader (the latter is very unlikely). Second, it allows the timekeeping code to literally schedule an update to occur at a precise clocksource tick, which seems like it could make the timekeeping code simpler and more robust.
(If the timekeeper wants to simultaneously disable both copies, it sets one copy's max_delta to zero and uses seq to disable the other copy.)
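For concreteness, the matching writer side might look like this (a sketch under the bit encoding above; data_array, seqcount, and the barrier placement are illustrative names from the reader loop, not existing kernel API):

struct clock_data {
	uint64_t base;
	uint64_t max_delta;
	uint64_t mult;
	uint64_t offset;
};

extern struct clock_data data_array[2];
extern volatile uint32_t seqcount;

void publish_update(const struct clock_data *next)
{
	uint32_t seq = seqcount;
	uint32_t cur = seq & 1;

	/* Stage the new parameters in the copy readers do not start from. */
	data_array[cur ^ 1] = *next;
	__sync_synchronize();

	/*
	 * Publish: the low bit now selects the new copy, and bit 1 stays
	 * clear, so a reader that ran past the old copy's max_delta can
	 * fall through to the new copy. The generation bump in the upper
	 * bits makes any reader that straddled this store retry once.
	 */
	seqcount = (seq & ~3u) + 4 + (cur ^ 1);
	__sync_synchronize();
}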
--Andy
On Mon, 2020-12-07 at 10:04 -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 9:00 AM, Maxim Levitsky mlevitsk@redhat.com wrote:
On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 8:38 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:

From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping, and for that to work it has to be synchronized.

Why would this be different on virt? Just because it's virt or what?

Migration is a guest-wide thing and you're not migrating single vCPUs.

This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause, and that needs to be fixed.
I don't disagree with you. As far as I know, the main reasons that KVM tracks the TSC per vCPU are:
- cases when host tsc is not stable
(hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and reverting to the old way of doing things).
That's a trainwreck to begin with and I really would just not support it for anything new which aims to be more precise and correct. TSC has become pretty reliable over the years.
- (theoretical) ability of the guest to introduce a per-core TSC offset

by either using TSC_ADJUST (for which I recently got an idea: stop advertising this feature to the guest), or by writing TSC directly, which is allowed by Intel's PRM:
For anything halfways modern the write to TSC is reflected in TSC_ADJUST which means you get the precise offset.
The general principle still applies from a system POV.
TSC base (systemwide view) - The sane case
TSC CPU = TSC base + TSC_ADJUST
The guest TSC base is a per guest constant offset to the host TSC.
TSC guest base = TSC host base + guest base offset
If the guest wants this to differ per vCPU by writing to the MSR or to TSC_ADJUST, then you can still have a per-vCPU offset in TSC_ADJUST, which is the offset to the guest's TSC base.
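Spelled out as plain arithmetic, the relations above are just (a sketch; the helper names are illustrative):

#include <stdint.h>

/* TSC CPU = TSC base + TSC_ADJUST (the sane, synchronized case) */
static inline uint64_t cpu_tsc(uint64_t tsc_base, int64_t tsc_adjust)
{
	return tsc_base + (uint64_t)tsc_adjust;
}

/* TSC guest base = TSC host base + guest base offset */
static inline uint64_t guest_base_tsc(uint64_t host_tsc_base,
				      uint64_t guest_base_offset)
{
	return host_tsc_base + guest_base_offset;
}

/*
 * A vCPU that wrote TSC or TSC_ADJUST keeps a per-vCPU offset relative
 * to the guest's TSC base, so a migration only has to move the single
 * guest-wide base offset.
 */
static inline uint64_t guest_vcpu_tsc(uint64_t host_tsc_base,
				      uint64_t guest_base_offset,
				      int64_t vcpu_tsc_adjust)
{
	return guest_base_tsc(host_tsc_base, guest_base_offset) +
	       (uint64_t)vcpu_tsc_adjust;
}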
How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
This is one of the things I had in mind recently.
Even better, we can stop advertising TSC_ADJUST in CPUID to the guest and forbid it from writing it at all.
Seems reasonable to me.
It also seems okay for some MSRs to stop working after the guest enabled new PV timekeeping.
This is a very good idea!
I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.
Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds in a race-free way :). But I suppose that just exposing TAI and letting the guest deal with the TAI - UTC offset itself would get the job done just fine.
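To make the request concrete, a record of roughly this shape would let the guest derive CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME (purely an illustrative sketch, not an existing or proposed ABI):

#include <stdint.h>

struct kvmclock_ntp_sketch {
	uint32_t seq;			/* update generation */
	uint64_t tsc_base;		/* host TSC the fields below refer to */
	uint32_t raw_mult, raw_shift;	/* TSC -> CLOCK_MONOTONIC_RAW scaling */
	uint32_t mono_mult, mono_shift;	/* NTP-corrected CLOCK_MONOTONIC scaling */
	uint64_t mono_offset_ns;
	int64_t realtime_offset_ns;	/* REALTIME = MONOTONIC + this */
	int32_t tai_offset_s;		/* TAI - UTC, covers leap seconds */
};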
This is a good idea too. As I understand it, it justifies a new purpose for kvmclock: no longer focused on correcting TSC shortcomings (an unstable or unscalable TSC), but on features like these. I like that idea.
Best regards, Maxim Levitsky
On Mon, Dec 07, 2020 at 05:38:36PM +0100, Thomas Gleixner wrote:
For anything halfways modern the write to TSC is reflected in TSC_ADJUST which means you get the precise offset.
IIRC this is true for everything that has TSC_ADJUST.
On Sun, Dec 06, 2020 at 05:19:16PM +0100, Thomas Gleixner wrote:
On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
	case KVM_SET_TSC_STATE: {
		struct kvm_tsc_state __user *user_tsc_state = argp;
		struct kvm_tsc_state tsc_state;
		u64 host_tsc, wall_nsec;
		u64 new_guest_tsc, new_guest_tsc_offset;

		r = -EFAULT;
		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
			goto out;

		kvm_get_walltime(&wall_nsec, &host_tsc);
		new_guest_tsc = tsc_state.tsc;

		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
			s64 diff = wall_nsec - tsc_state.nsec;
			if (diff >= 0)
				new_guest_tsc += nsec_to_cycles(vcpu, diff);
			else
				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
		}

		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
From a timekeeping POV and the guest's expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's used systemwide for timekeeping, and for that to work it has to be synchronized.

Why would this be different on virt? Just because it's virt or what?

Migration is a guest-wide thing and you're not migrating single vCPUs.

This hackery just papers over the underlying design fail that KVM looks at the TSC per vCPU, which is the root cause, and that needs to be fixed.
It already does it: The unified TSC offset is kept at kvm->arch.cur_tsc_offset.
On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky mlevitsk@redhat.com wrote:
These two new ioctls allow to more precisely capture and restore the guest's TSC state.
Both ioctls are meant to be used to accurately migrate guest TSC even when there is a significant downtime during the migration.
Suggested-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
 Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 15 +++++++
 3 files changed, 153 insertions(+)
[...]
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..9b8a2fe3a2398 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }

+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+	struct timespec64 ts;
+
+	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+		*walltime_ns = timespec64_to_ns(&ts);
+		return;
+	}
+
+	*host_tsc = rdtsc();
+	*walltime_ns = ktime_get_real_ns();
+}
+
 #endif
 /*
@@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+	case KVM_CAP_PRECISE_TSC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
 		break;
+#ifdef CONFIG_X86_64
+	case KVM_GET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		u64 host_tsc;
+
+		struct kvm_tsc_state tsc_state = {
+			.flags = KVM_TSC_STATE_TIMESTAMP_VALID
+		};
+
+		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+		}

[...]

+		r = -EFAULT;
+		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+		u64 host_tsc, wall_nsec;
+		u64 new_guest_tsc, new_guest_tsc_offset;
+
+		r = -EFAULT;
+		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+			goto out;
+
+		kvm_get_walltime(&wall_nsec, &host_tsc);
+		new_guest_tsc = tsc_state.tsc;
+
+		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
+			s64 diff = wall_nsec - tsc_state.nsec;
+			if (diff >= 0)
+				new_guest_tsc += nsec_to_cycles(vcpu, diff);
+			else
+				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
+		}
+
+		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
How would a VMM maintain the phase relationship between guest TSCs using these ioctls?
As bugged as the old way of doing things is (i.e. the magic handling of IA32_TSC), it was at least possible to keep guest TSCs in sync across a live migration, so long as you satisfied the conditions under which KVM decides to fudge the TSCs on your behalf. However, if we migrate the TSCs by value, with an optional timestamp to account for elapsed time, the guest TSC offset is likely to be inconsistent across vCPUs, as the offset calculations above do not use a fixed host TSC timestamp across all vCPUs.
The reason I'd suggested yielding the TSC offset controls to userspace is that they're unaffected by such variations in per-vCPU calculations. Not only that, userspace can decide how it wants TSCs to be handled across a migration explicitly, instead of relying on the above computation being done in the kernel.
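As a sketch of that alternative (KVM_GET/SET_TSC_OFFSET are hypothetical per-vCPU controls here, not the API proposed in this series), the VMM would apply one correction, computed once per VM, to every saved offset, which preserves the inter-vCPU phase by construction:

/* Hypothetical accessors for a per-vCPU guest TSC offset. */
uint64_t get_tsc_offset(int vcpu_fd);			/* KVM_GET_TSC_OFFSET */
void set_tsc_offset(int vcpu_fd, uint64_t offset);	/* KVM_SET_TSC_OFFSET */

void restore_offsets(const int *vcpu_fds, const uint64_t *saved_offsets,
		     int nr_vcpus, int64_t delta)
{
	/*
	 * delta is computed once for the whole VM, e.g. elapsed
	 * CLOCK_REALTIME converted to guest cycles minus the change in
	 * (scaled) host TSC. Applying the same delta to every vCPU means
	 * their relative phase cannot drift, whatever delta's value is.
	 */
	int i;

	for (i = 0; i < nr_vcpus; i++)
		set_tsc_offset(vcpu_fds[i], saved_offsets[i] + (uint64_t)delta);
}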
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
How is this ioctl's handling of the TSC_ADJUST msr an improvement over KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the intended API, as it isn't involved in your computation above.
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..bf4c38fd58291 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193

 #ifdef KVM_CAP_IRQ_ROUTING

@@ -1169,6 +1170,16 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };

+#define KVM_TSC_STATE_TIMESTAMP_VALID 1
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};

 /* For KVM_CAP_SW_TLB */
 #define KVM_MMU_FSL_BOOKE_NOHV 0

@@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc7)

+/* Available with KVM_CAP_PRECISE_TSC */
+#define KVM_SET_TSC_STATE _IOW(KVMIO, 0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE _IOR(KVMIO, 0xc9, struct kvm_tsc_state)

 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
--
2.26.2
On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky mlevitsk@redhat.com wrote:
[...]
How would a VMM maintain the phase relationship between guest TSCs using these ioctls?
By using the nanosecond timestamp.
While I did make it optional in V2, that was done solely to allow setting the TSC to 0 on (re)boot from qemu, and for cases where qemu migrates from a VM where the feature is not enabled. In those cases the TSC is set to the given value exactly, just as you can do today with KVM_SET_MSRS. In all other cases the nanosecond timestamp will be given.

When userspace uses the nanosecond timestamp, the phase relationship is not only maintained but exact, even if the TSC reads were not synchronized, and even if their restore on the target wasn't synchronized either.
Here is an example:
Let's assume that TSC on source/target is synchronized, and that the guest TSC is synchronized as well.
Let's call the guest TSC frequency F (the guest TSC increments by F each second).

We do KVM_GET_TSC_STATE on vcpu0 and receive (t0, tsc0). We do KVM_GET_TSC_STATE on vcpu1 one second later (exaggerated) and receive (t0 + 1s, tsc0 + F).

We do KVM_SET_TSC_STATE at t0 + 10s on vcpu0 after migration, and vcpu0's guest TSC is set to tsc0 + F*[(t0 + 10s) - t0] = tsc0 + 10*F.

We do KVM_SET_TSC_STATE at t0 + 12s on vcpu1 (also exaggerated) and get [tsc0 + F] + F*[(t0 + 12s) - (t0 + 1s)] = tsc0 + 12*F.

Since vcpu0 has run for 2 more seconds by then, both vCPUs now read tsc0 + 12*F.
I use kvm's own functions to read CLOCK_REALTIME, which are written in such a way that you first read the host TSC once and then convert it to nanoseconds by scaling/offsetting it as the kernel would, so no random error is introduced here.

So except for numerical errors (which are unavoidable anyway and should be negligible), this algorithm should both keep the TSCs in sync and keep their absolute time reference as accurate as the clock synchronization between the source and the target is.
(an offset between source and destination clocks will affect all the TSCs in the same manner, as long as both source and destination clocks are stable)
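The arithmetic of the example can be checked mechanically (a self-contained sketch; one tick of "t" per second, arbitrary origins):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t F = 1000000000ull;	/* example guest TSC frequency */
	const uint64_t t0 = 100, tsc0 = 5000;	/* arbitrary origins */

	/* save: vcpu0 at t0, vcpu1 one second later */
	uint64_t nsec_v0 = t0, tsc_v0 = tsc0;
	uint64_t nsec_v1 = t0 + 1, tsc_v1 = tsc0 + F;

	/* restore: vcpu0 at t0 + 10s, vcpu1 at t0 + 12s */
	uint64_t new_v0 = tsc_v0 + F * ((t0 + 10) - nsec_v0);	/* tsc0 + 10F */
	uint64_t new_v1 = tsc_v1 + F * ((t0 + 12) - nsec_v1);	/* tsc0 + 12F */

	/* by t0 + 12s, vcpu0 has run for 2 more seconds since its restore */
	assert(new_v0 + 2 * F == new_v1);
	printf("both vCPUs read tsc0 + 12F = %llu at t0 + 12s\n",
	       (unsigned long long)new_v1);
	return 0;
}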
	if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
			vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
How is this ioctl's handling of the TSC_ADJUST msr an improvement over KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the intended API, as it isn't involved in your computation above.
It's more a refactoring thing. The goal is to avoid 'magic' handling of host accesses in KVM_{GET,SET}_MSRS and instead make them behave the same way as if the guest had read that msr. That can be useful for debugging and such.
The second patch adds a KVM quirk, which should be disabled when the new API is used.
When disabled, it makes it hard to use KVM_{GET,SET}_MSRS to set both TSC and TSC_ADJUST to given values at the same time, since these msrs are tied to each other when the guest writes them, and the quirk disables the special (untied) write we had for host writes to these msrs.
Think of these new ioctls as a way of saving and restoring the internal TSC state, without even bothering to think about what is inside. Kind of like we save/restore the nested state.
Best regards, Maxim Levitsky
On Tue, Dec 8, 2020 at 5:13 AM Maxim Levitsky mlevitsk@redhat.com wrote:
On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky mlevitsk@redhat.com wrote:
[...]
Let's assume that TSC on source/target is synchronized, and that the guest TSC is synchronized as well.
Can this assumption be reasonably made though?
NTP could very well step or scale CLOCK_REALTIME when we are in the middle of saving or restoring TSCs, which could possibly result in observable drift between vCPUs. Calculating elapsed time between save/restore once per VM would avoid this issue altogether.
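A sketch of that variant (illustrative types and helpers, not the API in this series): read CLOCK_REALTIME and the host TSC exactly once, derive a single correction, and apply it to every vCPU:

#include <stdint.h>

struct vm_tsc_record {
	uint64_t nsec;		/* CLOCK_REALTIME at save time, one per VM */
	uint64_t *vcpu_tsc;	/* per-vCPU guest TSC values at save time */
	int nr_vcpus;
};

/* Illustrative helpers the kernel/VMM would supply. */
void get_walltime(uint64_t *now_ns, uint64_t *host_tsc);
int64_t ns_to_guest_cycles(int64_t ns);
void set_guest_tsc(int vcpu, uint64_t guest_tsc, uint64_t host_tsc);

void restore_vm_tscs(const struct vm_tsc_record *rec)
{
	uint64_t now_ns, host_tsc;
	int64_t diff;
	int i;

	get_walltime(&now_ns, &host_tsc);	/* read exactly once per VM */
	diff = (int64_t)(now_ns - rec->nsec);	/* one correction for all vCPUs */

	for (i = 0; i < rec->nr_vcpus; i++)
		set_guest_tsc(i, rec->vcpu_tsc[i] + ns_to_guest_cycles(diff),
			      host_tsc);
}

An NTP step between two set_guest_tsc() calls can no longer skew one vCPU against another, because the correction is fixed before the loop.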
[...]
I use kvm's own functions to read the CLOCK_REALTIME, which are done in such a way that you first read host TSC once and then convert it to nanoseconds by scaling/offsetting it as the kernel would, thus there is no random error introduced here.
Agreed. In fact, my suggestion of yielding TSC offset controls to userspace falls short in this regard, since userspace can't make the same guarantee that the clockread was derived from its paired TSC value.
[...]
It's more a refactoring thing. The goal is to avoid 'magic' handling of host accesses in KVM_{GET,SET}_MSRS and instead make them behave the same way as if the guest had read that msr. That can be useful for debugging and such.

The second patch adds a KVM quirk, which should be disabled when the new API is used.

When disabled, it makes it hard to use KVM_{GET,SET}_MSRS to set both TSC and TSC_ADJUST to given values at the same time, since these msrs are tied to each other when the guest writes them, and the quirk disables the special (untied) write we had for host writes to these msrs.

Think of these new ioctls as a way of saving and restoring the internal TSC state, without even bothering to think about what is inside. Kind of like we save/restore the nested state.
I agree that the quirk is useful for the guest touching TSC and TSC_ADJUST, but host writes to the TSC_ADJUST MSR are unaffected by any sync issues. As such, the existing KVM_{GET,SET}_MSRS plumbing that VMMs are using seems sufficient.
+cc Sean's new handle
On Tue, Dec 8, 2020 at 9:57 AM Oliver Upton oupton@google.com wrote:
[...]
On Tue, 2020-12-08 at 09:58 -0600, Oliver Upton wrote:
+cc Sean's new handle
On Tue, Dec 8, 2020 at 9:57 AM Oliver Upton oupton@google.com wrote:
On Tue, Dec 8, 2020 at 5:13 AM Maxim Levitsky mlevitsk@redhat.com wrote:
On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky mlevitsk@redhat.com wrote:
[...]
Let's assume that TSC on source/target is synchronized, and that the guest TSC is synchronized as well.
Can this assumption be reasonably made though?
NTP could very well step or scale CLOCK_REALTIME when we are in the middle of saving or restoring TSCs, which could possibly result in observable drift between vCPUs. Calculating elapsed time between save/restore once per VM would avoid this issue altogether.
You raise a good point, which adds even more justification to the solution that Thomas is proposing.
Let's call the guest TSC frequency F (guest TSC increments by F each second)
We do KVM_GET_TSC_STATE on vcpu0 and receive (t0,tsc0). We do KVM_GET_TSC_STATE on vcpu1 after 1 second passed (exaggerated) and receive (t0 + 1s, tsc0 + F)
We do KVM_SET_TSC_STATE at t0 + 10s on vcpu0 after migration, and vcpu0's guest tsc is set to tsc0 + F[(t0 + 10s) - t0] = tsc0 + 10*F
We do KVM_SET_TSC_STATE at t0 + 12s on vcpu1 (also exaggerated) and get [tsc0 + F] + F[(t0 + 12s) - (t0 + 1s)] = tsc0 + 12*F
Since 2 seconds passed by, both vCPUs have now their TSC set to tsc0 + 12*F.
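A minimal C sketch of the arithmetic in this example (the helper names are made up for illustration; the kernel side does the equivalent conversion with nsec_to_cycles()):

#include <stdint.h>

static uint64_t guest_tsc_hz;	/* F in the example above */

/* Convert a CLOCK_REALTIME delta to guest TSC cycles. */
static uint64_t ns_to_guest_cycles(uint64_t ns)
{
	return (uint64_t)(((unsigned __int128)ns * guest_tsc_hz) / 1000000000ULL);
}

/*
 * Restore-side computation: given the (nsec, tsc) pair captured by
 * KVM_GET_TSC_STATE and CLOCK_REALTIME at restore time, compute the
 * TSC value that KVM_SET_TSC_STATE installs.
 */
static uint64_t restored_tsc(uint64_t saved_nsec, uint64_t saved_tsc,
			     uint64_t now_nsec)
{
	int64_t diff = (int64_t)(now_nsec - saved_nsec);

	return diff >= 0 ? saved_tsc + ns_to_guest_cycles(diff)
			 : saved_tsc - ns_to_guest_cycles((uint64_t)-diff);
}

Plugging the example in: vcpu0 restored at t0 + 10s gets tsc0 + 10*F, and by the time vcpu1 is restored to tsc0 + 12*F at t0 + 12s, vcpu0's freely running TSC has also reached tsc0 + 12*F.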
I use KVM's own functions to read CLOCK_REALTIME, which work by reading the host TSC once and then converting it to nanoseconds by scaling/offsetting it as the kernel would, so no random error is introduced here.
Agreed. In fact, my suggestion of yielding TSC offset controls to userspace falls short in this regard, since userspace can't make the same guarantee that the clockread was derived from its paired TSC value.
So except for numerical errors (which are unavoidable anyway, and should be negligible), this algorithm should both keep the TSC in sync and keep its absolute time reference as accurate as the clock synchronization between the source and destination hosts.
(an offset between source and destination clocks will affect all the TSCs in the same manner, as long as both source and destination clocks are stable)
For as bugged as the old way of doing things is (i.e. the magic handling of IA32_TSC), it was at least possible to keep guest TSCs in sync across a live migration so long as you satisfied the conditions where KVM decides to fudge the TSCs on your behalf. However, if we migrate the TSCs by value with an optional timestamp to account for elapsed time, it would appear that the guest TSC offset is likely to be inconsistent across vCPUs, as the offset calculations above do not use a fixed host TSC timestamp across all vCPUs. The reason I'd suggested yielding control of the TSC offset to userspace is that it is unaffected by such variations in per-vCPU calculations. Not only that, userspace can decide how it wants TSCs to be handled across a migration explicitly, instead of relying on the above computation being done in the kernel.
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
How is this ioctl's handling of the TSC_ADJUST MSR an improvement over KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the intended API, as it isn't involved in your computation above.
It's more of a refactoring thing. The goal is to avoid 'magic' handling of host accesses in KVM_{GET,SET}_MSRS and instead make them behave the same way as if the guest had read that MSR. That can be useful for debug and such.
The second patch adds a KVM quirk, which should be disabled when the new API is used.
When disabled, it makes it hard to use KVM_{GET,SET}_MSRS to set both TSC and TSC_ADJUST to given values at the same time, since these MSRs are tied to each other when the guest writes them, and the quirk disables the special (untied) write path we had for host writes to these MSRs.
Think of these new ioctls as a way of saving and restoring the internal TSC state, without even bothering to think about what is inside. Kind of like we save/restore the nested state.
I agree that the quirk is useful for the guest touching TSC and TSC_ADJUST, but host writes to the TSC_ADJUST MSR are unaffected by any sync issues. As such, the existing KVM_{GET,SET}_MSRS plumbing that VMMs are using seems sufficient.
Best regards, Maxim Levitsky
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..bf4c38fd58291 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1169,6 +1170,16 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+#define KVM_TSC_STATE_TIMESTAMP_VALID 1
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};
+
 /* For KVM_CAP_SW_TLB */
 #define KVM_MMU_FSL_BOOKE_NOHV	0
 
@@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
+/* Available with KVM_CAP_PRECISE_TSC */
+#define KVM_SET_TSC_STATE _IOW(KVMIO, 0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE _IOR(KVMIO, 0xc9, struct kvm_tsc_state)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
--
2.26.2
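For illustration only — not part of the patch — here is a minimal sketch of how a VMM might drive the two ioctls across a migration, assuming the uapi additions above (error handling omitted):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* From the uapi hunk above; repeated only to keep the sketch
 * self-contained, since these definitions are not in released headers. */
struct kvm_tsc_state {
	__u32 flags;
	__u64 nsec;
	__u64 tsc;
	__u64 tsc_adjust;
};
#define KVM_TSC_STATE_TIMESTAMP_VALID 1
#define KVM_SET_TSC_STATE _IOW(KVMIO, 0xc8, struct kvm_tsc_state)
#define KVM_GET_TSC_STATE _IOR(KVMIO, 0xc9, struct kvm_tsc_state)

/* Source: capture per-vCPU TSC state after pausing the guest. */
static int save_tsc_state(int vcpu_fd, struct kvm_tsc_state *state)
{
	memset(state, 0, sizeof(*state));
	return ioctl(vcpu_fd, KVM_GET_TSC_STATE, state);
}

/* Destination: KVM_GET_TSC_STATE always sets
 * KVM_TSC_STATE_TIMESTAMP_VALID, so on restore KVM advances the TSC by
 * the CLOCK_REALTIME delta that elapsed during the migration. */
static int restore_tsc_state(int vcpu_fd, struct kvm_tsc_state *state)
{
	return ioctl(vcpu_fd, KVM_SET_TSC_STATE, state);
}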
On Tue, Dec 08 2020 at 13:13, Maxim Levitsky wrote:
On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
How would a VMM maintain the phase relationship between guest TSCs using these ioctls?
Let's call the guest TSC frequency F (guest TSC increments by F each second) We do KVM_GET_TSC_STATE on vcpu0 and receive (t0,tsc0). We do KVM_GET_TSC_STATE on vcpu1 after 1 second passed (exaggerated) and receive (t0 + 1s, tsc0 + F)
Why?
You freeze the VM and store the realtime timestamp of doing that. At that point, assuming a fully synchronized host system, the only interesting thing to store is the guest offset, which is the same on all vCPUs and is known already.
So on restore the only thing which needs to be adjusted is the guest wide offset.
newoffset = oldoffset + (now - tfreeze)
Then set newoffset for all vCPUs. Anything else is complexity for no value and bound to fall apart in hard to debug ways.
The offset is still the same for all vCPUs whether you can restore them in the same nanosecond or whether you need 3 minutes for each one. It does not matter because when you restore vCPU1 3 minutes after vCPU0 then TSC has advanced 3 minutes as well. It's still correct from the guest POV.
Even if you support TSCADJUST and let the guest write to it, that does not change the per-guest offset at all. TSCADJUST is per [v]CPU and adds on top:
tscvcpu = tsc_host + guest_offset + TSC_ADJUST
Scaling is just orthogonal and does not change any of this.
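(A minimal sketch of the per-VM restore described above — the structures are hypothetical VMM-side bookkeeping, and the frozen interval is assumed to be already converted to guest TSC cycles:)

#include <stdint.h>

struct vcpu { uint64_t tsc_offset; };
struct vm   { struct vcpu *vcpus; int nr_vcpus; };

/* newoffset = oldoffset + (now - tfreeze), everything in guest cycles. */
static void restore_guest_tsc_offset(struct vm *vm, uint64_t old_offset,
				     uint64_t frozen_cycles)
{
	uint64_t new_offset = old_offset + frozen_cycles;

	/* The offset is guest-wide: every vCPU gets the same value, no
	 * matter how long each individual restore takes. */
	for (int i = 0; i < vm->nr_vcpus; i++)
		vm->vcpus[i].tsc_offset = new_offset;
}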
Thanks,
tglx
On Tue, 2020-12-08 at 17:40 +0100, Thomas Gleixner wrote:
Even if you support TSCADJUST and let the guest write to it, that does not change the per-guest offset at all. TSCADJUST is per [v]CPU and adds on top:
tscvcpu = tsc_host + guest_offset + TSC_ADJUST
Scaling is just orthogonal and does not change any of this.
I agree with this, and I think that this is what we will end up doing. Paolo, what do you think about this?
Best regards, Maxim Levitsky
On 08/12/20 18:08, Maxim Levitsky wrote:
I agree with this, and I think that this is what we will end up doing. Paolo, what do you think about this?
Yes, you can have a VM ioctl that saves/restores cur_tsc_nsec and cur_tsc_write. The restore side would loop on all vcpus.
However, it is not so easy: 1) it would have to be usable only if KVM_X86_QUIRK_TSC_HOST_ACCESS is false; 2) it would require kvm->arch.nr_vcpus_matched_tsc == kvm->online_vcpus (which basically means that userspace didn't mess up the TSC configuration), and return -EINVAL otherwise.
Also, while at it let's burn and pour salt on the support for KVM_SET_TSC_KHZ unless TSC scaling is supported, together with vcpu->tsc_catchup and all the "tolerance" crap that is in kvm_set_tsc_khz. And initialize vcpu->arch.virtual_tsc_khz to kvm->arch.last_tsc_khz before calling kvm_synchronize_tsc.
Paolo
On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
Yes, you can have a VM ioctl that saves/restores cur_tsc_nsec and cur_tsc_write. The restore side would loop on all vcpus.
Why would the restore need to run on all VCPUs though?
The picture that I have in mind is that we will restore the cur_tsc_nsec/cur_tsc_write pair once per VM, and then restore the TSC_ADJUST value on all vCPUs as if the guest wrote it (with the quirk disabled), which will restore all the horrible things that the guest did to its TSC.
Since TSC_ADJUST is enabled by default, if the user disables it in the CPUID we might as well just disable the whole thing, make the TSC read-only or something similar.
This way we don't need to worry about TSC writes as those will be reflected in the TSC_ADJUST.
However, it is not so easy: 1) it would have to be usable only if KVM_X86_QUIRK_TSC_HOST_ACCESS is false, 2) it would fail if kvm->arch.nr_vcpus_matched_tsc == kvm->online_vcpus (which basically means that userspace didn't mess up the TSC configuration). If not, it would return -EINVAL.
Note though that we don't track guest tsc/tsc_adjust writes via the TSC sync code anymore, and with the quirk disabled we don't track them even for host writes, thus condition (2) will always be true (the only call left to kvm_synchronize_tsc is in kvm_arch_vcpu_postcreate).
However I indeed see no reason to allow the new per-VM API when the masterclock is disabled, and therefore using cur_tsc_nsec/cur_tsc_write is reasonable.
cur_tsc_nsec should IMHO be converted to absolute time (CLOCK_REALTIME or similar) — or should the conversion be done in userspace? I don't know yet if I can convert between different POSIX clocks in a race- and error-free manner (e.g. get the offset between them).
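(One userspace way to estimate the offset between two POSIX clocks with bounded, rather than zero, error is to bracket a read of one clock with two reads of the other; a sketch:)

#include <stdint.h>
#include <time.h>

static int64_t ts_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Returns (b - a) in nanoseconds, accurate to roughly (ta2 - ta1) / 2;
 * retry and keep the attempt with the smallest ta2 - ta1 to tighten it. */
static int64_t clock_offset_ns(clockid_t a, clockid_t b)
{
	struct timespec ta1, tb, ta2;

	clock_gettime(a, &ta1);
	clock_gettime(b, &tb);
	clock_gettime(a, &ta2);

	return ts_to_ns(&tb) - (ts_to_ns(&ta1) + ts_to_ns(&ta2)) / 2;
}

For example, clock_offset_ns(CLOCK_REALTIME, CLOCK_TAI) estimates the offset mentioned above.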
Also, while at it let's burn and pour salt on the support for KVM_SET_TSC_KHZ unless TSC scaling is supported, together with vcpu->tsc_catchup and all the "tolerance" crap that is in kvm_set_tsc_khz. And initialize vcpu->arch.virtual_tsc_khz to kvm->arch.last_tsc_khz before calling kvm_synchronize_tsc.
tsc_catchup is enabled when the host TSC is unstable, so that the guest TSC at least roughly follows real time (which the host kernel gets by other means).

We push the guest TSC forward roughly each time a VM entry from userspace happens:

(On each kvm_arch_vcpu_load we raise a KVM_REQ_GLOBAL_CLOCK_UPDATE request, which (with a small delay) makes all vCPUs go through kvm_guest_time_update, which pushes the guest TSC forward.)
However we also have the 'tsc_always_catchup' mode which is indeed something I would like to remove.
It is a poor man's simulation of TSC scaling, which is enabled when the host doesn't support TSC scaling but we were asked to run at a frequency faster than the host TSC frequency.
This way the guest TSC runs slower than it should, but on each VM exit we push the guest TSC offset forward so that on average the guest TSC doesn't lag behind.
Unless backward compatibility is an issue, I have no objections to remove that code.
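(To illustrate the always-catchup idea — this is not KVM's actual code: the guest counts at the host rate, and at each exit the offset is recomputed so that on average the guest TSC matches the requested, faster rate:)

#include <stdint.h>

/* Offset to apply at this exit: cycles the guest should have
 * accumulated since t0 at its requested rate, minus what the host TSC
 * actually accumulated. */
static uint64_t catchup_offset(uint64_t host_tsc_now, uint64_t t0_host_tsc,
			       uint64_t guest_hz, uint64_t host_hz)
{
	uint64_t elapsed = host_tsc_now - t0_host_tsc;
	uint64_t want = (uint64_t)(((unsigned __int128)elapsed * guest_hz) / host_hz);

	return want - elapsed;	/* grows over time when guest_hz > host_hz */
}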
The tolerance thing might be needed. On many systems (including my new 3970X), the hardware has no means of reporting the unscaled host TSC frequency, so the kernel has to measure it, and since the measurement is not 100% accurate, a slightly different value is used every time the host boots.
Thus without a small tolerance we would always have to use TSC scaling when migrating, even between two identical systems. I don't know if this is an issue.
Best regards, Maxim Levitsky
On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
These two new ioctls allow more precise capture and restore of the guest's TSC state.
Both ioctls are meant to be used to accurately migrate the guest TSC even when there is significant downtime during the migration.
This ioctl allows user space to read the guest's IA32_TSC and IA32_TSC_ADJUST, and the current value of the host's CLOCK_REALTIME clock in nanoseconds since the unix epoch.
Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as a time base, but for TSC it should not be necessary.
If KVM_TSC_STATE_TIMESTAMP_VALID is set in flags, KVM will adjust the guest TSC value by the time that passed between the moment the CLOCK_REALTIME timestamp was saved in the struct and the current value of CLOCK_REALTIME, and set the guest's TSC to the new value.
This introduces the wraparound bug in Linux timekeeping, doesn't it?
On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as a time base, but for TSC it should not be necessary.
CLOCK_REALTIME is used as an absolute time reference that should match on both computers. I could have used CLOCK_TAI instead for example.
The reference allows accounting for the time that passed between saving and restoring the TSC, as explained above.
This introduces the wraparound bug in Linux timekeeping, doesn't it?
It does. Could you prepare a reproducer for this bug so I get a better idea of what you are talking about?
I assume you need a very long (days' worth) jump to trigger this bug, and for such a case we can either work around it in qemu/the kernel or fix it in the guest kernel; I strongly prefer the latter.
Thomas, what do you think about it?
Best regards, Maxim Levitsky
On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
This introduces the wraparound bug in Linux timekeeping, doesn't it?
Which bug?
It does. Could you prepare a reproducer for this bug so I get a better idea of what you are talking about?
I assume you need a very long (days' worth) jump to trigger this bug, and for such a case we can either work around it in qemu/the kernel or fix it in the guest kernel; I strongly prefer the latter.
Thomas, what do you think about it?
For one I have no idea which bug you are talking about and if the bug is caused by the VMM then why would you "fix" it in the guest kernel.
Aside of that I think I made it pretty clear what the right thing to do is.
Thanks,
tglx
On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
For one I have no idea which bug you are talking about and if the bug is caused by the VMM then why would you "fix" it in the guest kernel.
The "bug" is that if VMM moves a hardware time counter (tsc or anything else) forward by large enough value in one go, then the guest kernel will supposingly have an overflow in the time code. I don't consider this to be a buggy VMM behavior, but rather a kernel bug that should be fixed (if this bug actually exists)
Purely in theory this can even happen on real hardware if, for example, an SMM handler blocks a CPU from running for a long duration, or a hardware debugging interface does, or some other hardware-transparent sleep mechanism kicks in and blocks a CPU from running. (We do handle this gracefully for S3/S4.)
Aside of that I think I made it pretty clear what the right thing to do is.
This is orthogonal to the issue of the 'bug'. Here we are not talking about per-vCPU TSC offsets, which, as I said, I agree with you would be very nice to get rid of.
We are talking about the fact that the TSC can jump forward by an arbitrarily large value if the migration took an arbitrary amount of time, which (assuming that the bug is real) can crash the guest kernel.
This will happen even if we use a per-VM global TSC offset.
So what do you think?
Best regards, Maxim Levitsky
On Tue, Dec 8, 2020 at 8:26 AM Maxim Levitsky mlevitsk@redhat.com wrote:
The "bug" is that if VMM moves a hardware time counter (tsc or anything else) forward by large enough value in one go, then the guest kernel will supposingly have an overflow in the time code. I don't consider this to be a buggy VMM behavior, but rather a kernel bug that should be fixed (if this bug actually exists)
IIRC we introduced mul_u64_u32_shift() for roughly this reason, but we don't seem to be using it in the relevant code paths. We should be able to use the same basic math with wider intermediates to allow very large intervals between updates.
On Tue, Dec 08 2020 at 09:33, Andy Lutomirski wrote:
IIRC we introduced mul_u64_u32_shift() for roughly this reason, but we don't seem to be using it in the relevant code paths. We should be able to use the same basic math with wider intermediates to allow very large intervals between updates.
That's 128 bit math and _NO_ we are not going there for various reasons.
Hint1: There is a world outside of x86_64 and TSC. Hint2: Even on x86_64 with TSC it comes with performance overhead.
Thanks,
tglx
On Tue, Dec 08, 2020 at 06:25:13PM +0200, Maxim Levitsky wrote:
The "bug" is that if VMM moves a hardware time counter (tsc or anything else) forward by large enough value in one go, then the guest kernel will supposingly have an overflow in the time code. I don't consider this to be a buggy VMM behavior, but rather a kernel bug that should be fixed (if this bug actually exists)
It exists.
We are talking about the fact that the TSC can jump forward by an arbitrarily large value if the migration took an arbitrary amount of time, which (assuming that the bug is real) can crash the guest kernel.
QE reproduced it.
On Tue, Dec 08 2020 at 15:12, Marcelo Tosatti wrote:
It exists.
In the VMM.
We are talking about the fact that the TSC can jump forward by an arbitrarily large value if the migration took an arbitrary amount of time, which (assuming that the bug is real) can crash the guest kernel.
QE reproduced it.
Sure, that's what QE is about. Just your conclusion is wrong.
Thanks,
tglx
On Tue, Dec 08 2020 at 18:25, Maxim Levitsky wrote:
On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
For one I have no idea which bug you are talking about and if the bug is caused by the VMM then why would you "fix" it in the guest kernel.
The "bug" is that if VMM moves a hardware time counter (tsc or anything else) forward by large enough value in one go, then the guest kernel will supposingly have an overflow in the time code. I don't consider this to be a buggy VMM behavior, but rather a kernel bug that should be fixed (if this bug actually exists)
Well, that's debatable. The kernel has a safeguard in place for each clocksource which calculates the maximum time before an update needs to take place. That limit comes from:
1) Hardware counter wraparound time

2) Math limitation
#1 is a non-issue on TSC, but it is on pm-timer, hpet and lots of other non-x86 devices
#2: The overflow surely can happen if you're out for long enough. For TSC it's ~800/f [seconds, with f the TSC frequency in GHz], i.e. 200 seconds for a 4GHz TSC.
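(For a rough feel of the #2 limit — ballpark numbers, not what any particular kernel computes: with ns = (cycles * mult) >> shift, the product must fit in 64 bits, which bounds how long a clocksource may go without an update:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	double tsc_hz = 4e9;			/* 4 GHz TSC */
	unsigned int shift = 26;		/* typical order of magnitude */
	double mult = 1e9 / tsc_hz * (double)(1ULL << shift);
	double max_cycles = 18446744073709551615.0 / mult;	/* 2^64 - 1 */

	/* Prints a few hundred seconds, the same ballpark as ~800/f. */
	printf("max deferment: ~%.0f seconds\n", max_cycles / tsc_hz);
	return 0;
}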
Purely in theory this can even happen on real hardware if, for example, an SMM handler blocks a CPU from running for a long duration, or a hardware debugging interface does, or some other hardware-transparent sleep mechanism kicks in and blocks a CPU from running. (We do handle this gracefully for S3/S4.)
We had this discussion before. People got upset that the stuff didn't work when they resumed debugging after leaving the box at a breakpoint over the weekend. *Shrug*
If SMM goes out for lunch for > 200 seconds it's broken. End of story, really. There are bigger problems than timekeeping when that happens.
Hardware-transparent sleep mechanisms which do this behind the kernel's back, without giving it a mechanism to configure them, are pretty much like SMM: broken.
So now live migration comes a long time after timekeeping had set the limits, and just because it's virt it expects that everything works and it can just ignore these limits.
TBH. That's not any different than SMM or hard/firmware taking the machine out for lunch. It's exactly the same: It's broken.
And of course since that migration muck started _nobody_ bothered until today to talk to me about that.
It's not a kernel bug. The kernel works as designed for the purpose and the design clearly had these goals:
1) Correctness
2) Performance
3) Scalability
and for that we introduced limitations which were perfectly reasonable at the time because SMM and hardware/firmware wreckage definitely cannot be the limiting factor and for the fast wrapping stuff there is no design at all. These limitations are still reasonable because lifting them hurts performance and depending on the length has effects on correctness as well. Timekeeping is a complex problem.
It's a virt bug caused by pure ignorance of the underlying and already existing technology and the unwillingness to talk to people who actually understand it. I don't even want to know what kind of magic workarounds VMMs have dreamed up for that. I'm seriously grumpy that more than 10 years after I reported that time can be observed going backwards this is still not fixed and that has absolutely nothing to do with guest migration. Ignoring the simple and trivial requirement for timekeeping correctness in the first place and then having the chuzpah to claim that the kernel is buggy because virt decided it can do what it wants is beyond my comprehension and yet another proof for the theorem that virt creates more problems than it solves.
</rant>
The question of how it can be made to work is a different problem. I carefully said 'made to work' because you can't 'fix' it.
- It can't be fixed at the VMM side at all
- It can't be fixed for fast wrapping clock sources by just fiddling with the timekeeping and time accessor code at all.
- Even for TSC it can't be just fixed without imposing overhead on every time read including VDSO. And just fixing it for x86 and TSC does not cut it. There is a whole world outside of x86 and we are not going to impose any x86/TSC specific insanity on everybody else. We are neither going to make generic code have TSC specific hoops and loops just to deal with that.
This needs orchestration and collaboration from both the VMM and the guest kernel to make this work proper and reliably.
There are two ways to do that:
1) Suspend / resume the guest kernel
2) Have a protocol which is safe under all circumstances.
If #2 is not there then #1 is the only correct option unless the VMM can guarantee that the guest is restarted _before_ time goes south.
Doing #2 correctly is not rocket science either. The kernel has mechanisms to deal with such problems already. All it requires is to expose and utilize them.
The only requirement there is to bring the kernel into a state where no CPU can observe that the time goes backwards. The kernel has two mechanisms for that:
1) Suspend / resume. Trivial because all CPUs except the (usually) boot CPU are unplugged or in a state which does not matter
2) Switching clocksources. A runtime operation which is safe and correct utilizing stop_machine()
If you really think about it then this migration problem is nothing else than switching the underlying clocksource. The only difference is that in case of a regular clocksource switch the previous clocksource is still accessible up to the point where the switchover happens which is obviously not the case for VM migration.
But instead of switching the clocksource via stop machine we can just use the same mechanism to update the current clocksource so that the time jump does not matter and cannot be observed.
That needs a few things to be done:
VMM:
    - Disable NMI delivery to the guest
    - Inject a magic VMM-to-guest IPI which starts the operation

Guest:
    - Schedule work from the IPI
- work handles nested VMs if necessary
- work invokes stop_machine(update_guest_clocksource, NULL, NULL)
- When all vCPUs have rendezvoused, the one vCPU which actually runs update_guest_clocksource() reports to the VMM via hypercall or whatever that it reached that state, and spin-waits on a hyperpage shared between guest and VMM for the VMM to signal that it can proceed.
At that point it's 100% safe to freeze it. Probably not only from a timekeeping POV.
VMM:
    - Freeze VM and send image to destination VMM
New VMM:
- Install the image
- Setup the vCPU TSC muck
- Store necessary information in the hyperpage and then flip the bit which makes the guest waiting in update_guest_clocksource() proceed.
- Schedule the guest vCPUs
Guest:
- The one vCPU waiting in update_guest_clocksource() observes the GO bit, updates timekeeping and returns.
- All CPUs leave stop_machine() and everything is fine
- work resumes and handles nested VMs
- work tells VMM that everything is done
All the bits and pieces are there already except for the VMM/guest contract and the extra 20 lines of code in the timekeeping core.
There is one caveat vs. the NMI safe time keeper, but we have the mechanism to deal with that in place already for suspend so that's just another 5 lines of code to deal with at the core side.
Now you combine this with a proper mechanism to deal with the TSC offset as I outlined before and your problems are pretty much gone in a very clean and understandable way.
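(A guest-side sketch of that rendezvous; the hv_*() and timekeeping_apply_vmm_jump() helpers are hypothetical stand-ins for the guest/VMM contract, and stop_machine() is the only existing kernel interface used:)

#include <linux/stop_machine.h>
#include <linux/workqueue.h>

static int update_guest_clocksource(void *unused)
{
	/* All vCPUs are now held in stop_machine(); tell the VMM that
	 * the safe point has been reached (hypothetical hypercall). */
	hv_signal_frozen();

	/* Spin on the shared hyperpage until the (new) VMM flips the
	 * GO bit after installing the new TSC state. */
	while (!hv_go_bit_set())
		cpu_relax();

	/* Fold the time jump into the timekeeping core so that no CPU
	 * can observe time going backwards (hypothetical helper). */
	timekeeping_apply_vmm_jump();
	return 0;
}

/* Scheduled from the magic VMM-to-guest IPI. */
static void migration_freeze_work(struct work_struct *work)
{
	stop_machine(update_guest_clocksource, NULL, NULL);
}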
Thanks,
tglx
On 08/12/20 22:20, Thomas Gleixner wrote:
So now live migration comes a long time after timekeeping had set the limits, and just because it's virt it expects that everything works and it can just ignore these limits.
TBH. That's not any different than SMM or hard/firmware taking the machine out for lunch. It's exactly the same: It's broken.
I agree. If *live* migration stops the VM for 200 seconds, it's broken.
Sure, there's the case of snapshotting the VM over the weekend. My favorite solution would be to just put it in S3 before doing that. *Do what bare metal does* and you can't go that wrong.
In general it's userspace policy whether to keep the TSC value the same across live migration. There's pros and cons to both approaches, so KVM should provide the functionality to keep the TSC running (which the guest will see as a very long, but not extreme SMI), and this is what this series does. Maxim will change it to operate per-VM. Thanks Thomas, Oliver and everyone else for the input.
Paolo
On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
Sure, there's the case of snapshotting the VM over the weekend. My favorite solution would be to just put it in S3 before doing that. *Do what bare metal does* and you can't go that wrong.
Note though that qemu has a couple of issues with s3, and it is disabled by default in libvirt. I would be very happy to work on improving this if there is a need for that.
In general it's userspace policy whether to keep the TSC value the same across live migration. There's pros and cons to both approaches, so KVM should provide the functionality to keep the TSC running (which the guest will see as a very long, but not extreme SMI), and this is what this series does. Maxim will change it to operate per-VM. Thanks Thomas, Oliver and everyone else for the input.
I agree with that.
I still think though that we should have a discussion on the feasibility of making the kernel time code deal with large *forward* TSC jumps without crashing.
If that is indeed hard to do, or will cause performance issues, then I agree that we might indeed inform the guest of time jumps instead.
In fact kvmclock already has such a mechanism (the KVM_KVMCLOCK_CTRL ioctl, which sets the PVCLOCK_GUEST_STOPPED bit in the PV clock struct). That informs the guest that it was stopped (the guest clears the bit), and currently that makes the guest touch various watchdogs.
I think the guest uses it only when kvmclock is in use, but we can think about extending this so the guest uses it even when the bare TSC is used, and also implement whatever logic is needed to jump the guest clock forward when this bit is set.
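(From the VMM side the existing mechanism is just the following; extending it to bare-TSC guests, as suggested above, would be new behavior:)

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sets PVCLOCK_GUEST_STOPPED in the vCPU's pvclock page; the guest
 * clears the bit and touches its watchdogs. Available with
 * KVM_CAP_KVMCLOCK_CTRL. */
static int mark_vcpu_stopped(int vcpu_fd)
{
	return ioctl(vcpu_fd, KVM_KVMCLOCK_CTRL, 0);
}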
What do you think?
Best regards, Maxim Levitsky
On Dec 10, 2020, at 6:52 AM, Maxim Levitsky mlevitsk@redhat.com wrote:
Note though that qemu has a couple of issues with s3, and it is disabled by default in libvirt. I would be very happy to work on improving this if there is a need for that.
There’s also the case where someone has a VM running on a laptop and someone closes the lid. The host QEMU might not have a chance to convince the guest to enter S3.
If that is indeed hard to do, or will cause performance issues, then I agree that we might indeed inform the guest of time jumps instead.
Tglx, even without fancy shared host/guest timekeeping, could the guest kernel manage to update its timekeeping if the host sent the guest an interrupt or NMI on all CPUs synchronously on resume?
Alternatively, if we had the explicit “max TSC value that makes sense right now” in the timekeeping data, the guest would reliably notice the large jump and could at least do something intelligent about it instead of overflowing its internal calculation.
On Thu, Dec 10, 2020 at 9:16 AM Andy Lutomirski luto@amacapital.net wrote:
On Dec 10, 2020, at 6:52 AM, Maxim Levitsky mlevitsk@redhat.com wrote:
On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
On 08/12/20 22:20, Thomas Gleixner wrote: So now life migration comes a long time after timekeeping had set the limits and just because it's virt it expects that everything works and it just can ignore these limits.
TBH. That's not any different than SMM or hard/firmware taking the machine out for lunch. It's exactly the same: It's broken.
I agree. If *live* migration stops the VM for 200 seconds, it's broken.
Sure, there's the case of snapshotting the VM over the weekend. My favorite solution would be to just put it in S3 before doing that. *Do what bare metal does* and you can't go that wrong.
Note though that qemu has a couple of issues with s3, and it is disabled by default in libvirt. I would be very happy to work on improving this if there is a need for that.
There’s also the case where someone has a VM running on a laptop and someone closes the lid. The host QEMU might not have a chance to convince the guest to enter S3.
In general it's userspace policy whether to keep the TSC value the same across live migration. There's pros and cons to both approaches, so KVM should provide the functionality to keep the TSC running (which the guest will see as a very long, but not extreme SMI), and this is what this series does. Maxim will change it to operate per-VM. Thanks Thomas, Oliver and everyone else for the input.
So, to be clear, this per-VM ioctl will work something like the following:
static u64 kvm_read_tsc_base(struct kvm *kvm, u64 host_tsc)
{
	return kvm_scale_tsc(kvm, host_tsc) + kvm->host_tsc_offset;
}

case KVM_GET_TSC_BASE: {
	struct kvm_tsc_base base = {
		.flags = KVM_TSC_BASE_TIMESTAMP_VALID,
	};
	u64 host_tsc;

	kvm_get_walltime(&base.nsec, &host_tsc);
	base.tsc = kvm_read_tsc_base(kvm, host_tsc);

	copy_to_user(...);

	[...]
}

case KVM_SET_TSC_BASE: {
	struct kvm_tsc_base base;
	u64 host_tsc, nsec;
	s64 delta = 0;

	copy_from_user(...);

	kvm_get_walltime(&nsec, &host_tsc);
	delta += base.tsc - kvm_read_tsc_base(kvm, host_tsc);

	if (base.flags & KVM_TSC_BASE_TIMESTAMP_VALID) {
		s64 delta_nsec = nsec - base.nsec;

		if (delta_nsec >= 0)
			delta += nsec_to_cycles(kvm, delta_nsec);
		else
			delta -= nsec_to_cycles(kvm, -delta_nsec);
	}

	kvm->host_tsc_offset += delta;
	/* plumb host_tsc_offset through to each vcpu */
}
However, I don't believe we can assume the guest's TSCs to be synchronized, even if sane guests will never touch them. In this case, I think a per-vCPU ioctl is still warranted, allowing userspace to get at the guest CPU adjust component of Thomas' equation below (paraphrased):
TSC guest CPU = host tsc base + guest base offset + guest CPU adjust
Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be problematic for a couple reasons. First, depending on the guest's CPUID the TSC_ADJUST MSR may not even be available, meaning that the guest could've used IA32_TSC to adjust the TSC (eww). Second, userspace replaying writes to IA32_TSC (in the case IA32_TSC_ADJUST doesn't exist for the guest) seems _very_ unlikely to work given all the magic handling that KVM does for writes to it.
Is this roughly where we are or have I entirely missed the mark? :-)
-- Thanks, Oliver
On 10/12/20 18:59, Oliver Upton wrote:
However, I don't believe we can assume the guest's TSCs to be synchronized, even if sane guests will never touch them. In this case, I think a per-vCPU ioctl is still warranted, allowing userspace to get at the guest CPU adjust component of Thomas' equation below (paraphrased):
TSC guest CPU = host tsc base + guest base offset + guest CPU adjust
Right now that would be:
- KVM_GET_TSC_STATE (vm) returns host tsc base + guest base offset (plus the associated time)
- KVM_GET_MSR *without* KVM_X86_QUIRK_TSC_HOST_ACCESS for guest CPU adjust
and the corresponding SET ioctls. What am *I* missing?
Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be problematic for a couple reasons. First, depending on the guest's CPUID the TSC_ADJUST MSR may not even be available, meaning that the guest could've used IA32_TSC to adjust the TSC (eww).
Indeed, the host should always be able to read/write IA32_TSC and IA32_TSC_ADJUST.
Thanks,
Paolo
On Thu, Dec 10, 2020 at 12:05 PM Paolo Bonzini pbonzini@redhat.com wrote:
Indeed, the host should always be able to read/write IA32_TSC and IA32_TSC_ADJUST.
So long as it is guaranteed that guest manipulations of IA32_TSC are reflected in IA32_TSC_ADJUST even if it isn't in the guest's CPUID, then this seems OK. I think having clear documentation on this subject is also necessary, as we're going to rely on the combination of KVM_{GET,SET}_TSC_STATE, disabling KVM_X86_QUIRK_TSC_HOST_ACCESS, and userspace reading/writing a possibly hidden MSR to pull this off right.
-- Thanks, Oliver
Thanks,
Paolo
Second, userspace replaying writes to IA32_TSC (in the case IA32_TSC_ADJUST doesn't exist for the guest) seems _very_ unlikely to work given all the magic handling that KVM does for writes to it.
Is this roughly where we are or have I entirely missed the mark? :-)
Andy,
On Thu, Dec 10 2020 at 07:16, Andy Lutomirski wrote:
On Dec 10, 2020, at 6:52 AM, Maxim Levitsky mlevitsk@redhat.com wrote: On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
On 08/12/20 22:20, Thomas Gleixner wrote: So now live migration comes a long time after timekeeping had set the limits and just because it's virt it expects that everything works and it can just ignore these limits.
TBH. That's not any different than SMM or hardware/firmware taking the machine out for lunch. It's exactly the same: It's broken.
I agree. If *live* migration stops the VM for 200 seconds, it's broken.
I'm glad we are on the same page here.
Sure, there's the case of snapshotting the VM over the weekend. My favorite solution would be to just put it in S3 before doing that. *Do what bare metal does* and you can't go that wrong.
:)
Note though that qemu has a couple of issues with S3, and it is disabled by default in libvirt. I would be very happy to work on improving this if there is a need for that.
There’s also the case where someone has a VM running on a laptop and someone closes the lid. The host QEMU might not have a chance to convince the guest to enter S3.
But the host kernel can do something sensible before going off into lala land. It knows that it is about to do that and it knows that there are guests running.
I still think though that we should have a discussion on feasibility of making the kernel time code deal with large *forward* tsc jumps without crashing.
I'm not opposed against that as I said before.
If that is indeed hard to do, or will cause performance issues, then I agree that we might indeed inform the guest of time jumps instead.
Tglx, even without fancy shared host/guest timekeeping, could the guest kernel manage to update its timekeeping if the host sent the guest an interrupt or NMI on all CPUs synchronously on resume?
Telling it before it takes a nap is simpler and does not require an NMI, which is horrible anyway because we can't do much in the NMI, and scheduling irq_work from NMI does not help either because the guest could be in the middle of ... timekeeping. See below.
Alternatively, if we had the explicit “max TSC value that makes sense right now” in the timekeeping data, the guest would reliably notice the large jump and could at least do something intelligent about it instead of overflowing its internal calculation.
Yes. We can do that and we should do that for robustness sake.
But, there is more than the robustness problem on the reader side, which is trivial as we discussed in the other part of this thread already.
There is also the problem on the timekeeping core in various aspects which need a very close look.
So there are various things to solve:
1) Readerside delta limit
Trivial to provide and trivial to implement for the VDSO because the VDSO can just loop forever.
Not so trivial for kernel side usage due to the fact that being caught in a read can prevent timekeeping from being updated.
Hint: NOHZ entry path. That would simply livelock and never reach the timekeeping code. Also any interrupt disabled region can cause that.
I looked into using 128 bit math as well, but that only works for wide clock sources like TSC and needs to be conditional on 64bit as 32bit would really suffer badly in the hotpath and even on 64bit it's measurable.
So we could keep 64bit math, use the limit and if the delta is larger than the limit take a slowpath which does wider math (a sketch of that split follows below).
But that still needs thoughts about clocksources with smaller counterwidth and therefore a fast wraparound time.
There is another issue with larger deltas. The extrapolation might be off and then cause other side effects like observable time going backwards etc.
That has to be analyzed together with the writer side because that's where we need to ensure the continuity/monotonicity etc.
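A minimal userspace-style sketch of that fast/slow split, assuming a compiler that provides __int128; names and types here are illustrative, not the actual timekeeping code:

#include <stdint.h>

/*
 * 'mult' and 'shift' are the usual clocksource conversion factors;
 * 'max_cycles' is the largest delta for which delta * mult still fits
 * into 64 bit.
 */
static inline uint64_t cyc_to_ns(uint64_t delta, uint32_t mult,
				 uint32_t shift, uint64_t max_cycles)
{
	if (delta <= max_cycles)
		return (delta * mult) >> shift;		/* hot path, 64-bit */

	/* slowpath: wider math, taken only for abnormally large deltas */
	return (uint64_t)(((unsigned __int128)delta * mult) >> shift);
}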
2) Writer side issues
The core timekeeping code is pretty robust against large deltas already, but there are some limitations nevertheless and it was obviously not designed to be taken out in the middle of the updates. Haven't wrapped my head around that yet.
But there is more than the timekeeper update, there are other things like NTP, PTP and consumers of the more raw timekeeping internals which might get surprised by large deltas and run into similar problems because they were not designed to deal with that.
So yes, it can and should be done, but it's not a project for friday afternoon and it's going to be hard to backport. I know that distro people do not care because another 500 patches on their frankenkernels are just noise, but it leaves everybody else out in the dark and I have zero interest to proliferate that.
I'm still convinced that a notification about 'we take a nap' will be more robust, less complex and more trivial to backport.
Both live migration and suspend on the host know it upfront, which means that not using this knowledge and instead trying to cure the symptom is violating basic engineering principles and TBH outright stupid.
As I said before we have most of the bits and pieces in place and I'm sure we can come up with an even simpler solution as the one I outlined before and once that is solved (or in parallel) make the time keeping more robust.
Thanks,
tglx
On Thu, Dec 10, 2020 at 1:25 PM Thomas Gleixner tglx@linutronix.de wrote:
Andy,
I'm still convinced that a notification about 'we take a nap' will be more robust, less complex and more trivial to backport.
What do you have in mind? Suppose the host kernel sends the guest an interrupt on all vCPUs saying "I'm about to take a nap". What happens if the guest is busy with IRQs off for a little bit? Does the host guarantee the guest a certain amount of time to try to get the interrupt delivered before allowing the host to enter S3? How about if the host wants to reboot for a security fix -- how long is a guest allowed to delay the process?
I'm sure this can all be made to work 99% of time, but I'm a bit concerned about that last 1%.
--Andy
On Thu, Dec 10 2020 at 14:01, Andy Lutomirski wrote:
On Thu, Dec 10, 2020 at 1:25 PM Thomas Gleixner tglx@linutronix.de wrote:
I'm still convinced that a notification about 'we take a nap' will be more robust, less complex and more trivial to backport.
What do you have in mind? Suppose the host kernel sends the guest an interrupt on all vCPUs saying "I'm about to take a nap". What happens if the guest is busy with IRQs off for a little bit? Does the host guarantee the guest a certain amount of time to try to get the interrupt delivered before allowing the host to enter S3? How about if the host wants to reboot for a security fix -- how long is a guest allowed to delay the process?
I'm sure this can all be made to work 99% of time, but I'm a bit concerned about that last 1%.
Seriously?
If the guest has interrupts disabled for ages, i.e. it went out for lunch on its own, then surely the hypervisor can just pull the plug and wreck it. It's like you hit the reset button or pull the power plug of the machine which is not responding anymore.
Reboot waits already today for guests to shut down/hibernate/suspend or whatever they are supposed to do. systemd sits there and waits for minutes until it decides to kill them. Just crash a guest kernel and forget to reset or force power off the guest before you reboot the host. Twiddle thumbs for a while and watch the incomprehensible time display.
If your security fix reboot is so urgent that it can't wait then just pull the plug and be done with it, i.e. kill the guest which makes it start from a known state which is a gazillion times better than bringing it into a state which it can't handle anymore.
Again, that's not any different than hitting the reset button on the host or pulling and reinserting the host powerplug which you would do anyway in an emergency case.
Can we please focus on real problems instead of making up new ones?
Correctness of time is a real problem despite the belief of virt folks that it can be ignored or duct taped to death.
Thanks,
tglx
On Dec 10, 2020, at 2:28 PM, Thomas Gleixner tglx@linutronix.de wrote:
On Thu, Dec 10 2020 at 14:01, Andy Lutomirski wrote:
On Thu, Dec 10, 2020 at 1:25 PM Thomas Gleixner tglx@linutronix.de wrote: I'm still convinced that a notification about 'we take a nap' will be more robust, less complex and more trivial to backport.
What do you have in mind? Suppose the host kernel sends the guest an interrupt on all vCPUs saying "I'm about to take a nap". What happens if the guest is busy with IRQs off for a little bit? Does the host guarantee the guest a certain amount of time to try to get the interrupt delivered before allowing the host to enter S3? How about if the host wants to reboot for a security fix -- how long is a guest allowed to delay the process?
I'm sure this can all be made to work 99% of time, but I'm a bit concerned about that last 1%.
Seriously?
If the guest has interrupts disabled for ages, i.e. it went out for lunch on its own, then surely the hypervisor can just pull the plug and wreck it. It's like you hit the reset button or pull the power plug of the machine which is not responding anymore.
Reboot waits already today for guests to shut down/hibernate/suspend or whatever they are supposed to do. systemd sits there and waits for minutes until it decides to kill them. Just crash a guest kernel and forget to reset or force power off the guest before you reboot the host. Twiddle thumbs for a while and watch the incomprehensible time display.
If your security fix reboot is so urgent that it can't wait then just pull the plug and be done with it, i.e. kill the guest which makes it start from a known state which is a gazillion times better than bringing it into a state which it can't handle anymore.
Again, that's not any different than hitting the reset button on the host or pulling and reinserting the host powerplug which you would do anyway in an emergency case.
Can we please focus on real problems instead of making up new ones?
Correctness of time is a real problem despite the belief of virt folks that it can be ignored or duct taped to death.
I’m fine with this as long as it’s intentional. If we say “guest timekeeping across host suspend is correct because we notify the guest”, then we have a hole. But if we say “the host will try to notify the guest, and if the guest is out to lunch then the host reserves the right to suspend without waiting, and the guest should deal with this”, then okay.
On Thu, Dec 10 2020 at 15:19, Andy Lutomirski wrote:
On Dec 10, 2020, at 2:28 PM, Thomas Gleixner tglx@linutronix.de wrote: Can we please focus on real problems instead of making up new ones?
Correctness of time is a real problem despite the belief of virt folks that it can be ignored or duct taped to death.
I’m fine with this as long as it’s intentional. If we say “guest timekeeping across host suspend is correct because we notify the guest”, then we have a hole. But if we say “the host will try to notify the guest, and if the guest is out to lunch then the host reserves the right to suspend without waiting, and the guest should deal with this”, then okay.
Yes of course this would be intentional and documented behaviour, which is an infinite improvement over the current situation.
Thanks,
tglx
On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
+This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags, +KVM will adjust the guest TSC value by the time that passed since the moment +CLOCK_REALTIME timestamp was saved in the struct and current value of +CLOCK_REALTIME, and set the guest's TSC to the new value.
This introduces the wraparound bug in Linux timekeeping, doesn't it?
Which bug?
max_cycles overflow. Sent a message to Maxim describing it.
It does. Could you prepare a reproducer for this bug so I get a better idea of what you are talking about?
I assume you need a very long (like days' worth) jump to trigger this bug, and for such a case we can either work around it in qemu / kernel or fix it in the guest kernel, and I strongly prefer the latter.
Thomas, what do you think about it?
For one I have no idea which bug you are talking about and if the bug is caused by the VMM then why would you "fix" it in the guest kernel.
1) Stop guest, save TSC value of cpu-0 = V.
2) Wait for some amount of time = W.
3) Start guest, load TSC value with V+W.
Can cause an overflow on Linux timekeeping.
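To see roughly when the overflow bites, a back-of-the-envelope sketch; the TSC rate and the max_cycles value below are assumptions for illustration, not numbers from any particular machine:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t tsc_hz      = 2500000000ULL;		/* assumed 2.5 GHz TSC */
	uint64_t max_cycles  = 0x0000ffffffffffffULL;	/* assumed safe delta  */
	uint64_t paused_secs = 2 * 24 * 3600;		/* W = two days        */

	uint64_t delta = paused_secs * tsc_hz;		/* cycles added to V   */

	printf("delta=%llu max_cycles=%llu -> %s\n",
	       (unsigned long long)delta, (unsigned long long)max_cycles,
	       delta > max_cycles ? "overflow danger" : "within limits");
	return 0;
}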
Aside of that I think I made it pretty clear what the right thing to do is.
Sure: the notion of a "unique TSC offset" already exists (it is detected by write TSC logic, and not explicit in the interface, though).
But AFAIK it works pretty well.
Exposing a single TSC value on the interface level seems alright to me...
On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
+This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags, +KVM will adjust the guest TSC value by the time that passed since the moment +CLOCK_REALTIME timestamp was saved in the struct and current value of +CLOCK_REALTIME, and set the guest's TSC to the new value.
This introduces the wraparound bug in Linux timekeeping, doesn't it?
Which bug?
max_cycles overflow. Sent a message to Maxim describing it.
Truly helpful. Why the hell did you not talk to me when you ran into that the first time?
For one I have no idea which bug you are talking about and if the bug is caused by the VMM then why would you "fix" it in the guest kernel.
- Stop guest, save TSC value of cpu-0 = V.
- Wait for some amount of time = W.
- Start guest, load TSC value with V+W.
Can cause an overflow on Linux timekeeping.
Yes, because you violate the basic assumption which Linux timekeeping makes. See the other mail in this thread.
Thanks,
tglx
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
+This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags, +KVM will adjust the guest TSC value by the time that passed since the moment +CLOCK_REALTIME timestamp was saved in the struct and current value of +CLOCK_REALTIME, and set the guest's TSC to the new value.
This introduces the wraparound bug in Linux timekeeping, doesn't it?
Which bug?
max_cycles overflow. Sent a message to Maxim describing it.
Truly helpful. Why the hell did you not talk to me when you ran into that the first time?
Because
1) Users wanted CLOCK_BOOTTIME to stop counting while the VM is paused (so we wanted to stop guest clock when VM is paused anyway).
2) The solution to inject NMIs to the guest seemed overly complicated.
For one I have no idea which bug you are talking about and if the bug is caused by the VMM then why would you "fix" it in the guest kernel.
- Stop guest, save TSC value of cpu-0 = V.
- Wait for some amount of time = W.
- Start guest, load TSC value with V+W.
Can cause an overflow on Linux timekeeping.
Yes, because you violate the basic assumption which Linux timekeeping makes. See the other mail in this thread.
Thanks,
tglx
Marcelo,
On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
max_cycles overflow. Sent a message to Maxim describing it.
Truly helpful. Why the hell did you not talk to me when you ran into that the first time?
Because
- Users wanted CLOCK_BOOTTIME to stop counting while the VM
is paused (so we wanted to stop guest clock when VM is paused anyway).
How is that supposed to work w/o the guest kernel's help if you have to keep clock realtime up to date?
- The solution to inject NMIs to the guest seemed overly
complicated.
Why do you need NMIs?
All you need is a way to communicate to the guest that it should prepare for clock madness to happen. Whether that's an IPI or a bit in a hyperpage which gets checked during the update of the guest timekeeping does not matter at all.
But you certainly do not need an NMI because there is nothing useful you can do within an NMI.
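For illustration only, a sketch of that "bit in a hyperpage" idea; every name here is invented and the helper is hypothetical:

#include <stdint.h>

/* hypothetical shared page layout */
struct shared_time_page {
	volatile uint32_t clock_disruption;	/* host sets this before pausing */
};

extern void resync_clock_from_host(void);	/* hypothetical helper */

static void guest_timekeeping_update(struct shared_time_page *sp)
{
	if (sp->clock_disruption) {
		sp->clock_disruption = 0;
		/* don't extrapolate across the pause; resync instead */
		resync_clock_from_host();
	}
	/* ... regular periodic timekeeping update continues here ... */
}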
Thanks,
tglx
On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
Marcelo,
On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
max_cycles overflow. Sent a message to Maxim describing it.
Truly helpful. Why the hell did you not talk to me when you ran into that the first time?
Because
- Users wanted CLOCK_BOOTTIME to stop counting while the VM
is paused (so we wanted to stop guest clock when VM is paused anyway).
How is that supposed to work w/o the guest kernel's help if you have to keep clock realtime up to date?
Upon VM resume, we notify NTP daemon in the guest to sync realtime clock.
- The solution to inject NMIs to the guest seemed overly
complicated.
Why do you need NMIs?
All you need is a way to communicate to the guest that it should prepare for clock madness to happen. Whether that's an IPI or a bit in a hyperpage which gets checked during the update of the guest timekeeping does not matter at all.
But you certainly do not need an NMI because there is nothing useful you can do within an NMI.
Thanks,
tglx
On Thu, Dec 10 2020 at 12:26, Marcelo Tosatti wrote:
On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
Marcelo,
On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
max_cycles overflow. Sent a message to Maxim describing it.
Truly helpful. Why the hell did you not talk to me when you ran into that the first time?
Because
- Users wanted CLOCK_BOOTTIME to stop counting while the VM
is paused (so we wanted to stop guest clock when VM is paused anyway).
How is that supposed to work w/o the guest kernel's help if you have to keep clock realtime up to date?
Upon VM resume, we notify NTP daemon in the guest to sync realtime clock.
Brilliant. What happens if there is no NTP daemon? What happens if the NTP daemon is not part of the virt orchestration magic and cannot be notified? Then it will notice the time jump only after the next update interval.
What about correctness?
ALL CLOCK_* stop and resume when the VM is resumed at the point where they stopped.
So up to the point where NTP catches up and corrects clock realtime and TAI other processes can observe that time jumped in the outside world, e.g. via a network packet or whatever, but there is no reason why time should have jumped outside vs. the local one.
You really all live in a separate universe, creating your own rules for how things which other people work hard to get correct can be screwed over.
Of course this all is nowhere documented in detail. At least a quick search with about 10 different keyword combinations revealed absolutely nothing.
This features first, correctness later frenzy is insane and it better stops now before you pile even more crap on the existing steaming pile of insanities.
Thanks,
tglx
On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
On Thu, Dec 10 2020 at 12:26, Marcelo Tosatti wrote:
On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
Marcelo,
On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
max_cycles overflow. Sent a message to Maxim describing it.
Truly helpful. Why the hell did you not talk to me when you ran into that the first time?
Because
- Users wanted CLOCK_BOOTTIME to stop counting while the VM
is paused (so we wanted to stop guest clock when VM is paused anyway).
How is that supposed to work w/o the guest kernel's help if you have to keep clock realtime up to date?
Upon VM resume, we notify NTP daemon in the guest to sync realtime clock.
Brilliant. What happens if there is no NTP daemon? What happens if the NTP daemon is not part of the virt orchestration magic and cannot be notified? Then it will notice the time jump only after the next update interval.
What about correctness?
ALL CLOCK_* stop and resume when the VM is resumed at the point where they stopped.
So up to the point where NTP catches up and corrects clock realtime and TAI other processes can observe that time jumped in the outside world, e.g. via a network packet or whatever, but there is no reason why time should have jumped outside vs. the local one.
You really all live in a separate universe, creating your own rules for how things which other people work hard to get correct can be screwed over.
1. T = read timestamp.
2. migrate (VM stops for a certain period).
3. use timestamp T.
Of course this all is nowhere documented in detail. At least a quick search with about 10 different keyword combinations revealed absolutely nothing.
This features first, correctness later frenzy is insane and it better stops now before you pile even more crap on the existing steaming pile of insanities.
Sure.
On Thu, Dec 10 2020 at 21:27, Marcelo Tosatti wrote:
On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
You really all live in a separate universe, creating your own rules for how things which other people work hard to get correct can be screwed over.
- T = read timestamp.
- migrate (VM stops for a certain period).
- use timestamp T.
This is exactly the problem. Time stops at pause and continues where it stopped on resume.
But CLOCK_REALTIME and CLOCK_TAI advanced in reality. So up to the point where NTP fixes this - if there is NTP at all - the guest CLOCK_REALTIME and CLOCK_TAI are off by tpause.
Now the application gets a packet from the outside world with a CLOCK_REALTIME timestamp which is suddenly ahead of the value it reads from clock_gettime(CLOCK_REALTIME) by tpause. So what is it supposed to do with that? Make stupid assumptions that the other end screwed up timekeeping, throw an error that the system it is running on screwed up timekeeping? And a second later, when NTP has caught up, it gets the next surprise because the system's CLOCK_REALTIME jumped forward unexpectedly, or, if there is no NTP, it's confused forever.
How can you even assume that this is correct?
It is exactly the same problem as we had many years ago with hardware clocks that suddenly stopped ticking, which caused quite some stuff to go belly up.
In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced (with a certain degree of accuracy) to compensate for the sleep time, so the other end of a communication is at least in the same ballpark, but not 50 seconds off.
This features first, correctness later frenzy is insane and it better stops now before you pile even more crap on the existing steaming pile of insanities.
Sure.
I wish that would be true. OS people - you should know that - are fighting forever with hardware people over feature madness and the attitude of 'we can fix that in software', which often enough turns out to be wrong.
Now sadly enough people who suffered from that madness work on virtualization and instead of trying to avoid the same problem they go off and make it even worse.
It's the same problem again as with hardware people. Not talking to the other people _before_ making uninformed assumptions and decisions.
We did it that way because big customer asked for it is not a justification for inflicting this on everybody else and thereby violating correctness. Works for me and my big customer is not a proof of correctness either.
It's another proof that this industry just "works" by chance.
Thanks,
tglx
On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote:
On Thu, Dec 10 2020 at 21:27, Marcelo Tosatti wrote:
On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
You really all live in a separate universe, creating your own rules for how things which other people work hard to get correct can be screwed over.
- T = read timestamp.
- migrate (VM stops for a certain period).
- use timestamp T.
This is exactly the problem. Time stops at pause and continues where it stopped on resume.
But CLOCK_REALTIME and CLOCK_TAI advanced in reality. So up to the point where NTP fixes this - if there is NTP at all - the guest CLOCK_REALTIME and CLOCK_TAI are off by tpause.
Now the application gets a packet from the outside world with a CLOCK_REALTIME timestamp which is suddenly ahead of the value it reads from clock_gettime(CLOCK_REALTIME) by tpause. So what is it supposed to do with that? Make stupid assumptions that the other end screwed up timekeeping, throw an error that the system it is running on screwed up timekeeping? And a second later, when NTP has caught up, it gets the next surprise because the system's CLOCK_REALTIME jumped forward unexpectedly, or, if there is no NTP, it's confused forever.
This can happen even with a "perfect" solution that syncs time instantly on the migration destination. See steps 1,2,3.
Unless you notify applications to invalidate their time reads, i can't see a way to fix this.
Therefore if you use VM migration in the first place, a certain amount of timestamp accuracy error must be tolerated.
How can you even assume that this is correct?
As noted above, even without a window of unsynchronized time (due to delay for NTP to sync time), time reads can be stale.
It is exactly the same problem as we had many years ago with hardware clocks that suddenly stopped ticking, which caused quite some stuff to go belly up.
Customers complained when it was 5 seconds off, now it's 0.1ms (and people seem happy).
In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced (with a certain degree of accuracy) to compensate for the sleep time, so the other end of a communication is at least in the same ballpark, but not 50 seconds off.
It's 100ms off with migration, and can be reduced further (customers complained about 5 seconds but seem happy with 0.1ms).
This features first, correctness later frenzy is insane and it better stops now before you pile even more crap on the existing steaming pile of insanities.
Sure.
I wish that would be true. OS people - you should know that - are fighting forever with hardware people over feature madness and the attitude of 'we can fix that in software', which often enough turns out to be wrong.
Now sadly enough people who suffered from that madness work on virtualization and instead of trying to avoid the same problem they go off and make it even worse.
So you think its important to reduce the 100ms offset?
It's the same problem again as with hardware people. Not talking to the other people _before_ making uninformed assumptions and decisions.
We did it that way because big customer asked for it is not a justification for inflicting this on everybody else and thereby violating correctness. Works for me and my big customer is not a proof of correctness either.
It's another proof that this industry just "works" by chance.
Thanks,
tglx
OK, makes sense, then reducing the 0.1ms window even further is a useful thing to do. What would be an acceptable CLOCK_REALTIME accuracy error on migration?
On Fri, Dec 11 2020 at 11:18, Marcelo Tosatti wrote:
On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote: Unless you notify applications to invalidate their time reads, i can't see a way to fix this.
This is just wrong. Suspend/resume handles that fine and the system is guaranteed to come back with time which is very close to the reality.
And for suspend/resume everything from kernel to userspace can have a notification before suspend and post resume. So applications or those parts of the kernel which are e.g. time sensitive can prepare upfront for the disruption and mop up on resume.
Therefore if you use VM migration in the first place, a certain amount of timestamp accuracy error must be tolerated.
That's just because it was never designed in the right way. And you simply declared that all applications have to deal with that.
Again, where is this documented? VMs are subject to migration whether the customer who pays for it wants it or not. None of the virt tool docs mentions that pausing a VM for a long time makes timekeeping go south.
I still have no sensible explanation WHY time should not advance accross a migration. All you told me is that customers complained. Which customers? The ones running the hosts or the ones paying for the VM?
It's all just decided by some folks to "fix" a problem with the pause/migration mechanism they designed instead of fixing the design fail.
How can you even assume that this is correct?
As noted above, even without a window of unsynchronized time (due to delay for NTP to sync time), time reads can be stale.
So with suspend/resume we have:
app:
	t = clock_gettime()
			<---------------- tsuspend
			<---------------- tresume

	So t is now behind reality by tresume - tsuspend

	packet -> check timestamp .... ooops

	recheck t = clock_gettime() and t and timestamp are in the
	same ballpark again
Now with your thing:
app:
	t = clock_gettime()
			<---------------- tpause
			<---------------- tresume

	So t is now behind reality by tresume - tpause

	packet -> check timestamp .... ooops

	recheck t = clock_gettime() and t and timestamp are still
	apart by ~ (tresume - tpause)
this persists until NTP kicks in, if and only if NTP is running.
Can you spot the difference?
It is exactly the same problem as we had many years ago with hardware clocks that suddenly stopped ticking, which caused quite some stuff to go belly up.
Customers complained when it was 5 seconds off, now it's 0.1ms (and people seem happy).
And because customers complained you decided to create a scenario which is completely different from all other scenarios and which, from a timekeeping POV, makes no sense at all.
In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced (with a certain degree of accuracy) to compensate for the sleep time, so the other end of a communication is at least in the same ballpark, but not 50 seconds off.
It's 100ms off with migration, and can be reduced further (customers complained about 5 seconds but seem happy with 0.1ms).
What is 100ms? Guaranteed maximum migration time?
CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and this state persists up to the point where NTP corrects it with a time jump.
So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms it's off by 5 seconds.
CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.
OK, makes sense, then reducing the 0.1ms window even further is a useful thing to do. What would be an acceptable CLOCK_REALTIME accuracy error on migration?
Can you please explain how you can achieve 0.1ms accuracy when migration time is more than that and guest TSC is just restored to the value at which it was stopped?
Then ALL clocks including CLOCK_REALTIME and CLOCK_TAI continue from the point at which they were stopped. Ergo:
t(CLOCK_REALTIME) = t(REALITY) - t(STOPPED)
CLOCK_REALTIME and CLOCK_TAI are global clocks and they have rules which have to be respected in order to make stuff work.
CLOCK_MONOTONIC and CLOCK_BOOTTIME are local to a system (host, guests). So manipulating them is a completely different story albeit the kernel has explicit guarantees for the relationship between CLOCK_MONOTONIC, CLOCK_BOOTTIME and CLOCK_REALTIME/TAI
If you could guarantee t(STOPPED) < 100ms and therefore
t(REALITY) - t(CLOCK_REALTIME) < 100ms
under _all_ circumstances then we would not even have that discussion.
Even < 1000ms might be acceptable. That's the margin of error which is also happening accross bare metal suspend/resume in the case that the sleep time has to be read from the RTC when TSC stops accross suspend.
But suspend/resume is still substantially different because everything from kernel to userspace can have a notification before suspend. So applications or those parts of the kernel which are e.g. time sensitive can prepare upfront for the disruption. In your case, not at all. They just have to cope with the fallout. Brilliant.
Your design or the lack of it just decides that everything has to cope with what you decided is the right thing to do and for how long it takes.
Yes it "works" by some definition of works, but that does not mean it works correctly.
Thanks,
tglx
On 11/12/20 22:04, Thomas Gleixner wrote:
It's 100ms off with migration, and can be reduced further (customers complained about 5 seconds but seem happy with 0.1ms).
What is 100ms? Guaranteed maximum migration time?
I suppose it's the length between the time from KVM_GET_CLOCK and KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC). But the VM is paused for much longer, the sequence for the non-live part of the migration (aka brownout) is as follows:
pause
finish sending RAM        receive RAM              ~1 sec
send paused-VM state      finish receiving RAM     \
                          receive paused-VM state   ) 0.1 sec
                          restart                  /
The nanosecond and TSC times are sent as part of the paused-VM state at the very end of the live migration process.
So it's still true that the time advances during live migration brownout; 0.1 seconds is just the final part of the live migration process. But for _live_ migration there is no need to design things according to "people are happy if their clock is off by 0.1 seconds only". Again, save-to-disk, reverse debugging and the like are a different story, which is why KVM should delegate policy to userspace (while documenting how to do it right).
Paolo
CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and this state persists up to the point where NTP corrects it with a time jump.
So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms it's off by 5 seconds.
CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.
On Fri, Dec 11 2020 at 22:59, Paolo Bonzini wrote:
On 11/12/20 22:04, Thomas Gleixner wrote: The nanosecond and TSC times are sent as part of the paused-VM state at the very end of the live migration process.
So it's still true that the time advances during live migration brownout; 0.1 seconds is just the final part of the live migration process. But for _live_ migration there is no need to design things according to "people are happy if their clock is off by 0.1 seconds only".
The problem is not the 0.1 second jump of the TSC. That's just a minor nuisance. The problem is the incorrectness of CLOCK_REALTIME after this operation which is far larger than 0.1s and this needs to be fixed.
Again, save-to-disk, reverse debugging and the like are a different story, which is why KVM should delegate policy to userspace (while documenting how to do it right).
One ready to use option would be suspend to idle. It's fast and readily available including all the notifications and kernel side handling of time and whatever.
Thanks,
tglx
On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
On 11/12/20 22:04, Thomas Gleixner wrote:
It's 100ms off with migration, and can be reduced further (customers complained about 5 seconds but seem happy with 0.1ms).
What is 100ms? Guaranteed maximum migration time?
I suppose it's the length between the time from KVM_GET_CLOCK and KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC). But the VM is paused for much longer, the sequence for the non-live part of the migration (aka brownout) is as follows:
pause
finish sending RAM        receive RAM              ~1 sec
send paused-VM state      finish receiving RAM     \
                          receive paused-VM state   ) 0.1 sec
                          restart                  /
The nanosecond and TSC times are sent as part of the paused-VM state at the very end of the live migration process.
So it's still true that the time advances during live migration brownout; 0.1 seconds is just the final part of the live migration process. But for _live_ migration there is no need to design things according to "people are happy if their clock is off by 0.1 seconds only".
Agree. What would be a good way to fix this?
It seems to me using CLOCK_REALTIME as in the interface Maxim is proposing is prone to differences in CLOCK_REALTIME itself.
Perhaps there is another way to measure that 0.1 sec which is independent of the clock values of the source and destination hosts (say by sending a packet once the clock stops counting).
Then on destination measure delta = clock_restart_time - packet_arrival_time and increase the clock by that amount.
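A rough sketch of that measurement, with invented names: both timestamps are taken on the destination host with the same clock, so the result does not depend on the two hosts' CLOCK_REALTIME values agreeing:

#include <stdint.h>
#include <time.h>

static int64_t ns_of(struct timespec ts)
{
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * 'arrival' is when the source's "clock stopped" packet reached us,
 * 'restart' is when the guest clock is started again; both read with
 * clock_gettime(CLOCK_MONOTONIC) on the destination host.
 */
int64_t guest_clock_advance_ns(struct timespec arrival, struct timespec restart)
{
	return ns_of(restart) - ns_of(arrival);
}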
Again, save-to-disk, reverse debugging and the like are a different story, which is why KVM should delegate policy to userspace (while documenting how to do it right).
Paolo
CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and this state persists up to the point where NTP corrects it with a time jump.
So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms it's off by 5 seconds.
CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.
On Tue, Dec 15, 2020 at 3:35 AM Marcelo Tosatti mtosatti@redhat.com wrote:
On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
On 11/12/20 22:04, Thomas Gleixner wrote:
It's 100ms off with migration, and can be reduced further (customers complained about 5 seconds but seem happy with 0.1ms).
What is 100ms? Guaranteed maximum migration time?
I suppose it's the length between the time from KVM_GET_CLOCK and KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC). But the VM is paused for much longer, the sequence for the non-live part of the migration (aka brownout) is as follows:
pause
finish sending RAM        receive RAM              ~1 sec
send paused-VM state      finish receiving RAM     \
                          receive paused-VM state   ) 0.1 sec
                          restart                  /
The nanosecond and TSC times are sent as part of the paused-VM state at the very end of the live migration process.
So it's still true that the time advances during live migration brownout; 0.1 seconds is just the final part of the live migration process. But for _live_ migration there is no need to design things according to "people are happy if their clock is off by 0.1 seconds only".
Agree. What would be a good way to fix this?
Could you implement the Hyper-V clock interface? It's much, much simpler than the kvmclock interface. It has the downside that CLOCK_BOOTTIME won't do what you want, but I'm not really convinced that's a problem, and you could come up with a minimal extension to fix that.
On Tue, Dec 15 2020 at 07:59, Marcelo Tosatti wrote:
On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
So it's still true that the time advances during live migration brownout; 0.1 seconds is just the final part of the live migration process. But for _live_ migration there is no need to design things according to "people are happy if their clock is off by 0.1 seconds only".
Agree. What would be a good way to fix this?
None of what's proposed is fixing the underlying problem of wrecking CLOCK_REALTIME.
Stop proliferating broken behaviour and please answer the questions I asked several times now:
1) Why does the TSC have to restart at the same value?
2) What is the technical argument that it is correct and acceptable to wreck CLOCK_REALTIME by doing #1?
Thanks,
tglx
On 11/12/20 01:27, Marcelo Tosatti wrote:
This features first, correctness later frenzy is insane and it better stops now before you pile even more crap on the existing steaming pile of insanities.
FWIW I agree, in fact I wish I knew exactly where to start simplifying it; the timekeeping code in KVM is perhaps the last remaining bastion that I'm afraid even to look at.
At least I know there's a couple of features that IMHO are completely useless, and that's even before you reach the ones that might "only" be obsolete. So perhaps that's where to start, together with the messiest userspace interfaces.
Paolo
On Tue, Dec 08, 2020 at 04:50:53PM +0200, Maxim Levitsky wrote:
On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
These two new ioctls allow to more precisely capture and restore guest's TSC state.
Both ioctls are meant to be used to accurately migrate guest TSC even when there is a significant downtime during the migration.
Suggested-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 15 +++++++
 3 files changed, 153 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229f..ebecfe4b414ce 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
 experience inconsistent filtering behavior on MSR accesses.
 
+4.127 KVM_GET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+::
+
+  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
+  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+
+  struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+  };
+
+flags values for ``struct kvm_tsc_state``:
+
+``KVM_TSC_STATE_TIMESTAMP_VALID``
+
+  ``nsec`` contains nanoseconds from unix epoch.
+  Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
+
+``KVM_TSC_STATE_TSC_ADJUST_VALID``
+
+  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
+
+
+This ioctl allows the user space to read the guest's IA32_TSC, IA32_TSC_ADJUST,
+and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
+epoch.
Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as a time base, but for TSC it should not be necessary.
CLOCK_REALTIME is used as an absolute time reference that should match on both computers. I could have used CLOCK_TAI instead for example.
The reference allows to account for time passed between saving and restoring the TSC as explained above.
As mentioned we don't want this due to the overflow.
Again, i think higher priority is to allow enablement of invariant TSC by default (to disable kvmclock).
+4.128 KVM_SET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+::
+This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags, +KVM will adjust the guest TSC value by the time that passed since the moment +CLOCK_REALTIME timestamp was saved in the struct and current value of +CLOCK_REALTIME, and set the guest's TSC to the new value.
This introduces the wraparound bug in Linux timekeeping, doesn't it?
It does. Could you prepare a reproducer for this bug so I get a better idea of what you are talking about?
Enable CONFIG_DEBUG_TIMEKEEPING, check what max_cycles is from the TSC clocksource:
#ifdef CONFIG_DEBUG_TIMEKEEPING
#define WARNING_FREQ (HZ*300) /* 5 minute rate-limiting */

static void timekeeping_check_update(struct timekeeper *tk, u64 offset)
{
	u64 max_cycles = tk->tkr_mono.clock->max_cycles;
	const char *name = tk->tkr_mono.clock->name;

	if (offset > max_cycles) {
		printk_deferred("WARNING: timekeeping: Cycle offset (%lld) is larger than allowed by the '%s' clock's max_cycles value (%lld): time overflow danger\n",
				offset, name, max_cycles);
		printk_deferred("         timekeeping: Your kernel is sick, but tries to cope by capping time updates\n");
	} else {
		if (offset > (max_cycles >> 1)) {
			printk_deferred("INFO: timekeeping: Cycle offset (%lld) is larger than the '%s' clock's 50%% safety margin (%lld)\n",
					offset, name, max_cycles >> 1);
			printk_deferred("      timekeeping: Your kernel is still fine, but is feeling a bit nervous\n");
		}
	}

	if (tk->underflow_seen) {
		if (jiffies - tk->last_warning > WARNING_FREQ) {
			printk_deferred("WARNING: Underflow in clocksource '%s' observed, time update ignored.\n", name);
			printk_deferred("         Please report this, consider using a different clocksource, if possible.\n");
			printk_deferred("         Your kernel is probably still fine.\n");
			tk->last_warning = jiffies;
		}
		tk->underflow_seen = 0;
	}
}
I assume you need a very long (like days' worth) jump to trigger this bug
Exactly. max_cycles worth (for kvmclock one or two days vmstop/vmstart was sufficient to trigger the bug).
and for such a case we can either work around it in qemu / kernel or fix it in the guest kernel and I strongly prefer the latter.
Well, what about older kernels? Can't fix those in the guest kernel.
Moreover:
https://patchwork.kernel.org/project/kvm/patch/20130618233825.GA19042@amt.cn...
2) Users rely on CLOCK_MONOTONIC to count run time, that is, time which OS has been in a runnable state (see CLOCK_BOOTTIME).
I think the current 100ms delta (on migration) can be reduced without checking the clock delta between source and destination hosts.
So to reiterate: the idea to pass a tuple (tsc, tsc_adjust) is good because you can fix the issues introduced by writing the values separately.
However, IMHO the patchset lacks a clear problem (or set of problems) that it's addressing.
Thomas, what do you think about it?
Best regards, Maxim Levitsky
+Otherwise KVM will set the guest TSC value to the exact value as given +in the struct.
+if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST, +then its value will be set to the given value from the struct.
+It is assumed that either both ioctls will be run on the same machine, +or that source and destination machines have synchronized clocks.
5. The kvm_run structure
========================

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..9b8a2fe3a2398 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
 
+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+	struct timespec64 ts;
+
+	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+		*walltime_ns = timespec64_to_ns(&ts);
+		return;
+	}
+
+	*host_tsc = rdtsc();
+	*walltime_ns = ktime_get_real_ns();
+}
+
 #endif
 
 /*
@@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+	case KVM_CAP_PRECISE_TSC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
 		break;
+#ifdef CONFIG_X86_64
+	case KVM_GET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		u64 host_tsc;
+
+		struct kvm_tsc_state tsc_state = {
+			.flags = KVM_TSC_STATE_TIMESTAMP_VALID
+		};
+
+		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+		}
+
+		r = -EFAULT;
+		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+		u64 host_tsc, wall_nsec;
+
+		u64 new_guest_tsc, new_guest_tsc_offset;
+
+		r = -EFAULT;
+		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+			goto out;
+
+		kvm_get_walltime(&wall_nsec, &host_tsc);
+		new_guest_tsc = tsc_state.tsc;
+
+		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
+			s64 diff = wall_nsec - tsc_state.nsec;
+			if (diff >= 0)
+				new_guest_tsc += nsec_to_cycles(vcpu, diff);
+			else
+				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
+		}
+
+		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
+
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
+
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..bf4c38fd58291 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -1169,6 +1170,16 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+#define KVM_TSC_STATE_TIMESTAMP_VALID 1
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};
+
 /* For KVM_CAP_SW_TLB */
 #define KVM_MMU_FSL_BOOKE_NOHV 0
@@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc7)
 
+/* Available with KVM_CAP_PRECISE_TSC */
+#define KVM_SET_TSC_STATE _IOW(KVMIO, 0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE _IOR(KVMIO, 0xc9, struct kvm_tsc_state)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
	/* Guest initialization commands */
-- 
2.26.2
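To make the intended flow concrete, a hedged sketch of how a VMM could use the two proposed ioctls across a migration; error handling is abbreviated, the transport between the two hosts is left abstract, and the interface is the one from the patch above, not a merged kernel API:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* source host: capture guest TSC state, stamped with CLOCK_REALTIME */
static int save_tsc(int vcpu_fd, struct kvm_tsc_state *state)
{
	return ioctl(vcpu_fd, KVM_GET_TSC_STATE, state);
}

/*
 * destination host: restore it.  Because KVM_GET_TSC_STATE always sets
 * KVM_TSC_STATE_TIMESTAMP_VALID, KVM advances the TSC by the wall-clock
 * time that passed in between, so downtime does not freeze the guest TSC.
 */
static int restore_tsc(int vcpu_fd, const struct kvm_tsc_state *state)
{
	return ioctl(vcpu_fd, KVM_SET_TSC_STATE, state);
}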
This quirk reflects the fact that we currently treat MSR_IA32_TSC and MSR_TSC_ADJUST access by the host (e.g. qemu) differently from an access by the guest.
For a host-side MSR_IA32_TSC read we currently always return the L1 TSC value, and for a host-side write we do the TSC synchronization.
For a host-side MSR_TSC_ADJUST write, we don't make the TSC 'jump' as we should for this MSR.
When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no longer needed, thus leave this enabled only with a quirk which the hypervisor can disable.
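For illustration, this is roughly how a hypervisor opts out of the legacy behaviour; it mirrors what the selftest in the next patch does, combining the existing KVM_CAP_DISABLE_QUIRKS mechanism with the quirk bit proposed here:

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int disable_tsc_host_access_quirk(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DISABLE_QUIRKS,
		.args[0] = KVM_X86_QUIRK_TSC_HOST_ACCESS,
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}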
Suggested-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/x86.c              | 19 ++++++++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 8e76d3701db3f..2a60fc6674164 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -404,6 +404,7 @@ struct kvm_sync_regs {
 #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE      (1 << 2)
 #define KVM_X86_QUIRK_OUT_7E_INC_RIP       (1 << 3)
 #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
+#define KVM_X86_QUIRK_TSC_HOST_ACCESS      (1 << 5)
 
 #define KVM_STATE_NESTED_FORMAT_VMX 0
 #define KVM_STATE_NESTED_FORMAT_SVM 1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b8a2fe3a2398..aabded17abae4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3091,7 +3091,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC_ADJUST:
 		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
-			if (!msr_info->host_initiated) {
+			if (!msr_info->host_initiated ||
+			    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
 				s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
 				adjust_tsc_offset_guest(vcpu, adj);
 			}
@@ -3118,7 +3119,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vcpu->arch.msr_ia32_power_ctl = data;
 		break;
 	case MSR_IA32_TSC:
-		if (msr_info->host_initiated) {
+		if (msr_info->host_initiated &&
+		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
 			kvm_synchronize_tsc(vcpu, data);
 		} else {
 			u64 adj = kvm_compute_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
@@ -3409,17 +3411,24 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		msr_info->data = vcpu->arch.msr_ia32_power_ctl;
 		break;
 	case MSR_IA32_TSC: {
+		u64 tsc_offset;
+
 		/*
 		 * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
 		 * even when not intercepted. AMD manual doesn't explicitly
 		 * state this but appears to behave the same.
 		 *
-		 * On userspace reads and writes, however, we unconditionally
+		 * On userspace reads and writes, when KVM_X86_QUIRK_SPECIAL_TSC_READ
+		 * is present, however, we unconditionally
 		 * return L1's TSC value to ensure backwards-compatible
 		 * behavior for migration.
 		 */
-		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
-							    vcpu->arch.tsc_offset;
+
+		if (msr_info->host_initiated &&
+		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS))
+			tsc_offset = vcpu->arch.l1_tsc_offset;
+		else
+			tsc_offset = vcpu->arch.tsc_offset;
 
 		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
 		break;
Run the test once with quirk enabled and once disabled, and adjust the expected values accordingly.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 .../selftests/kvm/x86_64/tsc_msrs_test.c | 79 ++++++++++++++++---
 1 file changed, 69 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/tsc_msrs_test.c b/tools/testing/selftests/kvm/x86_64/tsc_msrs_test.c index e357d8e222d47..3900c543a7ee1 100644 --- a/tools/testing/selftests/kvm/x86_64/tsc_msrs_test.c +++ b/tools/testing/selftests/kvm/x86_64/tsc_msrs_test.c @@ -79,8 +79,6 @@ static void run_vcpu(struct kvm_vm *vm, uint32_t vcpuid, int stage) { struct ucall uc;
- vcpu_args_set(vm, vcpuid, 1, vcpuid); - vcpu_ioctl(vm, vcpuid, KVM_RUN, NULL);
switch (get_ucall(vm, vcpuid, &uc)) { @@ -101,7 +99,7 @@ static void run_vcpu(struct kvm_vm *vm, uint32_t vcpuid, int stage) } }
-int main(void) +void run_test(bool quirk_disabled) { struct kvm_vm *vm; uint64_t val; @@ -109,6 +107,14 @@ int main(void) vm = vm_create_default(VCPU_ID, 0, guest_code);
val = 0; + if (quirk_disabled) { + struct kvm_enable_cap cap = { + .cap = KVM_CAP_DISABLE_QUIRKS, + .args[0] = KVM_X86_QUIRK_TSC_HOST_ACCESS, + }; + vm_enable_cap(vm, &cap); + } + ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC), val); ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC_ADJUST), val);
@@ -124,20 +130,67 @@ int main(void) ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC), val); ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC_ADJUST), val);
-	/*
-	 * Host: writes to MSR_IA32_TSC set the host-side offset
-	 * and therefore do not change MSR_IA32_TSC_ADJUST.
-	 */
-	vcpu_set_msr(vm, 0, MSR_IA32_TSC, HOST_ADJUST + val);
+	if (quirk_disabled) {
+		struct kvm_tsc_state state = {
+			.tsc = HOST_ADJUST + val,
+			.flags = 0
+		};
+		vcpu_ioctl(vm, VCPU_ID, KVM_SET_TSC_STATE, &state);
+	} else {
+		/*
+		 * Host: writes to MSR_IA32_TSC set the host-side offset
+		 * and therefore do not change MSR_IA32_TSC_ADJUST.
+		 */
+		vcpu_set_msr(vm, 0, MSR_IA32_TSC, HOST_ADJUST + val);
+	}
+
 	ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC), HOST_ADJUST + val);
 	ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC_ADJUST), val);
+
+	if (quirk_disabled) {
+		/*
+		 * Host: with the quirk disabled, writes to MSR_IA32_TSC
+		 * behave exactly as they do in the guest.
+		 */
+		vcpu_set_msr(vm, 0, MSR_IA32_TSC, val);
+		ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC), val);
+		ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC_ADJUST), val - HOST_ADJUST);
+
+		/* Restore the previous value. */
+		vcpu_set_msr(vm, 0, MSR_IA32_TSC, HOST_ADJUST + val);
+		ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC_ADJUST), val);
+	}
+
 	run_vcpu(vm, VCPU_ID, 3);

-	/* Host: writes to MSR_IA32_TSC_ADJUST do not modify the TSC. */
+	/*
+	 * Host: writes to MSR_IA32_TSC_ADJUST do not modify the TSC
+	 * (unless the quirk is disabled).
+	 */
 	vcpu_set_msr(vm, 0, MSR_IA32_TSC_ADJUST, UNITY * 123456);
-	ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC), HOST_ADJUST + val);
+	ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC),
+		  quirk_disabled ? UNITY * 123456 + HOST_ADJUST : HOST_ADJUST + val);
 	ASSERT_EQ(vcpu_get_msr(vm, 0, MSR_IA32_TSC_ADJUST), UNITY * 123456);

+	if (quirk_disabled) {
+		/*
+		 * Host: writes via KVM_SET_TSC_STATE to MSR_IA32_TSC and
+		 * MSR_IA32_TSC_ADJUST can be done independently.
+		 */
+		struct kvm_tsc_state state = {
+			.tsc = UNITY * 42,
+			.tsc_adjust = UNITY * 42 - HOST_ADJUST,
+			.flags = KVM_TSC_STATE_TSC_ADJUST_VALID
+		};
+
+		vcpu_ioctl(vm, VCPU_ID, KVM_SET_TSC_STATE, &state);
+
+		ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC), UNITY * 42);
+		ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC_ADJUST),
+			  UNITY * 42 - HOST_ADJUST);
+	}
+
 	/* Restore previous value. */
 	vcpu_set_msr(vm, 0, MSR_IA32_TSC_ADJUST, val);
 	ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC), HOST_ADJUST + val);
@@ -162,6 +215,12 @@ int main(void)
 	ASSERT_EQ(rounded_host_rdmsr(MSR_IA32_TSC_ADJUST), val - HOST_ADJUST);
 	kvm_vm_free(vm);
+}
+
+int main(void)
+{
+	run_test(false);
+	run_test(true);
 	return 0;
 }
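Outside the selftest framework, disabling the quirk is an ordinary KVM_ENABLE_CAP on the VM file descriptor, mirroring what the test above does with vm_enable_cap(). A hedged sketch, where vm_fd is a hypothetical open VM fd:

/*
 * Illustrative sketch only: disable KVM_X86_QUIRK_TSC_HOST_ACCESS so
 * that host-initiated TSC MSR accesses behave like guest accesses and
 * KVM_{GET|SET}_TSC_STATE become the migration interface.
 */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int disable_tsc_host_access_quirk(int vm_fd)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_DISABLE_QUIRKS;
	cap.args[0] = KVM_X86_QUIRK_TSC_HOST_ACCESS;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}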
On Thu, Dec 03, 2020 at 07:11:15PM +0200, Maxim Levitsky wrote:
> Hi!
>
> This is the second version of the work to make TSC migration more accurate, as was defined by Paulo at: https://www.spinics.net/lists/kvm/msg225525.html
Maxim,
Can you please add a description of the practical problem being fixed, preferably with instructions on how to reproduce it?
I'm queueing patches 2 and 3 for now, since the sane MSR interface is a prerequisite anyway.
Paolo