On Thu, Dec 10, 2020 at 9:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
On Dec 10, 2020, at 6:52 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
On 08/12/20 22:20, Thomas Gleixner wrote:
So now live migration comes a long time after timekeeping had set the limits, and just because it's virt it expects that everything works and that it can just ignore these limits.
TBH, that's no different than SMM or hardware/firmware taking the machine out for lunch. It's exactly the same: it's broken.
I agree. If *live* migration stops the VM for 200 seconds, it's broken.
Sure, there's the case of snapshotting the VM over the weekend. My favorite solution would be to just put it in S3 before doing that. *Do what bare metal does* and you can't go that wrong.
Note though that QEMU has a couple of issues with S3, and it is disabled by default in libvirt. I would be very happy to work on improving this if there is a need for it.
There’s also the case where someone has a VM running on a laptop and someone closes the lid. The host QEMU might not have a chance to convince the guest to enter S3.
In general it's userspace policy whether to keep the TSC value the same across live migration. There are pros and cons to both approaches, so KVM should provide the functionality to keep the TSC running (which the guest will see as a very long, but not extreme, SMI), and this is what this series does. Maxim will change it to operate per-VM. Thanks Thomas, Oliver and everyone else for the input.
So, to be clear, this per-VM ioctl will work something like the following:
static u64 kvm_read_tsc_base(struct kvm *kvm, u64 host_tsc)
{
        return kvm_scale_tsc(kvm, host_tsc) + kvm->host_tsc_offset;
}
case KVM_GET_TSC_BASE:
        struct kvm_tsc_base base = {
                .flags = KVM_TSC_BASE_TIMESTAMP_VALID,
        };
        u64 host_tsc;

        kvm_get_walltime(&base.nsec, &host_tsc);
        base.tsc = kvm_read_tsc_base(kvm, host_tsc);

        copy_to_user(...);
[...]
case KVM_SET_TSC_BASE:
        struct kvm_tsc_base base;
        u64 host_tsc, nsec;
        s64 delta = 0;

        copy_from_user(...);

        kvm_get_walltime(&nsec, &host_tsc);
        delta += base.tsc - kvm_read_tsc_base(kvm, host_tsc);

        if (base.flags & KVM_TSC_BASE_TIMESTAMP_VALID) {
                s64 delta_nsec = nsec - base.nsec;

                if (delta_nsec > 0)
                        delta += nsec_to_cycles(kvm, delta_nsec);
                else
                        delta -= nsec_to_cycles(kvm, -delta_nsec);
        }

        kvm->host_tsc_offset += delta;
        /* plumb host_tsc_offset through to each vcpu */
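For illustration, here is a rough userspace-side sketch of how a VMM could drive those two ioctls across a migration. The ioctl numbers, the KVM_TSC_BASE_TIMESTAMP_VALID flag value and the struct layout are placeholders mirroring the kernel sketch above; none of this exists in <linux/kvm.h> today:

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>          /* for KVMIO */

        /* Placeholder layout/numbers mirroring the proposal above -- not a real UAPI. */
        struct kvm_tsc_base {
                __u64 tsc;
                __u64 nsec;
                __u32 flags;            /* KVM_TSC_BASE_TIMESTAMP_VALID */
                __u32 pad;
        };
        #define KVM_TSC_BASE_TIMESTAMP_VALID    (1u << 0)
        #define KVM_GET_TSC_BASE        _IOR(KVMIO, 0xcc, struct kvm_tsc_base)
        #define KVM_SET_TSC_BASE        _IOW(KVMIO, 0xcd, struct kvm_tsc_base)

        /* Source host: capture the (wallclock, TSC base) pair for the migration stream. */
        static int save_tsc_base(int vm_fd, struct kvm_tsc_base *out)
        {
                return ioctl(vm_fd, KVM_GET_TSC_BASE, out);
        }

        /* Destination host: hand the pair back; KVM accounts for elapsed wallclock time. */
        static int restore_tsc_base(int vm_fd, const struct kvm_tsc_base *in)
        {
                return ioctl(vm_fd, KVM_SET_TSC_BASE, in);
        }

The migration stream would carry the struct between save_tsc_base() on the source and restore_tsc_base() on the destination.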
However, I don't believe we can assume the guest's TSCs are synchronized, even if sane guests never touch them. In this case, I think a per-vCPU ioctl is still warranted, allowing userspace to get at the guest CPU adjust component of Thomas' equation below (paraphrased):
TSC guest CPU = host tsc base + guest base offset + guest CPU adjust
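Purely as an illustration of how those three terms compose, in the vocabulary of the sketch above (guest_tsc_on_vcpu() is made up, and reading "guest CPU adjust" as the vCPU's IA32_TSC_ADJUST value is my assumption):

        static u64 guest_tsc_on_vcpu(struct kvm_vcpu *vcpu, u64 host_tsc)
        {
                /* host tsc base + guest base offset, i.e. kvm_read_tsc_base() above */
                u64 base = kvm_read_tsc_base(vcpu->kvm, host_tsc);

                /* plus the per-vCPU "guest CPU adjust" term (IA32_TSC_ADJUST) */
                return base + (u64)vcpu->arch.ia32_tsc_adjust_msr;
        }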
Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be problematic for a couple reasons. First, depending on the guest's CPUID the TSC_ADJUST MSR may not even be available, meaning that the guest could've used IA32_TSC to adjust the TSC (eww). Second, userspace replaying writes to IA32_TSC (in the case IA32_TSC_ADJUST doesn't exist for the guest) seems _very_ unlikely to work given all the magic handling that KVM does for writes to it.
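On the first point, userspace can at least tell whether a TSC_ADJUST-based path is usable at all by checking the CPUID it handed the guest (leaf 7, subleaf 0, EBX bit 1 enumerates IA32_TSC_ADJUST). A minimal sketch, eliding the KVM_GET_CPUID2 allocation dance and error handling:

        #include <stdbool.h>
        #include <linux/kvm.h>

        /* CPUID.(EAX=7,ECX=0):EBX bit 1 enumerates IA32_TSC_ADJUST. */
        static bool guest_has_tsc_adjust(const struct kvm_cpuid2 *cpuid)
        {
                for (unsigned int i = 0; i < cpuid->nent; i++) {
                        const struct kvm_cpuid_entry2 *e = &cpuid->entries[i];

                        if (e->function == 7 && e->index == 0)
                                return e->ebx & (1u << 1);
                }
                return false;
        }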
Is this roughly where we are or have I entirely missed the mark? :-)
--
Thanks,
Oliver
I agree with that.
I still think though that we should have a discussion on the feasibility of making the kernel timekeeping code deal with large *forward* TSC jumps without crashing.
If that is indeed hard to do, or will cause performance issues, then I agree that we might indeed inform the guest of time jumps instead.
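To make the "without crashing" part concrete: clocksource deltas are converted to nanoseconds with something of the shape (delta * mult) >> shift, so a VM that was stopped long enough comes back with a delta for which the 64-bit multiply overflows. A standalone back-of-the-envelope illustration (the mult/shift values are invented for a ~1 GHz TSC, not taken from the kernel):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint32_t mult = 4194304;        /* example mult for a 1 GHz TSC, shift = 22 */
                uint32_t shift = 22;
                uint64_t max_delta = UINT64_MAX / mult;                 /* largest delta that converts safely */
                uint64_t weekend = 2ULL * 24 * 3600 * 1000000000ULL;    /* ~2 days stopped, in cycles at 1 GHz */

                (void)shift;
                printf("safe delta limit: %llu cycles (~%llu s at 1 GHz)\n",
                       (unsigned long long)max_delta,
                       (unsigned long long)(max_delta / 1000000000ULL));
                printf("weekend snapshot delta: %llu cycles -> %s\n",
                       (unsigned long long)weekend,
                       weekend > max_delta ? "overflows the conversion" : "still fine");
                return 0;
        }

That is roughly the kind of limit the timekeeping code has set, which Thomas refers to above.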
Tglx, even without fancy shared host/guest timekeeping, could the guest kernel manage to update its timekeeping if the host sent the guest an interrupt or NMI on all CPUs synchronously on resume?
Alternatively, if we had the explicit “max TSC value that makes sense right now” in the timekeeping data, the guest would reliably notice the large jump and could at least do something intelligent about it instead of overflowing its internal calculation.
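As a sketch of what the guest side could do with such a bound, assuming the host published a tsc_max_sane value alongside the pvclock data (no such field exists today, and resync_from_host_clock() is entirely hypothetical):

        static u64 guest_clock_read(struct pvclock_vcpu_time_info *pv, u64 tsc_max_sane)
        {
                u64 tsc = rdtsc();

                if (unlikely(tsc > tsc_max_sane)) {
                        /*
                         * The TSC ran past the host-provided bound: the VM was
                         * probably stopped or snapshotted.  Resynchronize from
                         * the host clock instead of converting the huge delta.
                         */
                        return resync_from_host_clock();        /* hypothetical */
                }

                return __pvclock_read_cycles(pv, tsc);
        }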