On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
On 08/12/20 18:08, Maxim Levitsky wrote:
Even if you support TSCADJUST and let the guest write to it does not change the per guest offset at all. TSCADJUST is per [v]CPU and adds on top:
tscvcpu = tsc_host + guest_offset + TSC_ADJUST
Scaling is just orthogonal and does not change any of this.
I agree with this, and I think that this is what we will end up doing. Paulo, what do you think about this?
Yes, you can have a VM ioctl that saves/restores cur_tsc_nsec and cur_tsc_write. The restore side would loop on all vcpus.
Why would the restore need to run on all VCPUs though?
The picture that I have in mind is that we we will restore the cur_tsc_nsec/cur_tsc_write pair once per VM, and then restore the TSC_ADJUST value on all vCPUs as if the guest wrote it (with the quirk disabled), which will restore all the horrible things that the guest did to its TSC.
Since TSC adjust is enabled by default, if the user disables it in the CPUID, we might as well just disable the whole thing, make TSC readonly or something similar.
This way we don't need to worry about TSC writes as those will be reflected in the TSC_ADJUST.
However, it is not so easy: 1) it would have to be usable only if KVM_X86_QUIRK_TSC_HOST_ACCESS is false, 2) it would fail if kvm->arch.nr_vcpus_matched_tsc == kvm->online_vcpus (which basically means that userspace didn't mess up the TSC configuration). If not, it would return -EINVAL.
Note though that we don't track the guest tsc/tsc_adjust writes anymore via the tsc sync code, and with the quirk disabled we don't track them even for the host writes, thus the (2) condition will always be true (the only call left to kvm_synchronize_tsc is in the kvm_arch_vcpu_postcreate)
However I indeed see no reason to allow new per VM API when the masterclock is disabled, and therefore using cur_tsc_nsec/cur_tsc_write is reasonable.
The cur_tsc_nsec should IMHO be converted to absolute time (CLOCK_REALTIME or similiar) or should the conversion be done in the userspace? I don't know yet if I can convert between different POSIX clocks in race/error free manner. (e.g get the offset between them).
Also, while at it let's burn and pour salt on the support for KVM_SET_TSC_KHZ unless TSC scaling is supported, together with vcpu->tsc_catchup and all the "tolerance" crap that is in kvm_set_tsc_khz. And initialize vcpu->arch.virtual_tsc_khz to kvm->arch.last_tsc_khz before calling kvm_synchronize_tsc.
The tsc_catchup is enabled when host TSC is unstable, so that guest TSC at least roughtly follows real time (which host kernel gets by other means).
We push the guest tsc forward roughtly each time VM entry from userspace happens:
(On each kvm_arch_vcpu_load, we raise KVM_REQ_GLOBAL_CLOCK_UPDATE request which (with small delay) makes all vcpus go through kvm_guest_time_update which pushes the guest tsc forward)
However we also have the 'tsc_always_catchup' mode which is indeed something I would like to remove.
It is a poor man simulation of the TSC scaling which is enabled when the host doesn't support TSC scaling, but we were asked to run at frequency faster than host TSC frequency is.
This way guest tsc runs slower than it should, but on each VM exit, we punt the guest tsc offset forward so on average the guest tsc doesn't lag behind.
Unless backward compatibility is an issue, I have no objections to remove that code.
The tolerance thing might be needed. On many systems (including mine new 3970X), the hardware doesn't have means to report the unscaled host TSC frequency, thus the kernel has to measure it, and since the measurement is not 100% accurate, a slightly different value is used every time the host boots.
Thus without a small tolerance, we will always have to use TSC scaling, while migrating even between two identical systems. I don't know if this is an issue.
Best regards, Maxim Levitsky
Paolo