On Mon, 2020-12-07 at 10:04 -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 9:00 AM, Maxim Levitsky mlevitsk@redhat.com wrote:
On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
On Dec 7, 2020, at 8:38 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote: From a timekeeping POV and the guests expectation of TSC this is fundamentally wrong:
tscguest = scaled(hosttsc) + offset
The TSC has to be viewed systemwide and not per CPU. It's systemwide used for timekeeping and for that to work it has to be synchronized.
Why would this be different on virt? Just because it's virt or what?
Migration is a guest wide thing and you're not migrating single vCPUs.
This hackery just papers over he underlying design fail that KVM looks at the TSC per vCPU which is the root cause and that needs to be fixed.
I don't disagree with you. As far as I know the main reasons that kvm tracks TSC per guest are
- cases when host tsc is not stable
(hopefully rare now, and I don't mind making the new API just refuse to work when this is detected, and revert to old way of doing things).
That's a trainwreck to begin with and I really would just not support it for anything new which aims to be more precise and correct. TSC has become pretty reliable over the years.
- (theoretical) ability of the guest to introduce per core tsc offfset
by either using TSC_ADJUST (for which I got recently an idea to stop advertising this feature to the guest), or writing TSC directly which is allowed by Intel's PRM:
For anything halfways modern the write to TSC is reflected in TSC_ADJUST which means you get the precise offset.
The general principle still applies from a system POV.
TSC base (systemwide view) - The sane case
TSC CPU = TSC base + TSC_ADJUST
The guest TSC base is a per guest constant offset to the host TSC.
TSC guest base = TSC host base + guest base offset
If the guest want's this different per vCPU by writing to the MSR or to TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which is the offset to the TSC base of the guest.
How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
This is one of the things I had in mind recently.
Even better, we can stop advertising TSC_ADJUST in CPUID to the guest and forbid it from writing it at all.
Seems reasonable to me.
It also seems okay for some MSRs to stop working after the guest enabled new PV timekeeping.
This is a very good idea!
I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.
Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds in a race free way :). But I suppose that just exposing TAI and letting the guest deal with the TAI - UTC offset itself would get the job done just fine.
This is a good idea too. As I understand it, this gives a justification to a new kvmclock purpose, which wouldn't be focused anymore on correcting the tsc shortcomings (unstable/unscalable tsc), but more on things like that. I like that idea.
Best regards, Maxim Levitsky