On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote:
> On Dec 8, 2020, at 11:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> One issue here is that guests might want to run their own NTP/PTP. One reason to do that is that some people prefer the leap second smearing NTP servers.
>
> I would hope that using this part would be optional on the guest’s part. Guests should be able to use just the CLOCK_MONOTONIC_RAW part or fancier stuff at their option.
>
> (Hmm, it would, in principle, be possible for a guest to use the host’s TAI but still smear leap seconds. Even without virt, smearing could be a per-timens option.)
No. Don't even think about it. Read the thread:
https://lore.kernel.org/r/20201030110229.43f0773b@jawa
all the way through the end and then come up with a real proposal which solves all of the issues mentioned there.
I might be missing the obvious, but please before you make proposals about time keeping out of thin air, please do your homework. You would not ask these questions otherwise.
If it would be that simple we wouldn't be discussing this in the first place.
Sorry for being blunt, but this has been discussed to death already.
It can be solved on the theory level, but it's not practical.
You _cannot_ make leap second smearing or different notions of clock realtime an isolated problem. We have
  CLOCK_MONOTONIC
  CLOCK_BOOTTIME
  CLOCK_REALTIME
  CLOCK_TAI
They share one fundamental property:
All frequency adjustments done by NTP/PTP/PPS or whatever affect _ALL_ of them in exactly the same way.
I.e. leap second smearing, whether you like it or not, affects all of them and there is nothing we can do about that. Why?
1) Because it's wrong to begin with. It creates a separate universe of CLOCK_REALTIME and therefore of CLOCK_TAI because they are strictly coupled by definition.
2) It's user space ABI. adjtimex() can make the kernel do random crap. So if you extend that to time namespaces (which is pretty much the same as a guest time space) then you have to solve the following problems:
A) The tick-based gradual adjustments of adjtimex(), which avoid time jumping around, have to be done for every time universe which has different parameters than the host.
Arguably this can be solved by a seqcount based magic hack which forces the first time namespace consumer into updating the per time namespace notion of time instead of doing it from the host tick. But that requires fully synchronized nested sequence counts, and extending this to nested virt makes the problem exponential.
B) What to do about enforced time jumps on the host (settimeofday, adjtimex)?
C) Once you have solved #A and #B, explain how timers (nanosleep, interval timers, ...), which are all user space ABI and have the fundamental guarantee to not expire early, can be handled in a sane and scalable way.
Once you have a coherent answer for all of the above I'm happy to step down and hand over. In that case I'm more than happy not to have to deal with the inevitable regression reports.
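For reference, the fundamental property above in code: a vDSO-style read path, sketched in kernel-style C. The names are illustrative, not the actual struct vdso_data layout. One NTP/PTP-adjusted mult/shift pair is applied to the cycle delta for every clock id; only the per clock base offsets differ, so any frequency trickery, smearing included, hits MONOTONIC, BOOTTIME, REALTIME and TAI alike.

  struct vclock_data {
          u64     cycle_last;     /* clocksource cycles at last update */
          u32     mult;           /* frequency correction from NTP/PTP/PPS */
          u32     shift;
          u64     base_ns[4];     /* per clock id base in nanoseconds */
  };

  static u64 vclock_read_ns(const struct vclock_data *vd, int clk)
  {
          u64 delta = rdtsc() - vd->cycle_last;

          /* one and the same frequency correction for all clock ids */
          return vd->base_ns[clk] + ((delta * vd->mult) >> vd->shift);
  }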
> tglx etc, I think that doing this really really nicely might involve promoting something like the current vDSO data structures to ABI -- a straightforward-ish implementation would be for the KVM host to export its vvar clock data to the guest and for the guest to use it, possibly with an offset applied. The offset could work a lot like timens works today.
Works nicely if the guest TSC is not scaled. But that means that on migration the raw TSC usage in the guest is borked because the new host might have a different TSC frequency.
If you use TSC scaling then the conversion needs to take the scaling into account, which needs some thought. And the guest would need to read the host conversion from 'vdso data' and the scaling from the next (per guest) page, and would then still have to support timens. Doable, but it adds extra overhead on every time read operation.
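Per clock read that stacks up roughly like this. A sketch with illustrative names; the fixed-point scaling mirrors what KVM's kvm_scale_tsc() does with a 128bit intermediate:

  /* illustrative globals standing in for the shared data pages */
  static u64 guest_tsc_offset, inv_tsc_ratio;    /* per guest page */
  static u64 host_cycle_last, host_base_ns[4];   /* host 'vdso data' */
  static u32 host_mult, host_shift;
  static u64 timens_offset_ns[4];                /* per time namespace */
  static unsigned int frac_bits;

  static u64 scale_tsc(u64 tsc, u64 ratio, unsigned int bits)
  {
          /* 128bit intermediate to avoid overflow */
          return (u64)(((unsigned __int128)tsc * ratio) >> bits);
  }

  static u64 guest_clock_read_ns(int clk)
  {
          /* 1) undo the per guest TSC offset and scaling (inverse ratio) */
          u64 host_tsc = scale_tsc(rdtsc() - guest_tsc_offset,
                                   inv_tsc_ratio, frac_bits);

          /* 2) host side conversion to nanoseconds */
          u64 ns = host_base_ns[clk] +
                   (((host_tsc - host_cycle_last) * host_mult) >> host_shift);

          /* 3) and on top of that the timens offset */
          return ns + timens_offset_ns[clk];
  }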
> Is the issue that scaling would result in a different guest vs host frequency? Perhaps we could limit each physical machine to exactly two modes: unscaled (use TSC ticks, convert in software) and scaled to nanoseconds (CLOCK_MONOTONIC_RAW is RDTSC + possible offset). Then the host could produce its data structures in exactly those two formats and export them as appropriate.
The latter - nanoseconds scaling - is the only reasonable solution but then _ALL_ involved hosts must agree on that.
If you want to avoid that you are back to the point where you need to chase all guest data when the host NTP/PTP adjusts the host side. Chasing and updating all this stuff in the tick was the reason why I was fighting the idea of clock realtime in namespaces.
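I.e. the host tick ends up with a fan-out like this (purely illustrative names and data structures):

  struct guest_time_page {
          struct guest_time_page  *next;
          /* per guest conversion data, smear state, ... */
  };

  static struct guest_time_page *guest_pages;

  /* hypothetical: recompute one guest's conversion data */
  static void update_conversion(struct guest_time_page *g)
  {
          /* fold the new host NTP/PTP state into the guest page */
  }

  static void tick_update_guest_time_pages(void)
  {
          struct guest_time_page *g;

          /* O(nr_guests) work on every host side adjustment */
          for (g = guest_pages; g; g = g->next)
                  update_conversion(g);
  }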
> I think that, if we can arrange for a small, bounded number of pages generated by the host, then this problem isn’t so bad.
Wishful thinking unless we have a very strict contract about:

- scaling to 1 GHz, aka nanoseconds
- all guests agreeing with the host-defined management of clock REALTIME and TAI
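If all hosts honour that contract, a (scaled) TSC tick _is_ a nanosecond and the guest fast path for CLOCK_MONOTONIC_RAW collapses to something like this sketch (names illustrative):

  static inline u64 guest_mono_raw_ns(u64 offset_ns)
  {
          /* by contract the guest visible TSC counts nanoseconds */
          return rdtsc() + offset_ns;
  }

Migration then only needs to fix up the per guest offset and scaling data; the guest's notion of a TSC tick does not change.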
> Hmm, leap second smearing is just a different linear mapping. I’m not sure how leap second smearing should interact with timens, but it seems to me that the host should be able to produce four data pages (scaled vs unscaled and smeared vs unsmeared) and one per-guest/timens offset page (where offset applies to MONOTONIC and MONOTONIC_RAW only) and cover all bases. Or do people actually want to offset their TAI and/or REALTIME, and what would that even mean if the offset crosses a leap second?
See above.
And yes people want to run different time universes where TAI != TAI.
> (I haven’t thought about the interaction of any of this with ART.)
Obviously not :) And I'm pretty sure that nobody else did, but ART is the least of my worries so far.
The current VM migration approach is: Get it done no matter what, we deal with the fallout later. Which means endless tinkering because you can't fix essential design fails after the fact.
Thanks,
tglx