On Dec 9, 2020, at 2:14 AM, Thomas Gleixner tglx@linutronix.de wrote:
But what's more problematic is the basic requirement that time all over the place has to be consistent.
On machines which use DISTORTED_REALTIME everything _IS_ consistent within the distorted universe they created. It's still inconsistent vs. the outside, but that's unsolvable and none of our problems.
TLDR: Do not even think about opening pandoras box.
This could be a per-machine/per-vm setting that nonetheless uses the same underlying implementation. There are, sadly, lots of people using smeared time, and there will probably be VM hosts that simultaneously have both styles of guest. Supporting full PV time for both would be nice. Obviously this gets a bit screwy if they are using a paravirt fs, but it’s also a problem with NFS, etc. So maybe the nasty corners could be narrow enough to just say “don’t do that”.
If we want to get fancy, we can make a change that I've contemplated for a while -- we could make t_end explicit and have two copies of all these data structures. The reader would use one copy if t < t_change and a different copy if t >= t_change.
See below.
This would allow NTP-like code in usermode to schedule a frequency shift to start at a specific time.
That's an orthogonal problem and can be done without changing the reader side.
Really? Right now, the switch happens whenever the kernel takes the seqlock, which it doesn’t have exact control over. But I meant something a little different:
With some care, it would also allow the timekeeping code to update the data structures without causing clock_gettime() to block while the timekeeping code is running on a different CPU.
It still has to block, i.e. retry, because the data set becomes invalid when t_end is reached. So the whole thing would be:
do { seq = read_seqcount_latch(); data = select_data(seq); delta = read_clocksource() - data->base; if (delta >= data->max_delta) continue; .... } while (read_seqcount_latch_retry());
TBH, I like the idea for exactly one reason: It increases robustness.
I like the max_delta for robustness, too.
What do you have in mind for select_data()? Is the idea that select_data() returns something like &data[seq & 1]?
But I actually meant something a little bit different: you would use delta >= data->max_delta as an indication that you need to look at the other copy. Perhaps the lower three bits of the seqcount would work like:
00: both copies are valid, but start with the first copy. 10: only the first copy is valid. 01: both copies are valid, but start with the second copy. 11: only the second copy is valid
You'd read it like this (with all the bugs that I surely have fixed);
do { seq = read_seqcount(); data = &data_array[seq & 1]; clock = read_clocksource(); delta = clock - data->base; if (delta->data->max_delta) { if (seq & 2) continue; data = &data_array[(seq + 1) & 1]; // <-- the other copy delta = clock - data->base; if (delta >= data->max_delta) continue; } ...; } while (seq == read_seqcount());
This has two main benefits. First, it allows the timekeeping code to run concurrently with readers, which is nice for tail latency -- readers would only ever need to spin if the timekeeper falls behind, intentionally disables both copies, or somehow manages to run one iteration for each reader attempt and livelocks the reader. The latter is very unlikely.) Second, it allows the timekeeping code to literally schedule an update to occur at a precise clocksource tick, which seems to be like it could make the timekeeping code simpler and more robust.
(If the timekeeper wants to simultaneously disable both copies, it sets one copy's max_delta to zero and uses seq to disable the other copy.)
--Andy
For X86 we already have the comparison for dealing with TSC < base which would be covered by
if (delta >= data->max_delta) continue;
automatically. Non X86 gains this extra conditional, but I think it's worth to pay that price.
It won't solve the VM migration problem on it's own though. You still have to be careful about the inner workings of everything related to timekeeping itself.
One other thing that might be worth noting: there's another thread about "vmgenid". It's plausible that it's worth considering stopping the guest or perhaps interrupting all vCPUs to allow it to take some careful actions on migration for reasons that have nothing to do with timekeeping.
How surprising. Who could have thought about that?
OMG, virtualization seems to have gone off into a virtual reality long ago.
Thanks,
tglx