On 21/01/15 17:29, John Stultz wrote:
On Wed, Jan 21, 2015 at 8:53 AM, Daniel Thompson <daniel.thompson@linaro.org> wrote:
Currently it is possible for an NMI (or FIQ on ARM) to come in and read sched_clock() whilst update_sched_clock() has half updated the state. This results in a bad time value being observed.
This patch fixes that problem in a similar manner to Thomas Gleixner's 4396e058c52e("timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC").
Note that ripping out the seqcount lock from sched_clock_register() and replacing it with a large comment is not nearly as bad as it looks! The locking here is actually pretty useless since most of the variables modified within the write lock are not covered by the read lock. As a result a big comment and the sequence bump implicit in the call to update_epoch() should work pretty much the same.
It still looks pretty bad, even with the current explanation.
I'm inclined to agree. Although, to be clear, the code I proposed should be no more broken than the code we have today (and is arguably more honest).
-	raw_write_seqcount_begin(&cd.seq);
+	/*
+	 * sched_clock will report a bad value if it executes
+	 * concurrently with the following code. No locking exists to
+	 * prevent this; we rely mostly on this function being called
+	 * early during kernel boot up before we have lots of other
+	 * stuff going on.
+	 */
 	read_sched_clock = read;
 	sched_clock_mask = new_mask;
 	cd.rate = rate;
 	cd.wrap_kt = new_wrap_kt;
 	cd.mult = new_mult;
 	cd.shift = new_shift;
-	cd.epoch_cyc = new_epoch;
-	cd.epoch_ns = ns;
-	raw_write_seqcount_end(&cd.seq);
+	update_epoch(new_epoch, ns);
So looking at this, the sched_clock_register() function may not be called super early, so I was looking to see what prevented bad reads prior to registration.
Certainly not super early but, from the WARN_ON() at the top of the function, I thought it was intended to be called before start_kernel() unmasks interrupts...
And from quick inspection, it's nothing. I suspect the undocumented trick that makes this work is that the mult value is initialized to zero, so sched_clock returns 0 until things have been registered.
So it does seem like it would be worthwhile to do the initialization under the lock, or possibly use the suspend flag to make the first initialization safe.
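To make the arithmetic behind that suspicion concrete, here is a minimal user-space sketch. It assumes the conversion has the form epoch_ns + ((delta * mult) >> shift); the struct and function names are hypothetical stand-ins, not the kernel's, and whether mult really starts at zero in the kernel is not confirmed here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for clock_data; only the fields used below. */
struct clock_data_sketch {
	uint64_t epoch_ns;
	uint64_t epoch_cyc;
	uint32_t mult;
	uint32_t shift;
};

/* Assumed conversion form: scale cycles to ns with a multiply and shift. */
static uint64_t cyc_to_ns_sketch(uint64_t cyc, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);	/* a shift of 64 or more is undefined in C */
	return (cyc * mult) >> shift;
}

/* Reader: cycles elapsed since the epoch, masked, converted, added to epoch_ns. */
static uint64_t sched_clock_sketch(const struct clock_data_sketch *cd,
				   uint64_t cyc_now, uint64_t mask)
{
	uint64_t delta = (cyc_now - cd->epoch_cyc) & mask;

	return cd->epoch_ns + cyc_to_ns_sketch(delta, cd->mult, cd->shift);
}
```

With a zero-initialised structure every term is multiplied by mult == 0, so the reader returns 0 whatever the counter says. That only covers the window before the first registration; it does nothing for a reader that runs while the fields are being rewritten.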
As mentioned, the existing write lock doesn't really do very much at the moment.
The simple and (I think) strictly correct approach is to duplicate the whole of the clock_data (minus the seqcount) and make the read lock in sched_clock cover all accesses to the structure.
This would substantially enlarge the critical section in sched_clock(), meaning we might loop round the seqcount fractionally more often. However, if that causes any real problems it would be a sign the epoch was being updated too frequently.
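One way to read the proposal, in the style of the commit cited earlier in the thread, is a two-copy "latch": the low bit of the sequence count selects which copy a reader uses, and the writer only touches the copy that readers are currently steered away from. The sketch below is user-space C11 with hypothetical names (clock_read_data_sketch, latch_update, latch_read_ns); it is not the kernel implementation. It uses default sequentially consistent atomics instead of the kernel's barriers, and it is only exercised single-threaded, since racing plain struct copies is formally a data race in C11.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical: everything the reader needs, so one copy is self-consistent. */
struct clock_read_data_sketch {
	uint64_t epoch_ns;
	uint64_t epoch_cyc;
	uint32_t mult;
	uint32_t shift;
};

struct clock_latch_sketch {
	atomic_uint seq;			/* low bit selects the copy */
	struct clock_read_data_sketch copy[2];
};

static void latch_init(struct clock_latch_sketch *l,
		       const struct clock_read_data_sketch *rd)
{
	atomic_init(&l->seq, 0);
	l->copy[0] = *rd;
	l->copy[1] = *rd;
}

/* Writer: never modifies the copy that readers are currently directed to. */
static void latch_update(struct clock_latch_sketch *l,
			 const struct clock_read_data_sketch *rd)
{
	atomic_fetch_add(&l->seq, 1);	/* seq odd: readers use copy[1] */
	l->copy[0] = *rd;
	atomic_fetch_add(&l->seq, 1);	/* seq even: readers use copy[0] */
	l->copy[1] = *rd;
}

/* Reader: all accesses fall inside the seq check, and it never waits on a writer. */
static uint64_t latch_read_ns(struct clock_latch_sketch *l, uint64_t cyc_now)
{
	unsigned int seq;
	uint64_t ns;

	do {
		seq = atomic_load(&l->seq);
		const struct clock_read_data_sketch *rd = &l->copy[seq & 1];

		ns = rd->epoch_ns +
		     (((cyc_now - rd->epoch_cyc) * rd->mult) >> rd->shift);
	} while (atomic_load(&l->seq) != seq);

	return ns;
}
```

The point of the design is that a reader which interrupts the writer on the same CPU (the NMI/FIQ case) finds an odd sequence number, reads the untouched copy[1], and returns at once instead of spinning on a writer that cannot make progress. The cost is the duplicated data and a slightly longer read-side section, as noted above.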
Unless I get any objections (or you really want me to look closely at using suspend) then I'll try this approach in the next day or two.
Daniel.