On 16/09/16 19:42, Srivatsa Vaddagiri wrote:
- Juri Lelli <juri.lelli@arm.com> [2016-09-16 10:22:52]:
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
Do you think it might be actually possible to relax the synchronization constraint and implement WALT with un-synchronized windows?
The main difficulty would be adjusting busy counters when a task migrates. Synchronized windows make this pretty trivial: we subtract the task's current window contribution from src_cpu and add it to dst_cpu.
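A minimal sketch of that migration fixup, assuming synchronized windows. The struct and function names here are hypothetical, simplified stand-ins and are not taken from the WALT patches; the caller is assumed to hold both runqueue locks.

```c
#include <stdint.h>

/* Hypothetical, simplified per-CPU window state. */
struct cpu_window {
	uint64_t window_start;  /* identical on every CPU when windows are synchronized */
	uint64_t curr_runnable; /* busy time accumulated in the current window, in ns */
};

/* Hypothetical per-task window state. */
struct task_window {
	uint64_t curr_window;   /* this task's contribution to the current window, in ns */
};

/*
 * Move a task's current-window contribution from src to dst.
 * Only meaningful if both CPUs agree on window_start.
 * Returns 0 on success, -1 if the fixup cannot be applied.
 */
static int migrate_busy_time(struct cpu_window *src, struct cpu_window *dst,
			     const struct task_window *p)
{
	if (src->window_start != dst->window_start)
		return -1; /* windows not aligned: contributions are not comparable */
	if (src->curr_runnable < p->curr_window)
		return -1; /* would underflow the source counter */

	src->curr_runnable -= p->curr_window;
	dst->curr_runnable += p->curr_window;
	return 0;
}
```

With un-synchronized windows the first check fails, which is exactly the case where the task's contribution would have to be rescaled or split before it could be moved.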
Right. PELT does the removed_{load,utilization} atomic dance to solve this problem. But, of course, it is not as immediate as an atomic src/dst update as you do. Signals are still an approximation of real execution though, so it remains to be seen how much the added locking pays off.
In our early version of WALT, we did not have synchronized windows across CPUs. Windows applied to just tasks and not cpus. Each task tracked its own window_start and cpus did not even track windows. The cited benefit of WALT (rapid reclassification of tasks) can still be had from such a scheme.
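For comparison, the deferred removal pattern can be sketched in userspace C11 roughly as below. This is only an illustration of the idea (the migrating side posts the removed amount atomically, and the source side folds it in at its next update); the struct and function names are made up for the example and are not the kernel's.

```c
#include <stdatomic.h>

/* Hypothetical, simplified stand-in for a per-runqueue utilization signal. */
struct cfs_rq_sketch {
	long util_avg;                /* only touched by the owning CPU */
	atomic_long removed_util_avg; /* posted by whoever migrates a task away */
};

/* Migration path: no source runqueue lock needed, just post the amount. */
static void remove_entity_util(struct cfs_rq_sketch *src, long task_util)
{
	atomic_fetch_add(&src->removed_util_avg, task_util);
}

/* Owning CPU's next update: drain whatever was posted and apply it. */
static void update_cfs_rq_util(struct cfs_rq_sketch *cfs_rq)
{
	long removed = atomic_exchange(&cfs_rq->removed_util_avg, 0);

	cfs_rq->util_avg -= removed;
	if (cfs_rq->util_avg < 0)
		cfs_rq->util_avg = 0; /* clamp, since the signal is an approximation */
}
```

Between the post and the next update the source signal still includes the departed task, which is the "not as immediate" part.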
Yes.
The additional advantage we get from synchronized windows and busy-time adjustment upon migration is related to frequency. Let's say a task is migrating between the little cpu cluster and the big cpu cluster at the end of a window (because it got classified as a big task towards the end of the window). Synchronized windows allow us to migrate the task's busy time away from its little cpu to the big cpu. The load reported for the little cpu after this adjustment ensures that the little cpu's frequency for the next window does not include a representation of the migrated task's needs. Vice versa for the big cpu.
Yeah. I remember we had that problem on the product codeline before switching to WALT, and the way it was fixed involved kicking a utilization update in the src runqueue (little cpu in your example) and then a frequency re-evaluation for the src runqueue's cluster. The big CPU update is already happening as a consequence of an enqueue there.
I haven't given much thought on how impossible this would be to achieve on other architectures. Does anyone foresee this to be a show-stopper on any architecture?
I'm mostly afraid of the fact that you basically reintroduce double locking on migration after it has been removed (for CFS) a couple of years ago with commit 163122b7fcfa "sched/fair: Remove double_lock_balance() from load_balance()".
Regarding overheads associated with synchronization, there is only a small overhead during bootup when secondary cpus need to sync up on 'window_start' for the first time. After that they roll on their own (provided there is some constant offset between the hardware clocks of the various cpus).
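To illustrate the "roll on their own" part, here is a rough sketch. It assumes each cpu knows a constant offset between its local clock and a common reference; the names, the offset field and the 20 ms window size are assumptions for the example, not values taken from the patches.

```c
#include <stdint.h>

#define WINDOW_SIZE_NS 20000000ULL /* assumed 20 ms window, illustrative only */

/* Hypothetical per-CPU clock/window state. */
struct cpu_clock_sketch {
	int64_t  offset_ns;    /* constant offset of the local clock from the reference */
	uint64_t window_start; /* expressed in reference time, agreed on once at bootup */
};

/*
 * Advance window_start to the most recent window boundary.
 * Returns the number of full windows that have elapsed.
 */
static uint64_t roll_window(struct cpu_clock_sketch *c, uint64_t local_now_ns)
{
	uint64_t now = (uint64_t)((int64_t)local_now_ns - c->offset_ns);
	uint64_t nr;

	if (now < c->window_start)
		return 0;

	nr = (now - c->window_start) / WINDOW_SIZE_NS;
	c->window_start += nr * WINDOW_SIZE_NS;
	return nr;
}
```

As long as the offset really is constant, every cpu lands on the same window boundaries without any further cross-cpu communication. If the clocks drift relative to each other, this no longer holds.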
Not sure we can assume this for all architectures.
The other subtle overhead related to synchronization is that we require both the src_rq and dst_rq locks to be held during migration (so that we can fix up busy times). I need to think some more and see if we may be able to relax that requirement.
- vatsa
-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project