Hi Dietmar,
Quoting Dietmar Eggemann (2015-03-17 10:13:39)
Hi Mike,
On 16/03/15 22:08, Michael Turquette wrote:
Instead of tracking "requests" by tagging cpus that have updated stats in a cpumask, we could track "constraints", where a constraint is the return value of get_cpu_usage stored in a per-cpu variable whenever a task is migrated (or whenever the load contribution for a task changes significantly; more on that at the bottom).
I'm not sure this is necessary. The information that we have a cpu with a request is sufficient because we can always get the cpu usage just before we kick the kthread. (exactly what you do right now).
Well, I've never thought that calling get_cpu_usage from the kthread was really a good idea. Shouldn't there be some locking around accesses to the cfs_rq data structures?
This is one reason why I was considering migrating the data that the governor wants (per-cpu usage data) into per-cpu atomics, as opposed to iterating over every cpu with get_cpu_usage or utilization_load_avg_of (as it was called in my original series). It is basically a poor man's IPC mechanism.
If accessing cfs_rq without locking is actually safe then please let me know.
The issue is that we could see a lot of requests during the time the em mechanism is throttled.
Why don't you just add:
	cpumask_set_cpu(cpu_of(rq), &update_cpus);

	if (energy_aware() && !cpumask_empty(&update_cpus))
		arch_eval_cpu_freq(&update_cpus);
into task_tick_fair() to detect cpu usage changes due to a single task running in case there is no enqueue/dequeue happening?
Sure, and putting the change into update_entity_load_avg as I suggested achieves the same end result. That data will never be more stale than the rest of the PELT values.
Also I've been thinking a lot about the combination of throttling and the kthread. I'm starting to think that this design would end up being periodic (based on the throttle period), which isn't so different from how legacy CPUfreq governors work today. Do you have time to chat on the phone about this? I'll send out an invite.
[snip]
If we update constraints in per-cpu vars (as I outlined above) then could we move the code that does that out of {en,de}queue_fair_task and into some place common that solves this problem? I need some help on this one, but I was imagining something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fce9e0..266f4aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2785,6 +2785,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
 	int cpu = cpu_of(rq_of(cfs_rq));
+	int cap;
 	u64 now;
 
 	/*
@@ -2818,6 +2819,10 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 				-utilization_delta);
 	}
 
+	/* store updated per-cpu usage when update_cfs_rq is true */
+	cap = get_cpu_usage(cpu);
+	atomic_long_set(&per_cpu(usage, cpu), cap);
+
 	trace_sched_load_avg_cpu(cpu, cfs_rq);
 }
I guess it could work, but why change the current approach (calling arch_eval_cpu_freq() from CFS's enqueue, dequeue and *tick*)?
update_entity_load_avg() is all about updating PELT for the sched_entity se and the two cfs_rq PELT signals (cfs_rq->utilization_load_avg (blue*), cfs_rq->utilization_blocked_avg (green*)).
I just don't see the advantage right now of saving cpu usage (red*) per CFS enqueue, dequeue and tick operation compared to saving the
Thanks for the graph. I'll state some assumptions. Please let me know if I am wrong.
cpu0 & cpu1 are A15s whose max capacity is 1024. You are running some use case that mostly keeps those cpus fully utilized (or more correctly, over-utilized) at the highest frequency, which is why the red dot stays capped at SCHED_LOAD_SCALE.
cpu2 & cpu3 seem to show the same behavior but the different scales make it hard to be certain. I'll assume that they are both running at their highest frequency and over-utilized for most of the use case.
Assuming what I stated above is correct, this seems like useful data to me. The point of having the per-cpu atomic storage is to have lockless access to this useful info. If the cpufreq thread needs to iterate over this data to find the most utilized cpu in the freq domain then the per-cpu version of this removes any requirement to lock the cfs_rq structures.
information that a cpu requested a freq change. In case of 'pending requests' happening in a throttle period, what is the value of the per (cfs_)rq signal 'cpu usage' anyway? It could be a stale value once the throttle period is over.
My proposal was to update the per-cpu data in update_entity_load_avg every time the load contribution changes. Thus it would never be stale (or no more or less stale than the rest of PELT values). Please let me know if I have any holes in that assumption.
When the governor thread finally is ready to perform a cpu frequency transition it has already had the latest data *pushed* to it via the shared per-cpu atomic.
- colors in attached diagram (data from TC2 (2 A15, 2 A7) in
update_entity_load_avg()
The above untested snippet only cares about the CFS usage. In the future we might want to aggregate it with DL and RT usage as well, or we might have a per-class, per-cpu atomic.
IMHO, we should focus on CFS (with the underlying PELT mechanism) for now and solve this smaller set of problems first:
(1) Fixup the 'struct cpumask update_cpus' problem.
Can you elaborate on what this problem is? I wanted to replace this cpumask since I am not trying to solve the CONFIG_FAIR_GROUP_SCHED problem where a sched_entity is not a task. Is there another problem besides that?
(2) Fix the problem that we evaluate the cpu_usage exactly at the end of a throttling period if there were 'pending requests' even if there are no events at this moment.
Yes, I agree this is a problem.
(3) Kick the logic whenever we cross the up_threshold for scaling cpu frequency up.
This can be taken care of by either kicking the thread in update_entity_load_avg or by adding the thread-kick to every caller of that function, namely task_tick_fair as you pointed out.
Any ideas on this approach?
Further thoughts:
Assuming that the above idea is agreeable, one of the questions I am grappling with is whether we should store the usage in a per-cpu variable or if we should add it as a member to struct rq (in which case it would be a sum of usage from every sched class).
I am currently only looking at the CFS-centric, per-cpu atomic variable method, following the philosophy of Keep It Simple, Stupid. However putting it in struct rq has benefits:
I like this 'CFS-centric, per-cpu atomic variable method' because IMHO it gives the playground for everything we need right now to let it work with EAS.
I am confused. At the top of this mail you said you didn't see the advantage, but you like it here. Hopefully we can discuss it over the call.
Thanks, Mike
- locking is already managed for us when we need to update with a new
usage value, as _THIS_ cpu is the one that will update its own usage. This means we can avoid costly atomic operations.
- locking is already managed for us when walking the rq's looking for a
max usage in places like load_balance. It is NOT managed when a new task is placed on the rq though, so this would need some synchronization? Or we could ignore the possibility of scaling cpu when a new task is enqueued.
IMHO, we can live with the little overhead of some atomics for now to get it working in CFS with EAS for the end-of-March RFC.
-- Dietmar
Regards, Mike