Hello,
I'm using the EASv5 3.18 tree with cpufreq_sched. With the sched governor enabled I've noticed that after a migration or after a switch from a non-fair task to the idle task, the source CPU goes idle and its (possibly max) capacity request stays in place, preventing other requests from going through until that source CPU decides to wake up and take up some work. I know that there are some ongoing discussions about how to actually enforce a frequency reduction when a CPU enters idle to save power, but this seems to be a more immediate problem since the other CPU(s)' requests are also basically ignored. How about a reset_capacity call in pick_next_task_idle? Throttling is a concern I suppose, but I think the check in dequeue_task_fair is doing the same thing already, so the following would just repeat for non_fair_class->idle_task.
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index c65dac8..555c21d 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,8 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
 	put_prev_task(rq, prev);
 
+	cpufreq_sched_reset_cap(cpu_of(rq));
+
 	schedstat_inc(rq, sched_goidle);
 	return rq->idle;
 }
Thanks, Vikram
Hi Vikram,
On 11/11/2015 11:16 AM, Vikram Mulukutla wrote:
I'm using the EASv5 3.18 tree with cpufreq_sched. With the sched governor enabled I've noticed that after a migration or after a switch from a non-fair task to the idle task, the source CPU goes idle and its (possibly max) capacity request stays in place, preventing other requests from going through until that source CPU decides to wake up and take up some work.
I haven't been able to reconcile this behavior with the code. There are hooks in dequeue_task_fair and the migration paths that should update the CFS capacity vote if the CPU CFS runqueue is empty.
Since the dequeue_task_fair path calls cpufreq_sched_reset_cap, it will zero out the CPU's capacity vote but this API does not trigger a re-evaluation of the overall required cluster capacity and set a new OPP. Another event (on any CPU in the cluster) will need to occur which will cause the cluster capacity to be re-evaluated. Any chance this is what you are seeing?
I know that there are some ongoing discussions about how to actually enforce a frequency reduction when a CPU enters idle to save power, but this seems to be a more immediate problem since the other CPU(s)' requests are also basically ignored.
If there's a sequence where we can ignore a CPU's request for higher capacity then agreed, that is a more immediate problem.
thanks, Steve
On 11/11/2015 05:02 PM, Steve Muckle wrote:
I'm using the EASv5 3.18 tree with cpufreq_sched. With the sched
governor enabled I've noticed that after a migration or after a switch from a non-fair task to the idle task, the source CPU goes idle and its (possibly max) capacity request stays in place, preventing other requests from going through until that source CPU decides to wake up and take up some work.
I haven't been able to reconcile this behavior with the code. There are hooks in dequeue_task_fair and the migration paths that should update the CFS capacity vote if the CPU CFS runqueue is empty.
There's a bit of strangeness in cpufreq_sched where the CPU driving the max request in the cluster must be the one to set the new OPP. Not sure if this could be related to what you're seeing. I'm about to send out another email on it as I want to take that logic out.
Hi Steve,
On 11/11/2015 5:02 PM, Steve Muckle wrote:
I haven't been able to reconcile this behavior with the code. There are hooks in dequeue_task_fair and the migration paths that should update the CFS capacity vote if the CPU CFS runqueue is empty.
Since the dequeue_task_fair path calls cpufreq_sched_reset_cap, it will zero out the CPU's capacity vote but this API does not trigger a re-evaluation of the overall required cluster capacity and set a new OPP. Another event (on any CPU in the cluster) will need to occur which will cause the cluster capacity to be re-evaluated. Any chance this is what you are seeing?
I took a second look - the problem is that the capacity reset in dequeue_task_fair will only kick in if the task being dequeued is going to sleep, which may not be true in the case of forced preemption. So the dequeue before switching to the migration thread doesn't actually reset the capacity.
In the particular case that I was seeing, the max request was actually ignored because of the cpufreq_sched throttling scheme, which resulted in future requests being ignored:
1) There is a single task TaskA on CPU0.
2) CPU0 makes a max frequency request as part of enqueue_task_fair that is ignored because of throttling, but per_cpu(0,capacity) is set to max (in fact it's 1278 after the capacity margin is added).
3) TaskA is forcefully preempted. dequeue_task_fair is invoked but does not reset capacity since task_sleep=false.
4) migration/0 moves TaskA off of CPU0 and onto CPU1.
5) CPU0 switches to the swapper.
6) Now CPU1 attempts to raise its frequency request (after some decay of TaskA's util so not fmax), but all its requests are ignored since CPU0 has the max request.
7) CPU0 remains in idle for a long time.
It seems that pick_next_idle_task is a better place to reset the capacity unconditionally, but pulling the reset_cap out of the if(task_sleep) block would also work I think?
Thanks, Vikram
On 11/11/2015 7:42 PM, Vikram Mulukutla wrote:
Hi Steve,
On 11/11/2015 5:02 PM, Steve Muckle wrote:
I haven't been able to reconcile this behavior with the code. There are hooks in dequeue_task_fair and the migration paths that should update the CFS capacity vote if the CPU CFS runqueue is empty.
Since the dequeue_task_fair path calls cpufreq_sched_reset_cap, it will zero out the CPU's capacity vote but this API does not trigger a re-evaluation of the overall required cluster capacity and set a new OPP. Another event (on any CPU in the cluster) will need to occur which will cause the cluster capacity to be re-evaluated. Any chance this is what you are seeing?
I took a second look - the problem is that the capacity reset in dequeue_task_fair will only kick in if the task being dequeued is going to sleep, which may not be true in the case of forced preemption. So the dequeue before switching to the migration thread doesn't actually reset the capacity.
In the particular case that I was seeing, the max request was actually ignored because of the cpufreq_sched throttling scheme, which resulted in future requests being ignored:
- There is a single task TaskA on CPU0.
- CPU0 makes a max frequency request as part of enqueue_task_fair
that is ignored because of throttling, but per_cpu(0,capacity) is set to max (in fact it's 1278 after the capacity margin is added).
To be precise, the max request in (2) is made because TaskA was enqueued on CPU0.
- TaskA is forcefully preempted. dequeue_task_fair is invoked but does not reset capacity since task_sleep=false.
- migration/0 moves TaskA off of CPU0 and onto CPU1.
- CPU0 switches to the swapper.
- Now CPU1 attempts to raise its frequency request (after some decay of TaskA's util so not fmax), but all its requests are ignored since CPU0 has the max request.
- CPU0 remains in idle for a long time.
It seems that pick_next_idle_task is a better place to reset the capacity unconditionally, but pulling the reset_cap out of the if(task_sleep) block would also work I think?
Thanks, Vikram
Hi Vikram,
On 11/11/15, Vikram Mulukutla wrote:
On 11/11/2015 7:42 PM, Vikram Mulukutla wrote:
Hi Steve,
On 11/11/2015 5:02 PM, Steve Muckle wrote:
I haven't been able to reconcile this behavior with the code. There are hooks in dequeue_task_fair and the migration paths that should update the CFS capacity vote if the CPU CFS runqueue is empty.
Since the dequeue_task_fair path calls cpufreq_sched_reset_cap, it will zero out the CPU's capacity vote but this API does not trigger a re-evaluation of the overall required cluster capacity and set a new OPP. Another event (on any CPU in the cluster) will need to occur which will cause the cluster capacity to be re-evaluated. Any chance this is what you are seeing?
I took a second look - the problem is that the capacity reset in dequeue_task_fair will only kick in if the task being dequeued is going to sleep, which may not be true in the case of forced preemption. So the dequeue before switching to the migration thread doesn't actually reset the capacity.
In the particular case that I was seeing, the max request was actually ignored because of the cpufreq_sched throttling scheme, which resulted in future requests being ignored:
- There is a single task TaskA on CPU0.
- CPU0 makes a max frequency request as part of enqueue_task_fair
that is ignored because of throttling, but per_cpu(0,capacity) is set to max (in fact it's 1278 after the capacity margin is added).
To be precise, the max request in (2) is made because TaskA was enqueued on CPU0.
- TaskA is forcefully preempted. dequeue_task_fair is invoked but
does not reset capacity since task_sleep=false.
So, why is TaskA preempted? Activation of some other task?
Also, we check for task_sleep=false because we have triggering points for load balancing operations (as you might move more than one task at a time). I'm wondering why those points don't cover this case.
Thanks,
- Juri
- migration/0 moves TaskA off of CPU0 and onto CPU1.
- CPU0 switches to the swapper
- Now CPU1 attempts to raise its frequency request (after some decay of TaskA's util so not fmax), but all its requests are ignored since CPU0 has the max request.
- CPU0 remains in idle for a long time.
It seems that pick_next_idle_task is a better place to reset the capacity unconditionally, but pulling the reset_cap out of the if(task_sleep) block would also work I think?
Thanks, Vikram
On 11/12/2015 1:37 AM, Juri Lelli wrote:
Hi Vikram,
On 11/11/15, Vikram Mulukutla wrote:
On 11/11/2015 7:42 PM, Vikram Mulukutla wrote:
- There is a single task TaskA on CPU0.
- CPU0 makes a max frequency request as part of enqueue_task_fair
that is ignored because of throttling, but per_cpu(0,capacity) is set to max (in fact it's 1278 after the capacity margin is added).
To be precise, the max request in (2) is made because TaskA was enqueued on CPU0.
- TaskA is forcefully preempted. dequeue_task_fair is invoked but
does not reset capacity since task_sleep=false.
So, why is TaskA preempted? Activation of some other task?
Also, we check for task_sleep=false because we have triggering points for load balancing operations (as you might move more than one task at a time). I'm wondering why those points don't cover this case.
So in the particular trace I was looking at, it's the sched_exec call that forces the process off of CPU0 to CPU1. TaskA started running on CPU0, called do_exec, and that led to sched_exec->select_task_rq_fair. That resulted in CPU1 being selected, which then caused migration/0 to run on CPU0 to force TaskA onto CPU1. CPU0 then switched to idle and went to sleep with a max cap request.
I suppose a task setting its own affinity would also potentially result in this situation?
Thanks, Vikram
On 11/12/2015 02:03 PM, Vikram Mulukutla wrote:
So in the particular trace I was looking at it's the sched_exec call that forces the process off of CPU0 to CPU1. So, TaskA started running on CPU0, called do_exec and that led to sched_exec->select_task_rq_fair. That resulted in CPU1 being selected which then caused migration/0 to run on the CPU0 to force TaskA on to CPU1. CPU0 then switched to idle and went to sleep with a max cap request.
I suppose a task setting its own affinity would also potentially result in this situation?
Yep. It looks like the migration paths in core.c were missed and don't have the hooks to update CPU capacity votes.
I've added this to my todo list. Not sure what your tree looks like/what version of the code you've got but you probably want to put the same hooks used to adjust CPU capacity in the load balance path in migration_cpu_stop() as well. In the latest that's update_capacity_of(cpu).