On Wed, Oct 12, 2016 at 12:35:00PM +0100, Patrick Bellasi wrote:
[...]
On 12-Oct 10:33, Leo Yan wrote:
On Tue, Oct 11, 2016 at 04:58:28PM +0100, Patrick Bellasi wrote:
On 11-Oct 21:37, Leo Yan wrote:
On Tue, Oct 11, 2016 at 12:35:26PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
Energy-aware scheduling considers the system to be over the tipping point as soon as any CPU is overutilized. There are several occasions on which the root domain's overutilized flag gets set to indicate the system is over the tipping point, e.g. the scheduler tick, load balance and task enqueue. On the other hand, the scheduler relies only on load balance's update_sg_lb_stats() to iterate over all CPUs, check that none of them is overutilized, and then clear the flag once the system is back under the tipping point.
An idle CPU keeps a stale utilization value, which is not updated until the CPU wakes up. In the worst case the CPU may stay in idle states for a very long time (even seconds), so if it had a high utilization value just before entering idle, the scheduler keeps considering it "overutilized" and never switches back to the under-tipping-point state. As a result, a very small task can stay on a big core for a long time because the system cannot go back to the energy-aware path.
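For reference, the per-CPU test that ends up keeping the flag set looks roughly like the sketch below (based on the EAS patches; the exact helper names and the capacity_margin value may differ between trees). Since cpu_util() is derived from the rq's util_avg, an idle CPU whose util_avg has not been decayed keeps reporting itself as overutilized:

    /*
     * Sketch of the EAS-style overutilization test. cpu_util(),
     * capacity_of() and capacity_margin (~1280/1024) are assumed to
     * follow the EAS patch set.
     */
    static bool cpu_overutilized(int cpu)
    {
            return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
    }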
What happens instead if a busy CPU has just entered idle but is likely to exit quite soon?
If CPUs can update their utilization in a timely manner, the small task can usually migrate back to a LITTLE core quickly.
I was referring mainly to the case of, for example, an 80% task with a short period (e.g. 16ms) which has just gone to sleep. Such a task has a PELT signal which varies in [800:860] once stable.
In this case the CPU can appear to be idle, but by the time the task wakes up we will not have decayed its utilization below the tipping point.
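Just to make those numbers concrete (my own back-of-the-envelope, assuming the standard 32ms PELT half-life, i.e. a per-period decay factor y with y^32 = 0.5):

    y^32 = 0.5          =>  y ~= 0.9786 per 1024us PELT period
    sleep ~= 3.2ms      =>  decay ~= y^3.2 ~= 0.93 per cycle
    860 * 0.93 ~= 800   =>  the signal oscillates roughly in [800:860]

So the ~3.2ms of sleep per 16ms cycle is never enough to decay the utilization below the tipping point, even though the CPU looks idle for that whole window.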
For this case, if load balance has previously cleared the overutilized flag because the CPUs were idle, I think we can still rely on the 80% task setting the 'overutilized' flag again when it is enqueued.
Mmm... don't we risk scheduling the 80% task using the energy-aware path instead of select_idle_siblings?
The main issue to me is that we go back to energy-aware scheduling while the system, in this specific scenario, is overutilized. I don't know what the implications are, but that's a critical point to consider... what about scheduler stability? Don't we risk increasing the frequency of entering/exiting EAS mode?
This is a good point. I will try your proposal to update idle CPUs' utilization in update_sg_lb_stats(), so that idle CPUs' signals can also decay step by step.
This patch checks the CPU's idle state in update_sg_lb_stats(): if a CPU is idle it is simply considered not overutilized, so idle CPUs can no longer keep the system over the tipping point.
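In other words, the per-CPU check in update_sg_lb_stats() ends up looking roughly like this (a sketch of the idea rather than the exact diff; the loop and the *overutilized out-parameter follow the EAS code structure):

    for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
            struct rq *rq = cpu_rq(i);

            /* ... existing per-CPU stats accumulation using rq ... */

            /*
             * Skip the overutilization test for idle CPUs so that their
             * stale util_avg cannot keep the root domain over the
             * tipping point.
             */
            if (!idle_cpu(i) && cpu_overutilized(i))
                    *overutilized = true;
    }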
Maybe it's possible, just for idle CPUs marked as overutilized, to trigger at this point an update_cfs_rq_load_avg and then verify if the utilization signal has decayed enough for the CPU to be considered not overutilized anymore?
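Something along these lines, I guess (only a sketch; the update_cfs_rq_load_avg() signature and the locking helpers differ between kernel versions, so take the names below as assumptions):

    if (idle_cpu(i) && cpu_overutilized(i)) {
            struct rq *rq = cpu_rq(i);
            unsigned long flags;

            /* Decay the stale signal of the idle CPU before trusting it. */
            raw_spin_lock_irqsave(&rq->lock, flags);
            update_cfs_rq_load_avg(cfs_rq_clock_task(&rq->cfs), &rq->cfs);
            raw_spin_unlock_irqrestore(&rq->lock, flags);
    }

    if (cpu_overutilized(i))
            *overutilized = true;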
Yes. Essentially this issue is caused by the util_avg signal not being updated in time, so it's better to find a method to update the idle CPU's utilization properly.
IIRC, Morten reminded me in an earlier email that if we want to update an idle CPU's load_avg, we should acquire the rq's lock.
True, but that's an idle CPU, thus it should not be a major issue. Unless we already hold the lock of another CPU, but I don't think that's our case (see below).
Another thing to consider is that an idle CPU's stale utilization does not only impact this case; it also misleads CPU selection in the wakeup path. That is why I previously sent another patch to update idle CPUs' utilization with update_blocked_averages() [1].
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000567.html
Exactly; does it make sense to use the same update_blocked_averages() for an idle CPU at this point?
Using update_blocked_averages() is safe, I think, and your suggestion to update idle CPUs' utilization obviously adds less overhead than my patch [1].
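For completeness, that variant would be something like the sketch below (again an assumption on my side, not a tested diff; update_blocked_averages() acquires the remote rq lock internally, which is exactly the locking question discussed further down):

    /*
     * Refresh the blocked load/util of an idle CPU before using its
     * utilization for the tipping point test. update_blocked_averages()
     * takes the remote rq's lock itself.
     */
    if (idle_cpu(i))
            update_blocked_averages(i);

    if (cpu_overutilized(i))
            *overutilized = true;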
This method is worth trying. BTW, I'm just curious: do you know how to measure the scheduler overhead? I have no idea how much extra overhead this patch introduces, so I want to compare performance before and after applying the patch.
You can probably collect some interesting stats using LISA's new function-profiling support. This should allow you to profile the time spent in update_sg_lb_stats (or any other function).
Please note that you should probably change the definition of this function from:

    static inline void update_sg_lb_stats

to be:

    noinline void update_sg_lb_stats
The usage of "noinline" will ensure that the function is available to ftrace for profiling.
Here is an example notebook to get started: https://github.com/ARM-software/lisa/blob/master/ipynb/examples/trace_analys...
You can easily update the notebook to get a comparison between two runs. Here is a notebook I've used to compare two different versions of energy_diff: https://gist.github.com/derkling/fd9d36e004da977cd3be0001c8abaa96
Thanks for sharing. Will do this.
But I think patch [1] is hard to get accepted because it introduces race conditions on the rq locks. That is why, in this patch, I went back to the simplest method, which resolves only part of the issue.
AFAICS, at least on a 3.18 kernel, these are the call paths that end up in update_sg_lb_stats:
    idle_balance
    /* release this_cpu lock */
    rebalance_domains
    /* does not hold any RQ lock */
        load_balance
            find_busiest_group
                update_sg_lb_stats
To me it seems that we always enter the update_sg_lb_stats without holding any RQ lock. Am I missing something?
Otherwise, what are the races we expect if we try to get the RQ lock for an idle CPU from within update_sg_lb_stats?
For example, two CPUs (e.g. CPU0 and CPU1) can race to update CPU2's util_avg and load_avg:
          CPU0                                          CPU1

    idle_balance                                 ttwu_queue(CPU2's rq)
    /* release this_cpu lock */                  /* acquire this_cpu lock */
    rebalance_domains                            ttwu_do_activate
    /* does not hold any RQ lock */                ttwu_activate
      load_balance                                   activate_task
        find_busiest_group                             enqueue_task
          update_sg_lb_stats                             enqueue_task_fair
            update_blocked_averages(CPU2's rq)             enqueue_entity
                    `--------- race condition --------------> update_load_avg
                                                 /* release this_cpu lock */
Ok, that's a race but it does not lead to a deadlock. Either CPU0 or CPU1 will update the CPU2 stats and then the other CPU will get the lock and just bail out because the stats have already been updated.
Is that an issue? Maybe I'm missing something...
You are right, this will not introduce a deadlock; but the system will see more contention on acquiring/releasing the rq's lock, which could introduce a performance regression.
[...]
Thanks, Leo Yan