Hi Joonwoo,
On Mon, Oct 24, 2016 at 02:34:51PM -0700, Joonwoo Park wrote:
[...]
Boosted utilization does not mark a CPU overutilized, thus we should use task_misfits as well to move these tasks.
What about:
	if (energy_aware())
		return (cpu_overutilized(cpu) || rq->misfit_task);
	else if (rq->nr_running >= 2)
		return true;
I ran into the same problem while testing upmigration latency with a single CPU-bound task. The above fix suggested by Patrick, combined with 'sched/fair: replace capacity_of by capacity_orig_of' by Leo, fixed my problem and reduced upmigration latency from ~1 sec to ~250 ms.
I found that Patrick's fix introduces a significant number of rescheduling IPIs; this is because once the big core is "overutilized" it will kick off nohz idle balance, even when the big core has only one running task. Comparing Patrick's suggestion against my v1 patch, the "Rescheduling interrupts" count increases by more than 50%. As a side effect, this also harms energy by waking up CPUs.
So I prefer to go back to the v1 patch; I need Patrick's review to confirm whether this is okay or not.
Okay. I haven't tried your original patch yet. I happened to make the fix below, which addressed my test case; I then found Patrick's suggestion and confirmed it does the same job.
	if (rq->nr_running >= 2 && !energy_aware())
		return true;

	if (energy_aware() && cpu_overutilized(cpu))
		return true;
Looking at your original patch again, I'm a bit worried about upmigration latency delay, since misfit_task will only be set from the scheduler tick path when there is one CPU-bound task running without a new wakeup. I will see what Patrick thinks about it and do another test.
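For reference, this is roughly the tick hook I have in mind; a sketch from memory of the EAS series (function names like task_fits_max() and the exact placement are my assumption and may not match this tree):

/*
 * Sketch (from memory, not verified against this exact tree) of how the
 * EAS series refreshes the misfit state from the tick path. Because this
 * only runs at the tick, a task that becomes CPU bound without a new
 * wakeup waits for the next tick before it is flagged.
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	/* ... existing entity tick / preemption work ... */

	/* Mark the root domain overutilized once this CPU crosses the margin. */
	if (!rq->rd->overutilized && cpu_overutilized(rq->cpu))
		rq->rd->overutilized = true;

	/* Flag a running task that no longer fits this CPU's capacity. */
	rq->misfit_task = !task_fits_max(curr, rq->cpu);
}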
Yeah, so how about the code below:
	int max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val;

	if (rq->nr_running >= 2 && !energy_aware())
		return true;

	if (energy_aware() && cpu_overutilized(cpu)) {
		/*
		 * If this CPU already has the highest capacity and has only
		 * one task, it's pointless to trigger load balance for it.
		 * So only kick load balance when it has >= 2 tasks, to
		 * spread tasks as much as possible.
		 */
		if ((capacity_orig_of(cpu) == max_cap) && rq->nr_running >= 2)
			return true;

		/*
		 * For a lower capacity CPU, always kick off load balance;
		 * even if it has only one task, it may be possible to
		 * migrate it to a higher capacity CPU.
		 */
		if (capacity_orig_of(cpu) < max_cap)
			return true;
	}

	/*
	 * Still need to check the misfit flag; this flag is set when a
	 * runnable task switches to being the running task. So use this
	 * flag for a better chance to kick load balance once the task is
	 * actually running.
	 */
	if (energy_aware() && rq->misfit_task)
		return true;
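For context, here is a rough skeleton of where I assume this fragment sits, based on the stock nohz_kick_needed(); the surrounding checks are abbreviated and the exact layout may differ in your tree:

static inline bool nohz_kick_needed(struct rq *rq)
{
	unsigned long now = jiffies;
	int cpu = rq->cpu;

	if (unlikely(rq->idle_balance))
		return false;

	/* ... nohz bookkeeping (busy state, nohz.nr_cpus) elided ... */

	if (time_before(now, nohz.next_balance))
		return false;

	/*
	 * The block above replaces the plain "rq->nr_running >= 2"
	 * check at this point.
	 */

	/* ... sched_domain busy/asym-packing checks follow ... */
	return false;
}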
~250 ms is still huge, but in my setup the CPU only became overutilized after ~210 ms of ramp-up time, so that's a separate problem.
The ramp-up time is longer than expected. When set to the highest OPP, the util_avg takes 31 ms to reach the 50% level and 74 ms to reach 80%: http://people.linaro.org/~leo.yan/eas_profiling/pelt/pelt_up_down_y%5e32_0.5...
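For comparison, those numbers roughly match the continuous approximation of the PELT geometric series; a quick userspace sketch (my own, ignoring the discrete 1024 us segments and frequency/capacity scaling):

#include <math.h>
#include <stdio.h>

/*
 * With the default 32 ms half-life, a task running continuously at the
 * highest OPP ramps roughly as util(t) ~= 1 - 0.5^(t / 32ms), so ~49% at
 * 31 ms and ~80% at 74 ms, close to the trace above.
 */
int main(void)
{
	const double half_life_ms = 32.0;
	const double points[] = { 31.0, 74.0 };
	unsigned int i;

	for (i = 0; i < sizeof(points) / sizeof(points[0]); i++) {
		double t = points[i];
		double util = 1.0 - pow(0.5, t / half_life_ms);

		printf("t = %2.0f ms -> util ~= %2.0f%%\n", t, util * 100.0);
	}
	return 0;
}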
This is with the upstream PELT decaying factor? Interestingly I even
Yes.
have the PELT half-life change running and still see a longer ramp-up time. So mine should have ramped up even faster than yours.
This is weird; I also generated a similar patch before and could see the PELT signal ramp up in a shorter time. I've pasted the modification in case you are interested:
---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1bb7efd..4819e17 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -665,9 +665,9 @@ static unsigned long task_h_load(struct task_struct *p);
  * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
  * dependent on this value.
  */
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+#define LOAD_AVG_PERIOD 16
+#define LOAD_AVG_MAX 24130 /* maximum possible load avg */
+#define LOAD_AVG_MAX_N 174 /* number of full periods to produce LOAD_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -2449,12 +2449,9 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
 static const u32 runnable_avg_yN_inv[] = {
-	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
-	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
-	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
-	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
-	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
-	0x85aac367, 0x82cd8698,
+	0xffffffff, 0xf5257d14, 0xeac0c6e6, 0xe0ccdeeb, 0xd744fcc9, 0xce248c14,
+	0xc5672a10, 0xbd08a39e, 0xb504f333, 0xad583ee9, 0xa5fed6a9, 0x9ef5325f,
+	0x9837f050, 0x91c3d373, 0x8b95c1e3, 0x85aac367,
 };
 
 /*
@@ -2462,9 +2459,8 @@ static const u32 runnable_avg_yN_inv[] = {
  * over-estimates when re-combining.
  */
 static const u32 runnable_avg_yN_sum[] = {
-	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
-	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
-	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
+	    0,  980, 1919, 2818, 3679, 4503, 5292, 6048, 6772, 7465, 8129,
+	 8764, 9373, 9956,10514,11048,11560,
 };
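In case it helps reproduce the numbers, the two tables can be regenerated for a different half-life with a small userspace helper; a sketch (my own, the HALFLIFE macro and the floor/accumulation rounding are assumptions that should reproduce the values above for 16):

#include <math.h>
#include <stdio.h>

#define HALFLIFE 16	/* set to 32 to regenerate the upstream tables */

int main(void)
{
	/* y is chosen so that y^HALFLIFE == 0.5 */
	const double y = pow(0.5, 1.0 / HALFLIFE);
	unsigned long sum = 0;
	int n;

	printf("runnable_avg_yN_inv[] = {\n\t");
	for (n = 0; n < HALFLIFE; n++)
		printf("0x%08x,%s", (unsigned int)(0xffffffffUL * pow(y, n)),
		       (n % 6 == 5) ? "\n\t" : " ");
	printf("\n};\n");

	/* Cumulative sum of the floored per-period contributions. */
	printf("runnable_avg_yN_sum[] = {\n\t    0,");
	for (n = 1; n <= HALFLIFE; n++) {
		sum += (unsigned long)(1024 * pow(y, n));
		printf("%5lu,", sum);
	}
	printf("\n};\n");

	return 0;
}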
So which CPUFreq governor are you using? Previously, when I used the "ondemand" governor with a long sampling window (80 ms), I could see similar behaviour.
I'm using sched-freq.
I'm not familiar with sched-freq related tuning, but you can use the "performance" governor to check whether PELT ramps up as expected; after that you can analyze whether sched-freq keeps the CPU at a low frequency for a long time and thereby suppresses PELT's ramp-up.
Thanks,
Leo Yan