On 28 October 2016 at 10:13, Leo Yan leo.yan@linaro.org wrote:
On Thu, Oct 27, 2016 at 08:37:05PM +0100, Dietmar Eggemann wrote:
Hi Leo,
On 26/10/16 18:28, Leo Yan wrote:
o This patch series evaluates whether an rb tree can be used to track task load and util on the rq. One concern with this method is that rb tree operations have O(log(N)) complexity, so maintaining the tree introduces extra overhead on every enqueue and dequeue. To measure this I used hackbench for stress testing: it spawns a large number of sender and receiver tasks, producing many enqueue and dequeue operations, so it tells us whether the rb tree maintenance adds significant overhead or not (thanks a lot to Chris for suggesting this).
Another concern is that the scheduler already provides the LB_MIN feature; with LB_MIN enabled the scheduler avoids migrating tasks with load < 16, which to some extent also filters out the small tasks and leaves the big ones for migration. So we need to compare power data between this patch series and simply enabling LB_MIN.
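For readers following along, here is a minimal sketch of what "track task load on the rq with an rb tree" could look like. The cfs_rq->task_load_root field, the load_node member in task_struct and the helper names are invented for illustration only; they are not the actual patch:

/*
 * Sketch only: keep the rq's CFS tasks in an rb tree ordered by
 * load_avg so load balance can walk them from largest to smallest.
 * task_load_root and load_node are hypothetical names.
 */
static void big_task_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
{
	struct rb_node **link = &cfs_rq->task_load_root.rb_node;
	struct rb_node *parent = NULL;
	unsigned long load = p->se.avg.load_avg;

	while (*link) {
		struct task_struct *entry;

		parent = *link;
		entry = rb_entry(parent, struct task_struct, load_node);
		/* Bigger load goes left, so the leftmost node is the biggest task. */
		if (load > entry->se.avg.load_avg)
			link = &parent->rb_left;
		else
			link = &parent->rb_right;
	}

	rb_link_node(&p->load_node, parent, link);
	rb_insert_color(&p->load_node, &cfs_rq->task_load_root);
}

static void big_task_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
{
	rb_erase(&p->load_node, &cfs_rq->task_load_root);
}

The insert path is the O(log(N)) part the hackbench test is meant to stress; the erase is also O(log(N)), versus O(1) for the plain cfs_tasks list.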
I have difficulty understanding the whole idea here. Basically, you're still doing classical load balancing (lb), with the aim of bringing env->imbalance ((runnable) load based) down to 0. On a system like Hikey (SMP), any order (load or util related) of the tasks can potentially change how many tasks a dst cpu pulls. With a list ordered from large to small load we potentially pull only one task (not necessarily the first one, because a task is skipped while 'task_h_load(p)/2 > env->imbalance', i.e. while its load is more than twice the remaining imbalance). But how can this help to increase performance in a workload-agnostic way?
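To make the ordering effect concrete (numbers invented for illustration): suppose env->imbalance is 100 and the runnable tasks have task_h_load of 250, 120 and 40. Walking large to small, the 250 task is skipped (250/2 > 100), the 120 task is detached, imbalance becomes 100 - 120 = -20 <= 0 and the loop stops, so exactly one task moves. Walking in an arbitrary order such as 40, 250, 120, the 40 task is detached (imbalance 60), the 250 task is skipped (125 > 60), and the 120 task is detached, so two tasks move for the same imbalance.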
On 4.4, the result is better than with the simple list. Vincent also suggested I do the comparison on the mainline kernel, and there the result shows my rb-tree patches introduce a performance regression:
mainline kernel:
  real 2m23.701s   user 1m4.500s    sys 4m34.604s

mainline kernel + fork regression patch:
  real 2m24.377s   user 1m3.952s    sys 4m39.928s
  real 2m19.100s   user 0m48.776s   sys 3m33.440s

mainline with big task tracking:
  real 2m28.837s   user 1m16.388s   sys 5m26.864s
  real 2m28.501s   user 1m18.104s   sys 5m30.516s
It would be interesting to understand where the huge difference between the mainline numbers above and your 1st test on v4.4 comes from. The 1st results on v4.4 were:

              real       user      sys
  baseline    6m00.57s   1m41.72s  34m38.18s
  rb tree     5m55.79s   1m33.68s  34m08.38s
Is the difference linked to v4.4 vs mainline? A different version of hackbench? A different rootfs/distro? Something else?
mainline with big task tracking + fork regression patch:
  real 2m29.369s   user 1m16.432s   sys 5m27.668s
  real 2m28.888s   user 1m14.792s   sys 5m27.800s
fork regression patch: https://lkml.kernel.org/r/20161010173440.GA28945@linaro.org
Besides this periodic lb case, what's the benefit of potentially pulling the biggest task instead of an arbitrary one in active lb?
The original idea is to find a way to migrate only big tasks from LITTLE cores to big cores and to avoid migrating small tasks to big cores, in order to save power.
So at the beginning I thought the rb tree's potential benefit was for power (and I also hoped it could make it easier for a single big task to be migrated to a big core).
Now if we think about a big.LITTLE system and we are on DIE sd level, an ordered list (large to small) on little cpu and vice versa on big could potentially make sense. Not very scalable though because we're just exploiting the idea of _two_ items with an opposite property (big vs. LITTLE) and different order.
Good point. I only considered the case of task migration from LITTLE cores to big cores; we should iterate tasks from small to large when migrating from a big core to a LITTLE core.
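A rough sketch of what that could look like when walking the (hypothetical) load-ordered rb tree from the earlier example, where the relative capacity of dst vs. src cpu picks the walk direction. The helper name and the capacity comparison are illustrative assumptions, not part of the posted patches:

/*
 * Sketch only: walk the load-ordered rb tree largest-first when
 * pulling towards a higher-capacity cpu, smallest-first otherwise.
 * task_load_root and load_node are the same hypothetical names as
 * in the earlier enqueue sketch.
 */
static struct task_struct *next_candidate(struct lb_env *env,
					   struct rb_node **pos)
{
	bool to_bigger_cpu = capacity_orig_of(env->dst_cpu) >
			     capacity_orig_of(env->src_cpu);
	struct cfs_rq *cfs_rq = &env->src_rq->cfs;

	if (!*pos)
		/* Leftmost node is the biggest task in this ordering. */
		*pos = to_bigger_cpu ? rb_first(&cfs_rq->task_load_root) :
				       rb_last(&cfs_rq->task_load_root);
	else
		*pos = to_bigger_cpu ? rb_next(*pos) : rb_prev(*pos);

	if (!*pos)
		return NULL;

	return rb_entry(*pos, struct task_struct, load_node);
}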
Another remark: where do you see the benefit of using util instead of load? In both cases you do 'task_h_load(p)/2 > env->imbalance', 'env->imbalance -= task_h_load(p)' and 'if (env->imbalance <= 0) break', i.e. a load-based policy anyway?
Regarding using util instead of load: as Juri pointed out in a previous meeting, this may potentially introduce an smp nice issue. But in practice, the load value includes running time + runnable time, while the util value only includes running time. So ordering by util would prefer the longer-running one of two tasks that may have a similar load value.
I don't change any policy for the migration conditions. But this may be a good hint for optimization :)
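For reference, the two PELT signals being discussed live side by side in the per-entity average; roughly the v4.4-era struct sched_avg (paraphrased from memory, layout may differ between versions):

struct sched_avg {
	u64		last_update_time, load_sum;
	u32		util_sum, period_contrib;
	unsigned long	load_avg;	/* decayed runnable + running time, nice-weighted */
	unsigned long	util_avg;	/* decayed running time only */
};

Ordering the rb tree by util_avg instead of load_avg only changes which task is considered first; the imbalance accounting quoted above stays load based either way.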
The LB_MIN feature (avoiding considering tiny tasks for pulling) is a related thing, but IMHO aren't we seeing results that are pretty much inside the noise here, for one particular system/workload combination?
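For reference, the LB_MIN check sits in the detach_tasks() loop in kernel/sched/fair.c; roughly like this (simplified paraphrase, not compilable as-is, details vary by kernel version):

	while (!list_empty(tasks)) {
		p = list_first_entry(tasks, struct task_struct, se.group_node);

		load = task_h_load(p);

		/* LB_MIN: skip tiny tasks unless load balance keeps failing. */
		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
			goto next;

		/* Don't pull a task more than twice as big as the remaining imbalance. */
		if ((load / 2) > env->imbalance)
			goto next;

		detach_task(p, env);
		list_add(&p->se.group_node, &env->tasks);

		detached++;
		env->imbalance -= load;

		/* Stop once enough load has been moved. */
		if (env->imbalance <= 0)
			break;

		continue;
next:
		list_move_tail(&p->se.group_node, tasks);
	}

So LB_MIN only filters which tasks are eligible; it doesn't change the order in which candidates are visited, which is what the rb tree proposal is about.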
Sorry, I'm not quite clear about your question. Do you mean we cannot get strong evidence from the testing results to show whether LB_MIN or the rb tree is better?
Let's talk about it further in tomorrow's meeting.
Sure.
Thanks, Leo Yan