On Thu, Oct 27, 2016 at 05:14:58PM +0100, Morten Rasmussen wrote:
[...]
Sorry, in my previous testing I mistakenly used WALT signals, so please ignore those results.
I did more testing with PELT signals; the results are below:
baseline (PELT):
  5m57.86s real  1m42.36s user  34m23.30s system
  5m56.60s real  1m41.45s user  34m16.23s system

with rb-tree patches (PELT):
  5m43.84s real  1m35.82s user  32m56.43s system
  5m46.67s real  1m35.39s user  33m18.07s system
The real run time does not decrease by much (10s ~ 14s), but the system time decreases noticeably (58.16s ~ 86.97s) with multi-core processing.
So we have a ~3% improvement in real run time and ~6% in user time, which seems significant.
The next question is why. It is an SMP system, so it shouldn't matter how the task list is ordered, and I doubt that an rb-tree is faster to use than a linked list. So I can't really explain the numbers.
I have not dug into the trace logs yet. Initially I expected the numbers to get worse after applying the rb-tree related patches, but I got the opposite result.
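For reference, the structural difference being measured boils down to something like the sketch below (hypothetical types and helpers, not the actual patches): with tasks indexed by utilization in an rb-tree, the smallest-utilization task is simply the leftmost node, instead of being found by scanning a list.

/*
 * Minimal sketch (hypothetical 'struct util_node', not the real patch set):
 * picking the smallest-utilization task from a plain list is O(n), while an
 * rb-tree keyed by utilization gives it from the leftmost node, at the cost
 * of O(log n) insertion.
 */
#include <linux/list.h>
#include <linux/rbtree.h>

struct util_node {
	struct list_head	list;
	struct rb_node		node;
	unsigned long		util;	/* e.g. a util_avg snapshot */
};

/* list version: walk every entry to find the smallest utilization */
static struct util_node *pick_min_list(struct list_head *head)
{
	struct util_node *n, *min = NULL;

	list_for_each_entry(n, head, list)
		if (!min || n->util < min->util)
			min = n;

	return min;
}

/* rb-tree version: the leftmost node has the smallest utilization */
static struct util_node *pick_min_rb(struct rb_root *root)
{
	struct rb_node *leftmost = rb_first(root);

	return leftmost ? rb_entry(leftmost, struct util_node, node) : NULL;
}

Whether something like that actually accounts for the reduced system time would still need the trace data to confirm.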
perf stat --null --repeat 10 -- \
    perf bench sched messaging -g 50 -l 1000
I'm still curious how many tasks are actually on the rq for hackbench, but it seems that overhead isn't a major issue.
From the code, hackbench launches 40 processes at the same time (20 processes as message senders and 20 as receivers).
How many processes in total?
40 processes x 10 groups = 400 processes for hackbench.
The perf bench command gives you 20 senders + 20 receivers in each of the 50 groups, for a total of 2000 processes.
Do you think this is similar to the perf test you mentioned?
Tested video playback on Juno, LB_MIN vs rb-tree (five runs each):

LB_MIN          Nrg:LITTLE    Nrg:Big      Nrg:Sum
                11.3122       8.983429     20.295629
                11.337446     8.174061     19.511507
                11.256941     8.547895     19.804836
                10.994329     9.633028     20.627357
                11.483148     8.522364     20.005512
  avg.          11.2768128    8.7721554    20.0489682
  stdev (Sum)                               0.431777914

rb-tree         Nrg:LITTLE    Nrg:Big      Nrg:Sum
                11.384301     8.412714     19.797015
                11.673992     8.455219     20.129211
                11.586081     8.414606     20.000687
                11.423509     8.64781      20.071319
                11.43709      8.595252     20.032342
  avg.          11.5009946    8.5051202    20.0061148
  stdev (Sum)                               0.1263371635

vs LB_MIN       +1.99%        -3.04%       -0.21%
Should I read this as the energy benefit of the rb-tree solution being negligible? It seems to be much smaller than the error margins.
From the power data on Juno, the overall difference is quite small. But if we look at the energy for the big and little clusters separately, we can see the benefit: big-cluster energy is reduced because more tasks run on the LITTLE cores.
From my previous experience, the power optimization patches can save ~10% power on another big.LITTLE system for a camera use case, but they only save 2.97% CPU power for video playback on Juno. This is because video playback does not have enough small tasks; for cases with many small tasks we can see more benefit.
Do you have a suggestion for a power test case on Juno?
In the end it is real energy consumption that matters; shifting energy from big to little doesn't really help if the sum stays the same, which seems to be the case for the Juno test.
You are suggesting that video doesn't have enough small tasks. Would it be possible to use rt-app to generate a bunch of small tasks of roughly similar size to those you see for the camera case?
Let me first gather the camera case logs, if possible.
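For reference, a rough stand-in for such a workload (a pthread sketch with made-up run/sleep numbers, not rt-app itself) could look like:

/*
 * Hypothetical small-task generator: NR_TASKS threads each run ~1ms then
 * sleep ~19ms (~5% duty cycle) for DURATION_S seconds.  All numbers are
 * made up and would need tuning against the camera-case traces.
 */
#include <pthread.h>
#include <time.h>

#define NR_TASKS	40
#define RUN_US		1000L	/* busy time per period */
#define SLEEP_US	19000L	/* idle time per period */
#define DURATION_S	30

static void busy_for_us(long us)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000L +
		 (now.tv_nsec - start.tv_nsec) / 1000L < us);
}

static void *small_task(void *arg)
{
	struct timespec idle = {
		.tv_sec  = SLEEP_US / 1000000L,
		.tv_nsec = (SLEEP_US % 1000000L) * 1000L,
	};
	time_t end = time(NULL) + DURATION_S;

	while (time(NULL) < end) {
		busy_for_us(RUN_US);
		nanosleep(&idle, NULL);
	}

	return NULL;
}

int main(void)
{
	pthread_t tid[NR_TASKS];
	int i;

	for (i = 0; i < NR_TASKS; i++)
		pthread_create(&tid[i], NULL, small_task, NULL);
	for (i = 0; i < NR_TASKS; i++)
		pthread_join(tid[i], NULL);

	return 0;
}

Built with something like gcc -O2 small_tasks.c -lpthread, this could be run while collecting the same Nrg numbers as above, with RUN_US/SLEEP_US tuned to match the task sizes seen in the camera case.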
Do we have evidence that the util-rb-tree solution is any better than using LB_MIN?
I think the evidence is that the big cores save 3.04% energy compared to LB_MIN; this is not a big difference on Juno. The difference could become larger for scenarios with many tasks (like the camera case).
The energy sum difference is only 0.21%.
Yeah.
But yes, this should be verified on other SoCs for more confidence.
A synthetic test and a good theory to explain the numbers is also quite helpful in convincing people :)
Yes, agreed :) Will do this and try to use rt-app.
Thanks,
Leo Yan