Re: [Eas-dev] [PATCH 0/3] Evaluation for tracking task load/util with rb tree

27 Oct 2016

      On Thu, Oct 27, 2016 at 10:28:41PM +0800, Leo Yan wrote:
...
Hi Morten,
On Thu, Oct 27, 2016 at 02:38:12PM +0100, Morten Rasmussen wrote:
[...]
...
...
o Testing result:
Tested hackbench on Hikey with CA53x8 CPUs with SMP load balance:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done'
                   real           user           system
  baseline         6m00.57s       1m41.72s       34m38.18s
  rb tree          5m55.79s       1m33.68s       34m08.38s
For the hackbench test case, the rb tree even gives a better result
  than the baseline kernel.
It is around 1% difference, so pretty much in the noise? Have you tried
longer runs of hackbench? For the capacity awareness postings I'm using:
Sorry, in my previous testing I mistakenly used WALT signals, so please
ignore those results.
I did more testing with PELT signals; the results are below:
baseline:
5m57.86s real     1m42.36s user    34m23.30s system (PELT)
5m56.60s real     1m41.45s user    34m16.23s system  (PELT)
with rb-tree patches:
5m43.84s real     1m35.82s user    32m56.43s system (PELT)
5m46.67s real     1m35.39s user    33m18.07s system  (PELT)
The real run time does not decrease much (10s ~ 14s), but the system
time decreases noticeably (58.16s ~ 86.87s) for multi-core
processing.
So we have ~3% improvement for real run time and ~6% user time, which
seems significant.
The next question is why? It is an SMP system, it shouldn't matter how
the task list is ordered and I doubt that an RB tree is faster to use
than the list data structure. So I can't really explain the numbers.
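For reference, the percentages quoted above can be reproduced from the PELT timings with a short script. This is only a sketch: the helper names are made up for illustration, and with two runs per configuration it says nothing about run-to-run noise.

```python
import re

def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '5m57.86s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# (real, user, system) per run, copied from the PELT results in this thread.
baseline = [("5m57.86s", "1m42.36s", "34m23.30s"),
            ("5m56.60s", "1m41.45s", "34m16.23s")]
rb_tree = [("5m43.84s", "1m35.82s", "32m56.43s"),
           ("5m46.67s", "1m35.39s", "33m18.07s")]

def column_means(runs):
    """Mean of each column (real, user, system) in seconds."""
    columns = zip(*[[to_seconds(s) for s in run] for run in runs])
    return [sum(c) / len(c) for c in columns]

def improvement_percent(base_runs, new_runs):
    """Relative reduction of each column mean, in percent of the baseline."""
    base = column_means(base_runs)
    new = column_means(new_runs)
    return [100.0 * (b - n) / b for b, n in zip(base, new)]

real, user, system = improvement_percent(baseline, rb_tree)
print(f"real {real:.1f}%  user {user:.1f}%  system {system:.1f}%")
```

Averaging the two runs per configuration is an arbitrary choice here; comparing best-to-best or worst-to-worst would shift the figures slightly.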
...
...
    perf stat --null --repeat 10 -- \
    perf bench sched messaging -g 50 -l 1000

I'm still curious how many tasks are actually on the rq for
hackbench, but it seems that overhead isn't a major issue.
From the code, hackbench launches 40 processes at the same time (20
message senders and 20 receivers).
How many processes in total?
The perf bench command gives you 20 sender + 20 receivers in each of
the 50 groups for a total of 2000 processes.
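The arithmetic behind that total, as a tiny sketch. The function name is hypothetical, and the per-group defaults are simply the 20 + 20 figures mentioned in this thread:

```python
def messaging_task_count(groups: int,
                         senders_per_group: int = 20,
                         receivers_per_group: int = 20) -> int:
    """Total processes for a sender/receiver benchmark with `groups` groups."""
    return groups * (senders_per_group + receivers_per_group)

total = messaging_task_count(groups=50)  # matches the -g 50 invocation
print(total)
```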
...
Do you think this is similar to the perf test you mentioned?
...
...
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN         Nrg:LITTLE     Nrg:Big        Nrg:Sum
               11.3122        8.983429       20.295629
               11.337446      8.174061       19.511507
               11.256941      8.547895       19.804836
               10.994329      9.633028       20.627357
               11.483148      8.522364       20.005512
    avg.       11.2768128     8.7721554      20.0489682
    stdev                                    0.431777914

...
rb tree        Nrg:LITTLE     Nrg:Big        Nrg:Sum
               11.384301      8.412714       19.797015
               11.673992      8.455219       20.129211
               11.586081      8.414606       20.000687
               11.423509      8.64781        20.071319
               11.43709       8.595252       20.032342
    avg.       11.5009946     8.5051202      20.0061148
    stdev                                    0.1263371635
...
    vs LB_MIN  +1.99%         -3.04%         -0.21%

Should I read this as the energy benefits of the rb-tree solution being
negligible? They seem to be much smaller than the error margins.
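One way to sanity-check the "smaller than the error margins" reading is to recompute the averages, the sample standard deviation of the sums, and the relative changes from the per-run figures in the Juno tables. This is a rough sketch only: the variable and helper names are made up, and comparing the difference in means against the larger stdev is a crude rule of thumb, not a proper significance test.

```python
from statistics import mean, stdev

# Per-run energy figures copied from the Juno video playback tables.
lb_min_little = [11.3122, 11.337446, 11.256941, 10.994329, 11.483148]
lb_min_big = [8.983429, 8.174061, 8.547895, 9.633028, 8.522364]
rb_little = [11.384301, 11.673992, 11.586081, 11.423509, 11.43709]
rb_big = [8.412714, 8.455219, 8.414606, 8.64781, 8.595252]

def totals(little, big):
    """Per-run energy sum over both clusters."""
    return [l + b for l, b in zip(little, big)]

def percent_change(new, old):
    """Change of mean(new) relative to mean(old), in percent."""
    return 100.0 * (mean(new) - mean(old)) / mean(old)

lb_min_sum = totals(lb_min_little, lb_min_big)
rb_sum = totals(rb_little, rb_big)

# statistics.stdev is the sample (n - 1) standard deviation.
delta = mean(rb_sum) - mean(lb_min_sum)
within_noise = abs(delta) < max(stdev(lb_min_sum), stdev(rb_sum))

print(f"LITTLE {percent_change(rb_little, lb_min_little):+.2f}%")
print(f"big    {percent_change(rb_big, lb_min_big):+.2f}%")
print(f"sum    {percent_change(rb_sum, lb_min_sum):+.2f}%")
print(f"delta {delta:+.4f}, within one stdev: {within_noise}")
```

With only five runs per configuration, a longer series would be needed before drawing firm conclusions either way.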
From the power data on Juno, the final difference is quite small. But
if we look at the energy for the big cluster and the LITTLE cluster
separately, we can see the benefit: big-core energy is reduced and more
tasks run on the LITTLE cores.
From my previous experience, the power optimization patches can save
10% power on another b.L system for the camera case, but they only save
2.97% CPU power for video playback on Juno. This is because video
playback does not have enough small tasks; for a case with many
small tasks we can see more benefit.
Do you have suggestion for power testing case on Juno?
In the end it is real energy consumption that matters, shifting energy
from big to little doesn't really matter if the sum is the same which
seems to be the case for the Juno test.
You are suggesting that video doesn't have enough small tasks. Would it
be possible to use rtapp to generate a bunch of small tasks of roughly
similar size to those you see for the camera case?
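A rough sketch of what generating such a workload description could look like. Everything here is an assumption: the task count, run time and period are placeholders rather than measurements from the camera case, the helper name is made up, and the JSON keys ("tasks", "global", "loop", "run", "sleep", "duration", "default_policy") follow my recollection of the rt-app tutorial, so they should be checked against the rt-app version in use.

```python
import json

def small_task_config(n_tasks: int, run_us: int, period_us: int,
                      duration_s: int) -> dict:
    """Build an rt-app style description of n_tasks identical periodic tasks.

    Each task runs for run_us microseconds and then sleeps for the rest of
    its period. Key names are assumed from the rt-app tutorial.
    """
    if not 0 < run_us < period_us:
        raise ValueError("run_us must be positive and shorter than period_us")
    tasks = {
        f"small{i}": {
            "loop": -1,                    # assumed: repeat until duration ends
            "run": run_us,                 # busy time per cycle, microseconds
            "sleep": period_us - run_us,   # idle time per cycle, microseconds
        }
        for i in range(n_tasks)
    }
    return {
        "tasks": tasks,
        "global": {"duration": duration_s, "default_policy": "SCHED_OTHER"},
    }

# Placeholder numbers: ten tasks at roughly 6% duty cycle for 30 seconds.
config = small_task_config(n_tasks=10, run_us=1000, period_us=16000,
                           duration_s=30)
print(json.dumps(config, indent=4))
```

The task sizes would need to be taken from traces of the real camera workload for the comparison to be meaningful.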
...
...
Do we have evidence that the util-rb-tree solution is any better than
using LB_MIN?
I think the evidence is that the big cluster uses 3.04% less energy than
with LB_MIN; this is not a big difference on Juno. The difference could
become larger for scenarios with many tasks (like the camera case).
The energy sum difference is only 0.21%.
...
But yes, this should be verified on other SoCs for more confidence.
A synthetic test and a good theory to explain the numbers are also quite
helpful in convincing people :)
Thanks,
Morten
