Hi Chris,
On Tue, Jul 30, 2013 at 5:09 PM, Chris Redpath chris.redpath@arm.com wrote:
Hi Lei,
On 30/07/13 02:57, Lei Wen wrote:
Chris,
On Mon, Jul 29, 2013 at 11:36 PM, Chris Redpath chris.redpath@arm.com wrote:
Hi Lei,
Thanks for looking at these patches. Have you seen the additional optimization patch set that was released to go on-top of 13.06? It's linked from the 13.06 release notes at http://releases.linaro.org/13.06/android/vexpress . The additional patches contained there will be part of the 13.07 release. Please note however that we're releasing these patches against a 3.10-based kernel release and we have not attempted to port or test them on 3.11.
I see. What I have backported are only those listed in the 13.06, no additional optimization patch is involved.
OK. The optimization pack provides improved performance for some common workloads, you should take a look if you have time.
Got it. Thanks!
That said, I will attempt to answer your questions inline.
Thanks for your comments!
On 29/07/13 06:51, Lei Wen wrote:
Hi list,
I am trying to port the Linaro MP patches to other kernel branches, and I find some regression caused by a recent kernel change.
The regression is this: when we test the Linaro 13.06 release with an MP3 playback scenario, we find the CA7 cluster is always busy, while one core of the CA15 cluster is occasionally raised up with an average cpu usage of about 2%, and the other core is kept silent most of the time.
But when we switch to our backported branch, we find both cores in CA15 become active, each keeping a usage ratio around 2%.
With further checking, I find one recent merged patch in mainline cause this: https://lkml.org/lkml/2013/6/27/152
This patch mainly changes the initial load for a new-born task to the largest value, so it causes HMP to decide to move all such tasks to the big cluster. Since cpu3 (the first cpu of CA15) gets busy now, cpu4 is raised up to share the load as a consequence. So ALL 5 cpus are used in the MP3 scenario, which should make the power result look worse than before...
What is your opinion on this regression, especially on how HMP should act with the increased initial load brought in by the merged patch?
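For reference, what the change does (quoting the mainline helper from memory, so treat the details as approximate) is give a forked task a full load history of one scheduling slice, i.e. it is born looking 100% busy:

    /* in kernel/sched/fair.c: a new task starts with runnable_avg_sum equal
     * to runnable_avg_period, so its initial load ratio is the maximum */
    void init_task_runnable_average(struct task_struct *p)
    {
            u32 slice;

            p->se.avg.decay_count = 0;
            slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
            p->se.avg.runnable_avg_sum = slice;
            p->se.avg.runnable_avg_period = slice;
            __update_task_entity_contrib(&p->se);
    }

Since HMP up-migration compares that load ratio against its threshold, every new-born task immediately qualifies for the big cluster.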
In general, the changes introduced in Alex Shi's patch set will probably conflict a bit with big.LITTLE MP patches. Our patches and his both use the same load tracking metrics and having the two patch sets present will probably have some subtle interactions even though on the surface they look relatively benign. We don't intend to work on forward-porting to 3.11 right now since the Linaro LSK is going to stay on 3.10 and we and Linaro are supporting the patch set on that release. Even if you were to remove all the patches from 141965c7 to 2509940f (Alex's switch to use load-tracking) before integrating big.LITTLE MP, I can't guarantee that there hasn't been a behavioural change elsewhere which won't increase power consumption or damage performance since we haven't tested it.
I see.
Specifically about the initial task load profile, we agree with Alex that tasks probably should not always start with zero load for performance reasons and have done something similar in the patch titled "HMP: Force new non-kernel tasks onto big CPUs until load stabilises" in the 13.06 Optimization pack. The aim of that patch is to provide application tasks with access to the maximum compute capacity as soon as possible, and to keep all other tasks migrating between big and little clusters only on the basis of need. It's a kind of initial performance boost behaviour for user application threads, after a couple of sched_periods the load profile will reasonably accurately represent the history and task placement will happen according to need as usual.
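Very roughly, the idea looks like this (a simplified sketch of the concept, not the actual patch; hmp_fork_boost_active and fork_boost_until are hypothetical names, the real code tracks how much load history the entity has accumulated):

    /* Sketch only: steer freshly-forked user tasks (p->mm != NULL) towards
     * a big CPU until their per-entity load average has had a couple of
     * sched_periods to settle; kernel threads are never boosted. */
    static inline bool hmp_fork_boost_active(struct task_struct *p, u64 now)
    {
            if (!p->mm)             /* kernel thread: normal placement */
                    return false;
            /* fork_boost_until is a hypothetical per-entity timestamp */
            return time_before64(now, p->se.avg.fork_boost_until);
    }

While this returns true the fork/wakeup path picks a CPU from the big cluster; afterwards the normal HMP up/down-migration thresholds take over.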
It's interesting that the effect is long-lasting enough to use both A15s - are there tasks being created all the time which are exercising the big CPUs? Alex only adds a small amount of busy time to the accounting, so after 10-20ms its impact should be very small. From the behaviour you describe, I feel that something else is happening to interfere with task placement which is made more visible by the patch you point to. The only way for you to be sure of the cause is to investigate the behaviour - which tasks are resident on the big CPUs, for how long, and what does their load profile look like.
For investigating those kinds of issues, I generally use Streamline or Kernelshark.
Actually, I have used ftrace to try to figure out what is happening. From the trace result, what I can see is that the audio track handling work is done on the A7 cluster, while on the A15 cluster the gatord application occasionally gets run, which is where the 2% usage ratio comes from.
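(For anyone reproducing this: the placement is visible with the standard sched tracepoints via debugfs, roughly

    cd /sys/kernel/debug/tracing
    echo 1 > events/sched/sched_switch/enable
    echo 1 > events/sched/sched_migrate_task/enable
    echo 1 > tracing_on
    # exercise the MP3 scenario for a while, then:
    cat trace > /data/local/mp3-trace.txt

sched_switch shows which tasks are resident on cpu3/cpu4 and for how long, and sched_migrate_task shows when gatord or the kworker moves between clusters.)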
With Alex's patch added, I see an additional kworker migrated to the A15. What I captured for this kworker is __call_usermodehelper, and its frequency is the same as gatord gets executed, every 5s. So it seems to me that the additional workload brought by the kworker triggers the below logic in fair.c:

    if (affine_sd) {
            if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
                    prev_cpu = cpu;

            new_cpu = select_idle_sibling(p, prev_cpu);
            goto unlock;
    }
This moves where gatord runs from cpu3 to cpu4, as cpu4 is idle at that moment, which makes both cpus in CA15 runnable.
Yes, that is likely. Waking tasks will be woken on an idle CPU in the package they last ran on if the previous CPU is not idle at the time. This is normal scheduling behaviour :)
It looks to me like you have updated the kernel without updating the gator module, so the gator daemon doesn't start correctly and Android will attempt to restart the service every 5s. At least when using the VExpress filesystem (I haven't tried the others) Linaro adds a gatord service to the init scripts of Android, putting the gator daemon in /system/bin and the gator module into the initrd.
Yep, that is indeed what you point out.
You can do a few things about it.
1. Rebuild the module and the ramdisk each time you build the kernel.
2. Switch the gator module over to be built in.
3. Change the gatord service to a oneshot service (it won't restart if it fails; see the init.rc sketch below).
4. Disable the gatord service in Android.
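For option 3, the change is just the service stanza in the Android init script that declares gatord (the file name and exact options vary between filesystem builds, so treat this as a sketch):

    # init.rc fragment declaring the gator daemon
    service gatord /system/bin/gatord
        class main
        oneshot        # init will not respawn it when it exits

Option 4 is the same stanza with 'disabled' added (or removed entirely), so init never starts the service.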
I chose to disable the gatord service, and the CA15 cluster becomes quiet now. :)
Having the gator module built in is probably the easiest way to retain the functionality without the extra build steps. The VExpress kernel source from the ARM LT in Linaro has the gator module source integrated, so IIRC you just need to change the config.
When the gatord service can start correctly, this should go away. Obviously, there will be some impact at runtime if you're connected to it and pulling data, but if it's just listening it should not cause a problem.
In the 13.06 Optimization pack patch I mentioned we give each task an initial load profile, but we chose to make it fully loaded in the beginning. However, just giving a task an initial load profile is not the whole story. In the kernels we have tested, the CPU that a task runs on in its first few slices is largely governed by the location of the parent task and the overall system load (it can be balanced elsewhere within its cluster, but only if unbalanced), so energy consumption and the performance in the first few schedules can be unpredictable.
So you also noticed that if kernel tasks are given a full initial load, there is a power downgrade?
Yes, but the cause is more subtle. It is common that a kernel task will use a workqueue and/or a timer to do something which is queued on the CPU they ran on at the time. This is usually set up early in the task life. When the task migrates because its load does not justify using one of the faster CPUs, these resources do not migrate with it. The best engineering solution is to modify all the APIs and users, but that would likely be a more invasive patchset than even the big.LITTLE MP patches so for existing kernels we feel that the pragmatic solution is to not give kernel tasks the boost.
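To illustrate the pattern (made-up names, not from any real driver): a kernel thread that arms deferred work early in its life leaves that work behind when HMP later migrates it.

    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/workqueue.h>
    #include <linux/timer.h>
    #include <linux/jiffies.h>

    struct my_dev {
            struct work_struct refresh_work;
            struct timer_list  poll_timer;
    };

    static void my_dev_refresh(struct work_struct *work)
    {
            /* periodic housekeeping done on behalf of the thread below */
    }

    static void my_dev_poll(unsigned long data)
    {
            /* fires on whichever CPU's timer base the timer was queued on */
    }

    static int my_dev_thread(void *data)
    {
            struct my_dev *dev = data;

            /* schedule_work() queues on the CPU this thread is running on
             * right now, and mod_timer() typically arms the timer on the
             * local CPU.  If the thread is later migrated to a little CPU
             * because its own load is small, the work and the timer keep
             * running where they were set up, possibly on a big CPU. */
            INIT_WORK(&dev->refresh_work, my_dev_refresh);
            schedule_work(&dev->refresh_work);
            setup_timer(&dev->poll_timer, my_dev_poll, (unsigned long)dev);
            mod_timer(&dev->poll_timer, jiffies + HZ);

            while (!kthread_should_stop())
                    schedule_timeout_interruptible(HZ);
            return 0;
    }

This is why, even after the thread itself settles on a little CPU, its periodic work can keep waking a big CPU.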
The power cost from locating a low-load task temporarily on a big CPU is tiny, the task will be located on the correct CPU in a very short time if the load does not justify the big CPU. It is the continued periodic execution on behalf of a low-load task which uses the power. If the power is necessary, that would be OK but the compute requirement for these items can be easily satisfied on the little CPUs.
Agree! We should avoid the continuous execution of low-priority tasks on the big cluster.
When backporting, we tend to include all the needed patches, which includes Alex's, as his are already in mainline. So is your opinion to keep all of Alex's patches out of the HMP series, or just the initial load boost one?
As I haven't investigated the behaviour with both sets present I can't say for sure, and I certainly don't have an opinion strong enough to recommend removal :) The addition of the patches in the optimisation pack might counteract the issue you see, but since they are working in the same area, getting the integration just right might be tricky.
The initial task loads and fork placement behaviour in the optimization pack have been tested together, so we're reasonably confident that they work properly in 13.06. Using Alex's initial load values will require some investigation to be sure that the resulting behaviour is desirable, but I'd prefer to update the big.LITTLE MP patchset than start reverting patches from mainline if we can get the behaviour correct.
After rechecking Alex's patch set, it seems only his initial load setting strongly conflicts with the HMP settings, while the other fixes he brought are indeed enhancements to per-entity load tracking.
So I finally chose to revert that patch only. :)
Thanks for your great help. Actually, we have locally performed some tests based directly on Linaro 13.06. Would you mind checking whether they match yours?
--Chris
Thanks, Lei