Hi Chris,
On Tue, Jul 30, 2013 at 5:09 PM, Chris Redpath chris.redpath@arm.com wrote:
Hi Lei,
On 30/07/13 02:57, Lei Wen wrote:
Chris,
On Mon, Jul 29, 2013 at 11:36 PM, Chris Redpath chris.redpath@arm.com wrote:
Hi Lei,
Thanks for looking at these patches. Have you seen the additional optimization patch set that was released to go on-top of 13.06? It's linked from the 13.06 release notes at http://releases.linaro.org/13.06/android/vexpress . The additional patches contained there will be part of the 13.07 release. Please note however that we're releasing these patches against a 3.10-based kernel release and we have not attempted to port or test them on 3.11.
I see. What I have backported are only those listed in the 13.06, no additional optimization patch is involved.
OK. The optimization pack provides improved performance for some common workloads, you should take a look if you have time.
Got it. Thanks!
That said, I will attempt to answer your questions inline.
Thanks for your comments!
On 29/07/13 06:51, Lei Wen wrote:
Hi list,
I am trying to port the Linaro MP patches to other kernel branches, and I find some regression caused by a recent kernel change.
The regression is this: when we test the Linaro 13.06 release with an MP3 playback scenario, we find the CA7 cluster is always busy, while one core of the CA15 cluster is occasionally raised up with an average cpu usage of about 2%, and the other core is kept silent most of the time.
But when we switch to our backported branch, we find both cores in CA15 become active, each keeping a usage ratio around 2%.
With further checking, I find one recent merged patch in mainline cause this: https://lkml.org/lkml/2013/6/27/152
This patch mainly changes the initial load for a new-born task to the largest value, so it causes HMP to decide to move all such tasks to the big cluster. Since cpu3 (the first cpu of CA15) gets busy now, cpu4 is raised up to share the load as a consequence. So ALL 5 cpus are used in the MP3 scenario, which should make the power result look worse than before...
What is your opinion on this regression, especially on how HMP should act with the increased initial load brought in by the merged patch?
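For reference, what the change does (quoting the mainline helper from memory, so treat the details as approximate) is give a forked task a full load history of one scheduling slice, i.e. it is born looking 100% busy:

    /* in kernel/sched/fair.c: a new task starts with runnable_avg_sum equal
     * to runnable_avg_period, so its initial load ratio is the maximum */
    void init_task_runnable_average(struct task_struct *p)
    {
            u32 slice;

            p->se.avg.decay_count = 0;
            slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
            p->se.avg.runnable_avg_sum = slice;
            p->se.avg.runnable_avg_period = slice;
            __update_task_entity_contrib(&p->se);
    }

Since HMP up-migration compares that load ratio against its threshold, every new-born task immediately qualifies for the big cluster.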
In general, the changes introduced in Alex Shi's patch set will probably conflict a bit with big.LITTLE MP patches. Our patches and his both use the same load tracking metrics and having the two patch sets present will probably have some subtle interactions even though on the surface they look relatively benign. We don't intend to work on forward-porting to 3.11 right now since the Linaro LSK is going to stay on 3.10 and we and Linaro are supporting the patch set on that release. Even if you were to remove all the patches from 141965c7 to 2509940f (Alex's switch to use load-tracking) before integrating big.LITTLE MP, I can't guarantee that there hasn't been a behavioural change elsewhere which won't increase power consumption or damage performance since we haven't tested it.
I see.
Specifically about the initial task load profile, we agree with Alex that tasks probably should not always start with zero load for performance reasons and have done something similar in the patch titled "HMP: Force new non-kernel tasks onto big CPUs until load stabilises" in the 13.06 Optimization pack. The aim of that patch is to provide application tasks with access to the maximum compute capacity as soon as possible, and to keep all other tasks migrating between big and little clusters only on the basis of need. It's a kind of initial performance boost behaviour for user application threads, after a couple of sched_periods the load profile will reasonably accurately represent the history and task placement will happen according to need as usual.
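Very roughly, the idea looks like this (a simplified sketch of the concept, not the actual patch; hmp_fork_boost_active and fork_boost_until are hypothetical names, the real code tracks how much load history the entity has accumulated):

    /* Sketch only: steer freshly-forked user tasks (p->mm != NULL) towards
     * a big CPU until their per-entity load average has had a couple of
     * sched_periods to settle; kernel threads are never boosted. */
    static inline bool hmp_fork_boost_active(struct task_struct *p, u64 now)
    {
            if (!p->mm)             /* kernel thread: normal placement */
                    return false;
            /* fork_boost_until is a hypothetical per-entity timestamp */
            return time_before64(now, p->se.avg.fork_boost_until);
    }

While this returns true the fork/wakeup path picks a CPU from the big cluster; afterwards the normal HMP up/down-migration thresholds take over.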
It's interesting that the effect is long-lasting enough to use both A15s - are there tasks being created all the time which are exercising the big CPUs? Alex only adds a small amount of busy time to the accounting, so after 10-20ms its impact should be very small. From the behaviour you describe, I feel that something else is happening to interfere with task placement which is made more visible by the patch you point to. The only way for you to be sure of the cause is to investigate the behaviour - which tasks are resident on the big CPUs, for how long, and what does their load profile look like.
For investigating those kinds of issues, I generally use Streamline or Kernelshark.
Actually, I have used ftrace to try to figure out what is happening. From the trace result, what I can see is that the audio track handling work is done on the A7 cluster, while on the A15 cluster the gatord application occasionally gets run, which is where the 2% usage ratio comes from.
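(For anyone reproducing this: the placement is visible with the standard sched tracepoints via debugfs, roughly

    cd /sys/kernel/debug/tracing
    echo 1 > events/sched/sched_switch/enable
    echo 1 > events/sched/sched_migrate_task/enable
    echo 1 > tracing_on
    # exercise the MP3 scenario for a while, then:
    cat trace > /data/local/mp3-trace.txt

sched_switch shows which tasks are resident on cpu3/cpu4 and for how long, and sched_migrate_task shows when gatord or the kworker moves between clusters.)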
With Alex's patch added, I see an additional kworker migrated to the A15. What I captured for this kworker is __call_usermodehelper, and its frequency is the same as gatord gets executed, every 5s. So it seems to me that the additional workload brought by the kworker triggers the below logic in fair.c:

    if (affine_sd) {
            if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
                    prev_cpu = cpu;

            new_cpu = select_idle_sibling(p, prev_cpu);
            goto unlock;
    }
This moves where gatord runs from cpu3 to cpu4, as cpu4 is idle at that moment, which makes both cpus in CA15 runnable.
Yes, that is likely. Waking tasks will be woken on an idle CPU in the package they last ran on if the previous CPU is not idle at the time. This is normal scheduling behaviour :)
It looks to me like you have updated the kernel without updating the gator module, so the gator daemon doesn't start correctly and Android will attempt to restart the service every 5s. At least when using the VExpress filesystem (I haven't tried the others) Linaro adds a gatord service to the init scripts of Android, putting the gator daemon in /system/bin and the gator module into the initrd.
Yep, that is indeed what you point out.
You can do a few things about it.
1. Rebuild the module and the ramdisk each time you build the kernel.
2. Switch the gator module over to be built in.
3. Change the gatord service to a oneshot service (it won't restart if it fails; see the init.rc sketch below).
4. Disable the gatord service in Android.
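For option 3, the change is just the service stanza in the Android init script that declares gatord (the file name and exact options vary between filesystem builds, so treat this as a sketch):

    # init.rc fragment declaring the gator daemon
    service gatord /system/bin/gatord
        class main
        oneshot        # init will not respawn it when it exits

Option 4 is the same stanza with 'disabled' added (or removed entirely), so init never starts the service.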
I chose to disable the gatord service, and the CA15 cluster becomes quiet now. :)
Having the gator module built in is probably the easiest way to retain the functionality without the extra build steps. The VExpress kernel source from the ARM LT in Linaro has the gator module source integrated, so IIRC you just need to change the config.
When the gatord service can start correctly, this should go away. Obviously, there will be some impact at runtime if you're connected to it and pulling data, but if it's just listening it should not cause a problem.
In the 13.06 Optimization pack patch I mentioned we give each task an initial load profile, but we chose to make it fully loaded in the beginning. However, just giving a task an initial load profile is not the whole story. In the kernels we have tested, the CPU that a task runs on in its first few slices is largely governed by the location of the parent task and the overall system load (it can be balanced elsewhere within its cluster, but only if unbalanced), so energy consumption and the performance in the first few schedules can be unpredictable.
So you also noticed that if kernel tasks are given a full initial load, there is a power downgrade?
Yes, but the cause is more subtle. It is common that a kernel task will use a workqueue and/or a timer to do something which is queued on the CPU they ran on at the time. This is usually set up early in the task life. When the task migrates because its load does not justify using one of the faster CPUs, these resources do not migrate with it. The best engineering solution is to modify all the APIs and users, but that would likely be a more invasive patchset than even the big.LITTLE MP patches so for existing kernels we feel that the pragmatic solution is to not give kernel tasks the boost.
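To illustrate the pattern (made-up names, not from any real driver): a kernel thread that arms deferred work early in its life leaves that work behind when HMP later migrates it.

    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/workqueue.h>
    #include <linux/timer.h>
    #include <linux/jiffies.h>

    struct my_dev {
            struct work_struct refresh_work;
            struct timer_list  poll_timer;
    };

    static void my_dev_refresh(struct work_struct *work)
    {
            /* periodic housekeeping done on behalf of the thread below */
    }

    static void my_dev_poll(unsigned long data)
    {
            /* fires on whichever CPU's timer base the timer was queued on */
    }

    static int my_dev_thread(void *data)
    {
            struct my_dev *dev = data;

            /* schedule_work() queues on the CPU this thread is running on
             * right now, and mod_timer() typically arms the timer on the
             * local CPU.  If the thread is later migrated to a little CPU
             * because its own load is small, the work and the timer keep
             * running where they were set up, possibly on a big CPU. */
            INIT_WORK(&dev->refresh_work, my_dev_refresh);
            schedule_work(&dev->refresh_work);
            setup_timer(&dev->poll_timer, my_dev_poll, (unsigned long)dev);
            mod_timer(&dev->poll_timer, jiffies + HZ);

            while (!kthread_should_stop())
                    schedule_timeout_interruptible(HZ);
            return 0;
    }

This is why, even after the thread itself settles on a little CPU, its periodic work can keep waking a big CPU.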
The power cost from locating a low-load task temporarily on a big CPU is tiny, the task will be located on the correct CPU in a very short time if the load does not justify the big CPU. It is the continued periodic execution on behalf of a low-load task which uses the power. If the power is necessary, that would be OK but the compute requirement for these items can be easily satisfied on the little CPUs.
Agree! We should avoid the continuous execution of low-priority tasks on the big cluster.
When backporting, we tend to include all the needed patches, which includes Alex's, as his are already in mainline. So is your opinion to keep all of Alex's patches out of the HMP series, or just the initial load boost one?
As I haven't investigated the behaviour with both sets present I can't say for sure, and I certainly don't have an opinion strong enough to recommend removal :) The addition of the patches in the optimisation pack might counteract the issue you see, but since they are working in the same area, getting the integration just right might be tricky.
The initial task loads and fork placement behaviour in the optimization pack have been tested together, so we're reasonably confident that they work properly in 13.06. Using Alex's initial load values will require some investigation to be sure that the resulting behaviour is desirable, but I'd prefer to update the big.LITTLE MP patchset than start reverting patches from mainline if we can get the behaviour correct.
After rechecking Alex's patch set, it seems only his initial load setting strongly conflicts with the HMP settings, while the other fixes he brought are indeed enhancements to per-entity load tracking.
So I finally chose to revert that patch only. :)
Thanks for your great help. Actually, we have locally performed some tests based directly on Linaro 13.06. Would you mind checking whether they match yours?
--Chris
Thanks, Lei