Good day Jon,
Please add the included patch to your tree. It fixes [1].
Thanks, Mathieu.
[1]. https://bugs.launchpad.net/linaro-big-little-system/+bug/1097213
-------- Original Message --------
Subject: Re: Update on LP1097213
Date: Mon, 17 Jun 2013 16:31:47 +0100
From: Morten Rasmussen morten.rasmussen@arm.com
To: Mathieu Poirier mathieu.poirier@linaro.org
CC: Vincent Guittot vincent.guittot@linaro.org, Serge Broslavsky serge.broslavsky@linaro.org, Amit Kucheria amit.kucheria@linaro.org, Nicolas Pitre nicolas.pitre@linaro.org, Naresh Kamboju naresh.kamboju@linaro.org
Hi Mathieu,
I had a quick look at the hmp_next_{up,down}_delay() stuff. It is all introduced in the patch: "sched: SCHED_HMP multi-domain task migration control". Reverting it requires some manual conflict fixing and you will also need to remove the extra hmp_next_down_delay() added by a later patch.
I've attached a revert patch for debugging purposes that should do it all.
I'm not sure if this will just remove the symptom or if the sched_clock accesses are the true cause of the problem.
I hope it helps, Morten
On 17/06/13 14:26, Vincent Guittot wrote:
Mathieu,
Please find below the mail we discussed during the call.
Vincent
On 14 June 2013 15:21, Vincent Guittot vincent.guittot@linaro.org wrote:
On 14 June 2013 15:14, Vincent Guittot vincent.guittot@linaro.org wrote:
On 14 June 2013 14:39, Mathieu Poirier mathieu.poirier@linaro.org wrote:
Anything on this ?!? Morten, Vincent ?
Hi Mathieu,
I hadn't noticed, the first time I read your email, that the problem can also be reproduced on a Snowball. Does that mean the HMP-specific functions are also called on SMP systems?
I'm going to look more deeply into the code.
for_each_online_cpu is used in hmp_force_up_migration but it's not protected against hotplug, so it can use a CPU that is about to be unplugged.
We should probably protect the sequence with get_online_cpus()/put_online_cpus().
Vincent
Vincent
On 13-06-12 03:13 PM, Mathieu Poirier wrote:
Good day gents,
I have been working on [1] for a while now, on and off as time permitted. The problem has always been very elusive but definitely present. As some of the notes in the bug report indicate, TC2 wasn't the only ARM system I could reproduce this on - Snowball suffered from the exact same problem.
I started looking at this again for 3.10 and I have good and bad news.
The good news is that I can't reproduce the problem anymore if CONFIG_SCHED_HMP is not enabled. I ran the attached script for more than 16 hours without even the hint of a problem. Normally one would get a crash [2] in less than a minute. I won't go so far as claiming that upstream solved the problem. Maybe we are lucky and timing in 3.10 simply doesn't allow for the fault to occur. In any case, all we can do is continue monitoring the situation in upcoming versions.
On the flip side we have a definite problem with hotplug when CONFIG_SCHED_HMP is defined. The crash in [2] is consistent and can be reproduced at will. Looking at the trace, the problem happens in 'select_task_rq_fair', where calls to 'hmp_next_up_delay' and 'hmp_next_down_delay' end up referencing 'cfs_rq_clock_task' with cfs_rq->rq pointing to a bogus address.
Have a look at line 9 in [2] - this is a little bit of instrumentation I started working on. It basically outputs the new and previous CPUs in the 'hmp_[up,down]_migration' conditional statements along with the direction of the migration [3]. In every instance the system was going from the A15 to the A7 cluster. I haven't found a single instance where the opposite was true.
Since this is directly related to our efforts to make the scheduler power aware, and based on Ingo's latest rebuttal, I am not sure that it is wise for me to continue working on this - specifically if we end up scrapping that portion of the code. I'm eager to hear your opinion.
On the flip side it highlights (once again) that we need to invest massively in the hotplug subsystem, more specifically in its relation to the scheduler and the RCU subsystem.
Mathieu.
PS. I have purposely kept the audience to a minimum - forward as you see fit.
[1]. https://bugs.launchpad.net/linaro-big-little-system/+bug/1188778
[2]. https://pastebin.linaro.org/view/0751c84b
[3]. https://pastebin.linaro.org/view/4491ee27