Hi Chris,
For the initial set of patches we ran the full set of tests, including the functional tests, and the hotplug tests there (part of the non-functional test suite) were passing. However, the reworked patches from review came too close to the deadline, so we focused on running only the workloads and benchmark runs in a15only, a7only and a7bc modes. The rest, including the functional tests, were excluded as we expected to respin these tests during the final QA week. The functional tests could have been run in parallel, though!
In hindsight, a poor decision :(
Thanks Basil Eljuse...
-----Original Message-----
From: Chris Redpath [mailto:Chris.Redpath@arm.com]
Sent: 01 May 2014 13:59
To: Jon Medhurst (Tixy); Vincent Guittot; Basil Eljuse
Cc: linaro-kernel; Mark Brown; Robin Randhawa; Morten Rasmussen; Dietmar Eggemann
Subject: HMP Idle Pull Bug 1301886
Hi Tixy,
In respect of the idle pull issue (https://bugs.launchpad.net/linaro-stable-kernel/+bug/1301886) I did a bit of root-cause digging.
I'm not sure why I didn't see this in our testing because as you say it's pretty easy to trigger. I can trigger it either with hotplug (sometimes even twice per unplug) or by starting IKS mode. Either our automated test doesn't do any hotplug testing or we somehow missed recognising the failure condition. I know we did not run an IKS test on the revalidated version, but I would expect us to have some hotplug tests in the MP functional testing.
Basil, can you look into that please?
Ultimately, what happens is that the scheduler will often run __schedule on a CPU which is in the process of being shut down. It's probably too costly to try to compute when it shouldn't run, so the scheduler instead makes it safe to run on a mostly-offline CPU.
When there is nothing left to schedule in, idle_balance is executed (and hence hmp_idle_pull). This happens after the relevant rq has been marked as offline and the sched domains have been rebuilt, but before the tasks have been migrated away.
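To restate that ordering as a sequence (this is just the description above written out; I haven't traced every callback name, so treat it as indicative only):

    /*
     * cpu_down() in progress on CPU n (ordering as described above,
     * names indicative only):
     *   1. The rq of CPU n is marked offline and the sched_domains are
     *      rebuilt without it.
     *   2. CPU n still executes __schedule(); with nothing runnable it
     *      calls idle_balance() and, via the HMP patch, hmp_idle_pull().
     *   3. Only afterwards are the remaining tasks migrated off CPU n.
     * Step 2 is the window where we end up pulling work onto a dying CPU.
     */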
Vincent refers to these as zombie CPUs in a paper he wrote back in 2012 about hotplug ( http://www.rdrop.com/users/paulmck/realtime/paper/hotplug-ecrts.2012.06.11a.... - see section 2.3).
It seems to me that we really shouldn't be doing anything in idle_balance in this situation, and the existing code accomplishes that because the sched_domains already reflect the new world at that point in time. The HMP patch doesn't really care about sched_domains, which is where the problem comes in.
It's trivial to add a check that aborts idle_balance altogether if the rq is offline, but perhaps nobody has added one because the current behaviour only costs a small amount of time on a CPU which is about to turn off, whereas the conditional would have to be evaluated in every idle balance.
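To make that concrete, the early-out I have in mind would look roughly like the sketch below. This is based on my recollection of the 3.x idle_balance() in kernel/sched/fair.c plus the HMP tail call; the surrounding code is elided, and whether to test this_rq->online or cpu_active(this_cpu) is an open choice rather than a recommendation:

    void idle_balance(int this_cpu, struct rq *this_rq)
    {
    	/*
    	 * Sketch only: bail out on a CPU that is on its way down, so that
    	 * neither the normal newly-idle balancing nor hmp_idle_pull() runs
    	 * on a zombie rq.
    	 */
    	if (!this_rq->online)
    		return;

    	/* ... existing newly-idle balancing over for_each_domain() ... */

    	/* HMP addition at the tail of idle_balance() (placement indicative): */
    	hmp_idle_pull(this_cpu);
    }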
Changing the BUG to a simple return NULL, and fixing up the callers for this case as you did in your testing patch, is functionally correct and safe. The question for me is whether we should bother trying to optimize idle_pull behavior during cpu_down - I'm open to opinions :)
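For reference, the shape of that change as I read it, purely as illustration - the helper name below is a stand-in, not the real identifier in the HMP patch:

    /*
     * Illustrative only: stands in for the HMP helper that currently hits
     * BUG() when it finds nothing usable on a zombie CPU.
     */
    static struct sched_entity *hmp_pick_task(struct rq *rq)
    {
    	struct sched_entity *se;

    	if (!rq->cfs.h_nr_running)
    		return NULL;		/* was: BUG(); */

    	/* ... pick the candidate entity as before ... */
    	return se;
    }

    /* Caller side: tolerate the NULL instead of assuming success. */
    se = hmp_pick_task(rq);
    if (!se)
    	goto out_unlock;		/* nothing to pull, give up quietly */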
Best Regards,
Chris