Hi Chris,
For the initial set of patches we ran the full set of tests, including the functional tests, and the hotplug tests there (part of the non-functional test suite) were passing. However, the reworked patches from review came too close to the deadline, so we focused on running only the workloads and benchmark runs in a15only, a7only and a7bc modes. The rest, including the functional tests, were excluded as we expected to respin these tests during the final QA week. The functional tests could have been run in parallel, though!
In hindsight, a poor decision :(
Thanks Basil Eljuse...
-----Original Message-----
From: Chris Redpath [mailto:Chris.Redpath@arm.com]
Sent: 01 May 2014 13:59
To: Jon Medhurst (Tixy); Vincent Guittot; Basil Eljuse
Cc: linaro-kernel; Mark Brown; Robin Randhawa; Morten Rasmussen; Dietmar Eggemann
Subject: HMP Idle Pull Bug 1301886
Hi Tixy,
In respect of the idle pull issue (https://bugs.launchpad.net/linaro-stable-kernel/+bug/1301886) I did a bit of root-cause digging.
I'm not sure why I didn't see this in our testing because as you say it's pretty easy to trigger. I can trigger it either with hotplug (sometimes even twice per unplug) or by starting IKS mode. Either our automated test doesn't do any hotplug testing or we somehow missed recognising the failure condition. I know we did not run an IKS test on the revalidated version, but I would expect us to have some hotplug tests in the MP functional testing.
Basil, can you look into that please?
Ultimately, what happens is that the scheduler will often run __schedule on a CPU which is in the process of being shut down. It's probably too costly to try to compute when it shouldn't run, so the scheduler instead makes it safe to run on a mostly-offline CPU.
When there is nothing left to schedule in, idle_balance is executed (and hence hmp_idle_pull). This happens after the relevant rq has been marked as offline and the sched domains have been rebuilt, but before the tasks have been migrated away.
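To restate that ordering as a sequence (this is just the description above written out; I haven't traced every callback name, so treat it as indicative only):

    /*
     * cpu_down() in progress on CPU n (ordering as described above,
     * names indicative only):
     *   1. The rq of CPU n is marked offline and the sched_domains are
     *      rebuilt without it.
     *   2. CPU n still executes __schedule(); with nothing runnable it
     *      calls idle_balance() and, via the HMP patch, hmp_idle_pull().
     *   3. Only afterwards are the remaining tasks migrated off CPU n.
     * Step 2 is the window where we end up pulling work onto a dying CPU.
     */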
Vincent refers to these as zombie CPUs in a paper he wrote back in 2012 about hotplug ( http://www.rdrop.com/users/paulmck/realtime/paper/hotplug-ecrts.2012.06.11a.... - see section 2.3).
It seems to me that we really shouldn't be doing anything in idle_balance in this situation, and the existing code accomplishes that because the sched_domains already reflect the new world at that point in time. The HMP patch doesn't really care about sched_domains, which is where the problem comes in.
It's trivial to add a check that aborts idle_balance altogether if the rq is offline, but perhaps nobody has added one because the current behaviour only costs a small amount of time on a CPU which is about to turn off, whereas the conditional would have to be evaluated in every idle balance.
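To make that concrete, the early-out I have in mind would look roughly like the sketch below. This is based on my recollection of the 3.x idle_balance() in kernel/sched/fair.c plus the HMP tail call; the surrounding code is elided, and whether to test this_rq->online or cpu_active(this_cpu) is an open choice rather than a recommendation:

    void idle_balance(int this_cpu, struct rq *this_rq)
    {
    	/*
    	 * Sketch only: bail out on a CPU that is on its way down, so that
    	 * neither the normal newly-idle balancing nor hmp_idle_pull() runs
    	 * on a zombie rq.
    	 */
    	if (!this_rq->online)
    		return;

    	/* ... existing newly-idle balancing over for_each_domain() ... */

    	/* HMP addition at the tail of idle_balance() (placement indicative): */
    	hmp_idle_pull(this_cpu);
    }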
Changing the BUG to a simple return NULL, and fixing up the callers for this case as you did in your testing patch, is functionally correct and safe. The question for me is whether we should bother trying to optimize idle_pull behavior during cpu_down - I'm open to opinions :)
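For reference, the shape of that change as I read it, purely as illustration - the helper name below is a stand-in, not the real identifier in the HMP patch:

    /*
     * Illustrative only: stands in for the HMP helper that currently hits
     * BUG() when it finds nothing usable on a zombie CPU.
     */
    static struct sched_entity *hmp_pick_task(struct rq *rq)
    {
    	struct sched_entity *se;

    	if (!rq->cfs.h_nr_running)
    		return NULL;		/* was: BUG(); */

    	/* ... pick the candidate entity as before ... */
    	return se;
    }

    /* Caller side: tolerate the NULL instead of assuming success. */
    se = hmp_pick_task(rq);
    if (!se)
    	goto out_unlock;		/* nothing to pull, give up quietly */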
Best Regards,
Chris