HMP Idle Pull Bug 1301886

1 May 2014


Hi Tixy,
In respect of the idle pull issue 
(https://bugs.launchpad.net/linaro-stable-kernel/+bug/1301886) I did a 
bit of root-cause digging.
I'm not sure why I didn't see this in our testing because as you say 
it's pretty easy to trigger. I can trigger it either with hotplug 
(sometimes even twice per unplug) or by starting IKS mode. Either our 
automated test doesn't do any hotplug testing or we somehow missed 
recognising the failure condition. I know we did not run an IKS test on 
the revalidated version, but I would expect us to have some hotplug 
tests in the MP functional testing.
Basil, can you look into that please?
Ultimately, what happens is that the scheduler will often run __schedule 
on a CPU which is in the process of being shut down. It's probably too 
costly to try to compute when it shouldn't run, hence the scheduler 
makes it safe to run on a mostly-offline CPU.
When there is nothing to schedule in, idle_balance is executed (and 
hence hmp_idle_pull). This happens after the relevant rq has been marked 
as offline and the sched domains have been rebuilt, but before the tasks 
are migrated away.
Vincent refers to this in a paper he wrote back in 2012 about hotplug as 
zombie CPUs 
(http://www.rdrop.com/users/paulmck/realtime/paper/hotplug-ecrts.2012.06.11a.... 
- in section 2.3).
It seems to me that we really should not be doing anything in 
idle_balance in this situation, and the existing code accomplishes that 
because the sched_domains already reflect the new world at that point in 
time. The HMP patch doesn't really care about sched_domains, which is 
where the problem comes in.
It's trivial to add a check to abort idle_balance altogether if the rq 
is offline, but perhaps nobody has added it because the work only takes 
a small amount of time on a CPU which is about to turn off, whereas the 
conditional would have to be evaluated in every idle balance.
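To make that trade-off concrete, here is a minimal sketch of the kind of
early-exit check meant above. It is a toy model, not kernel code: struct
toy_rq, toy_hmp_idle_pull and toy_idle_balance are hypothetical stand-ins
for the real runqueue, hmp_idle_pull and idle_balance, and the only thing
assumed about the real rq is that it carries an "online" flag which is
cleared early during unplug.

```c
#include <stdbool.h>

/* Toy stand-in for the per-CPU runqueue; only the fields needed here. */
struct toy_rq {
    int cpu;
    bool online;  /* assumed to be cleared early in CPU unplug */
    int pulled;   /* counts how many pulls were attempted */
};

/* Hypothetical stand-in for hmp_idle_pull(): just records the attempt. */
static void toy_hmp_idle_pull(struct toy_rq *rq)
{
    rq->pulled++;
}

/*
 * Sketch of idle_balance with an early exit for an offline runqueue.
 * Returns true if balancing was attempted, false if it was skipped.
 * Note the check is paid on every call, which is the cost discussed above.
 */
static bool toy_idle_balance(struct toy_rq *rq)
{
    if (!rq->online)
        return false;  /* zombie CPU: do not pull work onto it */

    toy_hmp_idle_pull(rq);
    return true;
}
```

With this shape, an idle balance on a CPU that is going down becomes a
single branch, at the price of one extra test in the common online case.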
Changing the BUG to a simple return NULL and fixing up the callers for 
this case as you did in your testing patch is functionally correct and 
safe. The question for me is if we should bother to try to optimize 
idle_pull behavior during cpu_down - I'm open to opinions :)
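For reference, the "return NULL and fix up the callers" approach has
roughly the following shape. This is only an illustration under stated
assumptions: the names (toy_hmp_domain, toy_lookup_domain, toy_idle_pull,
toy_cpu_down) and the per-CPU table are hypothetical and are not taken
from the HMP patch or from the testing patch; the point is simply that
the lookup reports "no domain" instead of calling BUG(), and the caller
treats that as "nothing to do".

```c
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-in for an HMP domain (e.g. big or LITTLE cluster). */
struct toy_hmp_domain {
    int id;
};

static struct toy_hmp_domain toy_domains[2] = { { 0 }, { 1 } };

/* Per-CPU domain pointer; NULL once the CPU has left its domain. */
static struct toy_hmp_domain *toy_cpu_domain[TOY_NR_CPUS] = {
    &toy_domains[0], &toy_domains[0], &toy_domains[1], &toy_domains[1]
};

/* Lookup that returns NULL where a BUG() would otherwise have fired. */
static struct toy_hmp_domain *toy_lookup_domain(int cpu)
{
    if (cpu < 0 || cpu >= TOY_NR_CPUS)
        return NULL;
    return toy_cpu_domain[cpu];
}

/* Caller fixed up for the NULL case: 1 = pull attempted, 0 = skipped. */
static int toy_idle_pull(int cpu)
{
    struct toy_hmp_domain *dom = toy_lookup_domain(cpu);

    if (dom == NULL)
        return 0;  /* CPU is on its way down; silently do nothing */

    /* ... a real implementation would search for a task to pull here ... */
    return 1;
}

/* Models the point in unplug where the CPU drops out of its domain. */
static void toy_cpu_down(int cpu)
{
    if (cpu >= 0 && cpu < TOY_NR_CPUS)
        toy_cpu_domain[cpu] = NULL;
}
```

Every caller of the lookup needs the same NULL handling, which is why
the fix touches the call sites as well as the function itself.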
Best Regards,
Chris
