Hi Tixy,
In respect of the idle pull issue (https://bugs.launchpad.net/linaro-stable-kernel/+bug/1301886) I did a bit of root-cause digging.
I'm not sure why I didn't see this in our testing because as you say it's pretty easy to trigger. I can trigger it either with hotplug (sometimes even twice per unplug) or by starting IKS mode. Either our automated test doesn't do any hotplug testing or we somehow missed recognising the failure condition. I know we did not run an IKS test on the revalidated version, but I would expect us to have some hotplug tests in the MP functional testing.
Basil, can you look into that please?
Ultimately, what happens is that the scheduler will often run __schedule on a CPU which is in the process of being shut down. It's probably too costly to try to compute when it shouldn't run, hence the scheduler makes it safe to run on a mostly-offline CPU.
When there is nothing to schedule in, idle_balance is executed (and hence hmp_idle_pull). This happens after the relevant rq has been marked as offline and the sched domains have been rebuilt, but before the tasks are migrated away.
In a paper on hotplug that he wrote back in 2012, Vincent refers to CPUs in this state as zombie CPUs ( http://www.rdrop.com/users/paulmck/realtime/paper/hotplug-ecrts.2012.06.11a.... - in section 2.3).
It seems to me that we really should not be doing anything in idle_balance in this situation, and the existing code accomplishes that because the sched_domains already reflect the new world at that point in time. The HMP patch doesn't really care about sched_domains, which is where the problem comes in.
It's trivial to add a check to abort idle_balance altogether if the rq is offline, but perhaps nobody has added it because the work only takes a small amount of time on a CPU which is about to turn off, whereas the conditional would have to be evaluated on every idle balance, including all the normal ones.
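To make the shape of that check concrete, here is a minimal userspace sketch, not actual kernel code. The struct below is a cut-down stand-in for the kernel's struct rq, and mock_idle_balance and mock_hmp_idle_pull are hypothetical names used purely for illustration; the real functions take different arguments and do far more work.

```c
#include <stdbool.h>

/* Cut-down stand-in for the kernel's struct rq (assumption: only the
 * fields this sketch needs are modelled). */
struct rq {
    int cpu;
    bool online;   /* cleared once the rq has been marked offline */
    int pulled;    /* counts how often the mock pull actually ran */
};

/* Hypothetical stand-in for the HMP pull step; the real one would look
 * for a task to migrate. Here it only records that it was called. */
static int mock_hmp_idle_pull(struct rq *this_rq)
{
    this_rq->pulled++;
    return 1;
}

/* Hypothetical stand-in for idle_balance: bail out early when the rq is
 * already offline, so no pull is attempted on a CPU that is going down.
 * Returns the number of tasks pulled (0 when the balance was skipped). */
static int mock_idle_balance(struct rq *this_rq)
{
    if (!this_rq->online)
        return 0;

    return mock_hmp_idle_pull(this_rq);
}
```

The cost mentioned above is visible here: the online test sits on the common path and is evaluated on every call, even though it only ever fires during cpu_down.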
Changing the BUG to a simple return NULL and fixing up the callers for this case as you did in your testing patch is functionally correct and safe. The question for me is whether we should bother to try to optimize idle_pull behavior during cpu_down - I'm open to opinions :)
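For anyone following along without the testing patch to hand, the general pattern looks roughly like the sketch below. This does not reproduce the actual HMP functions: lookup_domain, idle_pull_sketch, struct domain_sketch and the CPU-to-domain table are all hypothetical names made up for illustration, on the assumption that the lookup can legitimately find nothing for a CPU that is being removed.

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical stand-in for a per-cluster domain structure. */
struct domain_sketch {
    int id;
};

static struct domain_sketch domains[2] = { { 0 }, { 1 } };

/* Hypothetical CPU-to-domain map. CPU 3 has no entry, which models a
 * CPU that has already been dropped from its domain during cpu_down. */
static struct domain_sketch *cpu_domain_map[NR_CPUS_SKETCH] = {
    &domains[0], &domains[0], &domains[1], NULL
};

/* Lookup that reports "no domain" by returning NULL, where an earlier
 * version would have hit BUG(). */
static struct domain_sketch *lookup_domain(int cpu)
{
    if (cpu < 0 || cpu >= NR_CPUS_SKETCH)
        return NULL;

    return cpu_domain_map[cpu];
}

/* Caller fixed up for the NULL case: with no domain there is nothing to
 * pull from, so it returns 0 instead of dereferencing the result. */
static int idle_pull_sketch(int cpu)
{
    struct domain_sketch *d = lookup_domain(cpu);

    if (!d)
        return 0;

    return 1; /* the real code would go on to search d for a task */
}
```

Every caller of the lookup needs the same NULL handling, which is the "fixing up the callers" part.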
Best Regards,
Chris