Re: deadlock in HMP

26 Jun 2014

Hi Alex,
On 26 June 2014 18:27, Alex Shi alex.shi@linaro.org wrote:
...
On 06/26/2014 06:13 PM, Chris Redpath wrote:> On 26/06/14 10:37, Alex Shi wrote:
...
...
Thanks for testing, Naresh!
于 6/26/14, 17:34, Naresh Kamboju 写道:
...
...
...
After applying above patch. Boot tested on vexpress-tc2 and noticed
that this deadlock bug stopped reporting.
Boot logs attached:

deadlock-on-hmp-with-alex-fix-patch-on-tc2.log
deadlock-on-hmp-without-alex-fix-patch-on-tc2.log

Bug updated:
https://bugs.launchpad.net/linaro-stable-kernel/+bug/1330392
Yes, the warning goes away but what you've done is to just ignore
locking this resource. It's not guaranteed by anything in CPUidle that
the driver won't be changed while we are doing this. On all our
platforms, the cpuidle driver is built-in rather than loaded as a module
so it's extremely unlikely to happen (I think it's actually impossible)
but I don't think we should write kernel code that works due to
circumstance - we should be honoring the contract for accessing this
resource. We didn't have the warning checks working properly in our
integrated test and I didn't see this before but the root cause is that
I should not be taking that lock in potentially re-entrant code without
guarding it.
To properly fix this specific implementation will probably require
hacking a new API into the cpuidle framework (something like
try_cpuidle_driver_ref) which will make future backporting a wee bit
more complicated but not a lot. The patch would have to be added to the
big.LITTLE MP patch set.
Perhaps we could also add a lock into the timer hack which we can
trylock before attempting to access the cpuidle driver, this would at
least stop us deadlocking ourselves.
An alternative might be to re-engineer the solution to provide a
low-resume-latency request interface which the scheduler can use instead
of a timer hack to keep it alive - the existing QoS interface isn't
suitable because it's global and wakes all the CPUs up to get the new
latency requirement in effect. That would not strictly be a big.LITTLE
feature, but as its the only client it would likely need to be carried
around in the same patch set.
There is already a suitable state disable API in cpuidle, but
determining which states to disable still requires reading the exit
latency which is only available in this driver information.
In summary, I don't like this fix although I accept that it works and
it'll continue to work for TC2 and any other platform where the idle
driver cannot be changed at runtime.
What do you think?
In today's 1:1, I just talked with Mark, it is not a upstream quality fix
for this issue. and better solution would ask more cpuidle part coordinate. :)
So, I thought it maybe accepted for a quick fix before a final solution
coming out. So I added more comments in commit log and remove the
'cpuidle_driver_unref()' which omitted in v1 patch.
below patch applied and tested. And noticed that, did not found
deadlock and warning.
Boot log attached and pasted.
https://pastebin.linaro.org/view/56e64bde
-Naresh Kamboju
...
==============
From 49cdbf3b800933bfb25a49ef1cbad918a072574e Mon Sep 17 00:00:00 2001
From: Alex Shi alex.shi@linaro.org
Date: Thu, 19 Jun 2014 13:47:35 +0800
Subject: [PATCH] HMP: use per cpu cpuidle driver to fix deadlock in
 hmp_idle_pull
Using per cpu cpuidle driver to fix deadlock in hmp_idle_pull.
Otherwise a deadlock happened when do bl_idle_init.
[  113.878664] other info that might help us debug this:
[  113.878667]  Possible unsafe locking scenario:
[  113.878667]
[  113.878670]        CPU0
[  113.878673]        ----
[  113.878681]   lock(cpuidle_driver_lock);
[  113.878684]   <Interrupt>
[  113.878691]     lock(cpuidle_driver_lock);
[  113.878693]
[  113.878693]  *** DEADLOCK ***
[  113.878693]
[  113.878697] 1 lock held by ksoftirqd/4/28:
[  113.878719]  #0:  (hmp_force_migration){+.....}, at: [<c0054da5>]
hmp_idle_pull+0x49/0x508
This patch is just a quick/cheap workaround for cpuidle_driver_lock
deadlock. It works for TC2 and any other platform where the idle
driver cannot be changed at runtime.
Signed-off-by: Alex Shi alex.shi@linaro.org
kernel/sched/fair.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 71da724..0269273 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3587,10 +3587,12 @@ static enum hrtimer_restart hmp_cpu_keepalive_notify(struct hrtimer *hrtimer)

If there are any, set ns_delay to
('target_residency of state with shortest too-big latency' - 1) * 1000.

*/
-static void hmp_keepalive_delay(unsigned int *ns_delay)
+static void hmp_keepalive_delay(int cpu, unsigned int *ns_delay)
 {

  struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);
  struct cpuidle_driver *drv;




  drv = cpuidle_driver_ref();





  drv = cpuidle_get_cpu_driver(dev);
  if (drv) {
          unsigned int us_delay = UINT_MAX;
          unsigned int us_max_delay = *ns_delay / 1000;



@@ -3609,7 +3611,6 @@ static void hmp_keepalive_delay(unsigned int *ns_delay)
                else
                        *ns_delay = 1000 * (us_delay - 1);
        }

  cpuidle_driver_unref();



}
static void hmp_cpu_keepalive_trigger(void)
@@ -3623,7 +3624,7 @@ static void hmp_cpu_keepalive_trigger(void)
                                CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
                keepalive->timer.function = hmp_cpu_keepalive_notify;

          hmp_keepalive_delay(&ns_delay);




          hmp_keepalive_delay(cpu, &ns_delay);
          keepalive->delay = ns_to_ktime(ns_delay);
          keepalive->init = true;
  }



--
1.9.1
--
Thanks
    Alex

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: deadlock in HMP

Signed-off-by: Alex Shi alex.shi@linaro.org