Re: [Eas-dev] cpu hotplug & up-migrate will cause kernel panic

11 Jan 2018

Hi Quentin, Leo,
Thanks for your feedback and help!
Actually, even with only the two extension patches below applied, the issues
still happen.
* | dc626b2 sched: avoid pushing tasks to an offline CPU
* | 2da014c sched: Extend active balance to accept 'push_task' argument
Besides the crash which I mentioned in the first email, there is
another type of crash, shown below:
[ 2072.653091] c1 ------------[ cut here ]------------
    [ 2072.653133] c1 WARNING: CPU: 1 PID: 13 at kernel/fork.c:252
__put_task_struct+0x30/0x124()
    [ 2072.653173] c1 CPU: 1 PID: 13 Comm: migration/1 Tainted: G
  W  O    4.4.83-01066-g04c5403-dirty #17
    [ 2072.653215] c1 [<c011141c>] (unwind_backtrace) from
[<c010ced8>] (show_stack+0x20/0x24)
    [ 2072.653235] c1 [<c010ced8>] (show_stack) from [<c043d7f8>]
(dump_stack+0xa8/0xe0)
    [ 2072.653255] c1 [<c043d7f8>] (dump_stack) from [<c012be04>]
(warn_slowpath_common+0x98/0xc4)
    [ 2072.653273] c1 [<c012be04>] (warn_slowpath_common) from
[<c012beec>] (warn_slowpath_null+0x2c/0x34)
    [ 2072.653291] c1 [<c012beec>] (warn_slowpath_null) from
[<c01293b4>] (__put_task_struct+0x30/0x124)
    [ 2072.653310] c1 [<c01293b4>] (__put_task_struct) from
[<c0166964>] (active_load_balance_cpu_stop+0x22c/0x314)
    [ 2072.653331] c1 [<c0166964>] (active_load_balance_cpu_stop) from
[<c01c2604>] (cpu_stopper_thread+0x90/0x144)
    [ 2072.653352] c1 [<c01c2604>] (cpu_stopper_thread) from
[<c014d80c>] (smpboot_thread_fn+0x258/0x270)
    [ 2072.653370] c1 [<c014d80c>] (smpboot_thread_fn) from
[<c0149ee4>] (kthread+0x118/0x12c)
    [ 2072.653388] c1 [<c0149ee4>] (kthread) from [<c0108310>]
(ret_from_fork+0x14/0x24)
    [ 2072.653400] c1 ---[ end trace 49c3d154890763fc ]---
    [ 2072.653418] c1 Unable to handle kernel NULL pointer dereference
at virtual address 00000000
    ...
    [ 2072.832804] c1 [<c01ba00c>] (put_css_set) from [<c01be870>]
(cgroup_free+0x6c/0x78)
    [ 2072.832823] c1 [<c01be870>] (cgroup_free) from [<c01293f8>]
(__put_task_struct+0x74/0x124)
    [ 2072.832844] c1 [<c01293f8>] (__put_task_struct) from
[<c0166964>] (active_load_balance_cpu_stop+0x22c/0x314)
    [ 2072.832860] c1 [<c0166964>] (active_load_balance_cpu_stop) from
[<c01c2604>] (cpu_stopper_thread+0x90/0x144)
    [ 2072.832879] c1 [<c01c2604>] (cpu_stopper_thread) from
[<c014d80c>] (smpboot_thread_fn+0x258/0x270)
    [ 2072.832896] c1 [<c014d80c>] (smpboot_thread_fn) from
[<c0149ee4>] (kthread+0x118/0x12c)
    [ 2072.832914] c1 [<c0149ee4>] (kthread) from [<c0108310>]
(ret_from_fork+0x14/0x24)
    [ 2072.832930] c1 Code: f57ff05b f590f000 e3e02000 e3a03001 (e1941f9f)
    [ 2072.839208] c1 ---[ end trace 49c3d154890763fd ]---
For this crash, the root cause is that the push_task pointer is used
without being initialized on the out_lock path.
CPU hotplug in/out probably makes this happen more easily.
The crash I mentioned in the first email also seems to be gone after
applying the initialization patch.
The Gerrit change is below:
https://android-review.googlesource.com/c/kernel/common/+/586107
Please review, and thanks again.
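
To illustrate the push_task problem described above, here is a minimal
userspace sketch. It is not the kernel code and not the Gerrit patch: the
types struct task and struct rq_model, and the functions put_task() and
active_balance_stop_model(), are hypothetical stand-ins. It only shows why a
local pointer that is tested after an early-exit label has to be initialized
before the first goto.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: only a reference count. */
struct task {
    int usage;
};

/* Hypothetical stand-in for the runqueue state seen by the stopper. */
struct rq_model {
    int nr_running;          /* runnable tasks on the busiest runqueue */
    struct task *push_task;  /* task queued for an explicit push, or NULL */
};

/* Drop one reference; dropping more than was taken is a bug. */
static void put_task(struct task *p)
{
    assert(p->usage > 0);
    p->usage--;
}

/* Returns 1 if a task reference was dropped, 0 otherwise. */
int active_balance_stop_model(struct rq_model *rq)
{
    /*
     * Without "= NULL" the early exit below would test an indeterminate
     * pointer and could call put_task() on garbage.
     */
    struct task *push_task = NULL;
    int dropped = 0;

    if (rq->nr_running <= 1)
        goto out_lock;       /* early exit, push_task never assigned */

    push_task = rq->push_task;
    rq->push_task = NULL;

out_lock:
    if (push_task) {
        put_task(push_task);
        dropped = 1;
    }
    return dropped;
}
```

With the initialization in place, the early-exit case leaves the task's
reference count untouched, and the normal case drops exactly one reference.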
On 10 January 2018 at 19:39, Quentin Perret <quentin.perret@arm.com> wrote:
...
On Wednesday 10 Jan 2018 at 09:07:49 (+0800), Leo Yan wrote:
...
Hi Ke, Quentin,
On Tue, Jan 09, 2018 at 12:38:42PM +0000, Quentin Perret wrote:
...
Hi Ke,
Thank you very much for your feedback!
After a quick investigation I noticed that select_energy_cpu_brute() can
actually return an offline CPU in some corner cases (if prev_cpu is
offline for example). That was not an issue when select_energy_cpu_brute()
was called only from select_task_rq_fair() as select_task_rq() would
safely call select_fallback_rq() in that case. However, since:
9e293db sched: EAS: upmigrate misfit current task
select_energy_cpu_brute() is now called outside of the wakeup path and an
active load balance is triggered unconditionally on the CPU that was
selected, which might be offline. I'm not an expert in load balance but
I suspect this isn't the right thing to do. I'll investigate a little
bit more and try to come up with a fix if this is confirmed to be the
root cause.
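
The missing check described above could look roughly like the userspace
sketch below. This is only an illustration under stated assumptions, not a
proposed kernel patch: cpu_online_model(), checked_upmigrate_target() and the
bitmask representation of online CPUs are all hypothetical. The idea is that
the caller validates the selected CPU before triggering an active balance
towards it, instead of relying on a fallback that only exists in the wakeup
path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: four CPUs, matching the platform in the report. */
#define NR_CPUS_MODEL 4

/* Hypothetical online mask: bit n set means CPU n is online. */
static bool cpu_online_model(unsigned int online_mask, int cpu)
{
    return cpu >= 0 && cpu < NR_CPUS_MODEL && ((online_mask >> cpu) & 1u);
}

/*
 * Decide whether an active balance from src_cpu towards target_cpu
 * should be started. Returns target_cpu if it is usable, or -1 if the
 * up-migration should be skipped.
 */
int checked_upmigrate_target(unsigned int online_mask, int src_cpu,
                             int target_cpu)
{
    if (!cpu_online_model(online_mask, target_cpu))
        return -1;           /* selected CPU is offline or out of range */
    if (target_cpu == src_cpu)
        return -1;           /* nothing to migrate */
    return target_cpu;
}
```

For example, with online mask 0xB (CPUs 0, 1 and 3 online), a selection of
CPU 2 is rejected rather than handed to the balancer.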
Just a reminder of two things:

I saw Ke mentioned 'leaving EAS disabled'; in that case, should
select_energy_cpu_brute() be executed at all?

It is not executed in the wakeup path but the current implementation of
the upmigrate thing doesn't check energy_aware() before calling
select_energy_cpu_brute(). This is probably a bug as well ...
For this, I also pushed a patch, as below:
https://android-review.googlesource.com/c/kernel/common/+/586110
Could you please help review it? Thanks.
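
As a rough illustration of the missing energy_aware() check discussed above,
here is a small userspace sketch. It does not reproduce the patch in the
Gerrit link: struct sched_model, select_energy_cpu_model() and
upmigrate_target_model() are hypothetical names, and the selection itself is
a placeholder that just returns prev_cpu. It only shows the guard: when
energy-aware scheduling is off, the selection function is never called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scheduler state for the sketch. */
struct sched_model {
    bool energy_aware;  /* whether energy-aware scheduling is enabled */
    int calls;          /* how often the selection function ran */
};

/* Placeholder for the energy-based CPU selection. */
static int select_energy_cpu_model(struct sched_model *s, int prev_cpu)
{
    s->calls++;
    return prev_cpu;
}

/*
 * Up-migration entry point with the guard in place.
 * Returns the selected CPU, or -1 when EAS is disabled.
 */
int upmigrate_target_model(struct sched_model *s, int prev_cpu)
{
    if (!s->energy_aware)
        return -1;      /* EAS disabled: skip energy-based selection */
    return select_energy_cpu_model(s, prev_cpu);
}
```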
...
...

The 'migration/X' threads should be FIFO threads, so they are not in the CFS
class.

Is this issue in select_task_rq_rt()?
Possibly yes. That said, the 3 patches Ke mentioned don't touch rt IIRC ...
...
Thanks,
Leo Yan
...
On Tuesday 09 Jan 2018 at 11:15:03 (+0800), Ke Wang wrote:
...
Hi Joonwoo, Chris,
When porting EAS 1.4 to our platform, which is SMP (4*A7, k4.4), we
frequently encountered kernel panics after applying the following patches:

| 9e293db sched: EAS: upmigrate misfit current task
| dc626b2 sched: avoid pushing tasks to an offline CPU
| 2da014c sched: Extend active balance to accept 'push_task' argument

After applying these three patches, with EAS left disabled and running a
stability test that includes random CPU plug-in/plug-out, a kernel panic
sometimes happens, always with the same stack, as below:
[  214.742695] c1 ------------[ cut here ]------------
[  214.742709] c1 kernel BUG at
/space/builder/repo/sprdroid8.1_trunk/kernel/kernel/smpboot.c:136!
[  214.742718] c1 Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[  214.748750] c0 Modules linked in: mtty marlin2_fm mali(O)
[  214.748785] c1 CPU: 1 PID: 18 Comm: migration/2 Tainted: G        W
O    4.4.83-00912-g370f62c #1
[  214.748795] c1 Hardware name: Generic DT based system
[  214.748805] c1 task: ef2d9680 task.stack: ee862000
[  214.748821] c1 PC is at smpboot_thread_fn+0x168/0x270
[  214.748832] c1 LR is at smpboot_thread_fn+0xe4/0x270
[  214.748843] c1 pc : [<c014d71c>]    lr : [<c014d698>]    psr: 200e0113
                  sp : ee863f38  ip : ee863f38  fp : ee863f5c
[  214.748854] c1 r10: 00000000  r9 : 00000000  r8 : 00000000
[  214.748862] c1 r7 : 00000001  r6 : c111a814  r5 : ee846140  r4 : ee862000
[  214.748871] c1 r3 : 00000001  r2 : ee863f28  r1 : 00000000  r0 : 00000002
[  214.748881] c1 Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
Segment none
[  214.748890] c1 Control: 10c5387d  Table: 9b9e406a  DAC: 00000051
...
[  214.821339] c1 [<c014d71c>] (smpboot_thread_fn) from [<c0149ee4>]
(kthread+0x118/0x12c)
[  214.821363] c1 [<c0149ee4>] (kthread) from [<c0108310>]
(ret_from_fork+0x14/0x24)
[  214.821378] c1 Code: e5950000 e5943010 e1500003 0a000000 (e7f001f2)
kernel/kernel/smpboot.c:136:
BUG_ON(td->cpu != smp_processor_id());
It seems that the oops was caused by migration/2 actually running on cpu1.
Do you have any suggestions for this? Thanks in advance.
_______________________________________________
eas-dev mailing list
eas-dev@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/eas-dev
