Hi Quentin, Leo,
Thanks for your feedback and help!
Actually, even with only the two extension patches below applied, the issues still happen:
* | dc626b2 sched: avoid pushing tasks to an offline CPU
* | 2da014c sched: Extend active balance to accept 'push_task' argument
Besides the crash I mentioned in the first email, another type of crash also occurs, as below:
[ 2072.653091] c1 ------------[ cut here ]------------
[ 2072.653133] c1 WARNING: CPU: 1 PID: 13 at kernel/fork.c:252 __put_task_struct+0x30/0x124()
[ 2072.653173] c1 CPU: 1 PID: 13 Comm: migration/1 Tainted: G W O 4.4.83-01066-g04c5403-dirty #17
[ 2072.653215] c1 [<c011141c>] (unwind_backtrace) from [<c010ced8>] (show_stack+0x20/0x24)
[ 2072.653235] c1 [<c010ced8>] (show_stack) from [<c043d7f8>] (dump_stack+0xa8/0xe0)
[ 2072.653255] c1 [<c043d7f8>] (dump_stack) from [<c012be04>] (warn_slowpath_common+0x98/0xc4)
[ 2072.653273] c1 [<c012be04>] (warn_slowpath_common) from [<c012beec>] (warn_slowpath_null+0x2c/0x34)
[ 2072.653291] c1 [<c012beec>] (warn_slowpath_null) from [<c01293b4>] (__put_task_struct+0x30/0x124)
[ 2072.653310] c1 [<c01293b4>] (__put_task_struct) from [<c0166964>] (active_load_balance_cpu_stop+0x22c/0x314)
[ 2072.653331] c1 [<c0166964>] (active_load_balance_cpu_stop) from [<c01c2604>] (cpu_stopper_thread+0x90/0x144)
[ 2072.653352] c1 [<c01c2604>] (cpu_stopper_thread) from [<c014d80c>] (smpboot_thread_fn+0x258/0x270)
[ 2072.653370] c1 [<c014d80c>] (smpboot_thread_fn) from [<c0149ee4>] (kthread+0x118/0x12c)
[ 2072.653388] c1 [<c0149ee4>] (kthread) from [<c0108310>] (ret_from_fork+0x14/0x24)
[ 2072.653400] c1 ---[ end trace 49c3d154890763fc ]---
[ 2072.653418] c1 Unable to handle kernel NULL pointer dereference at virtual address 00000000
...
[ 2072.832804] c1 [<c01ba00c>] (put_css_set) from [<c01be870>] (cgroup_free+0x6c/0x78)
[ 2072.832823] c1 [<c01be870>] (cgroup_free) from [<c01293f8>] (__put_task_struct+0x74/0x124)
[ 2072.832844] c1 [<c01293f8>] (__put_task_struct) from [<c0166964>] (active_load_balance_cpu_stop+0x22c/0x314)
[ 2072.832860] c1 [<c0166964>] (active_load_balance_cpu_stop) from [<c01c2604>] (cpu_stopper_thread+0x90/0x144)
[ 2072.832879] c1 [<c01c2604>] (cpu_stopper_thread) from [<c014d80c>] (smpboot_thread_fn+0x258/0x270)
[ 2072.832896] c1 [<c014d80c>] (smpboot_thread_fn) from [<c0149ee4>] (kthread+0x118/0x12c)
[ 2072.832914] c1 [<c0149ee4>] (kthread) from [<c0108310>] (ret_from_fork+0x14/0x24)
[ 2072.832930] c1 Code: f57ff05b f590f000 e3e02000 e3a03001 (e1941f9f)
[ 2072.839208] c1 ---[ end trace 49c3d154890763fd ]---
For this crash, the root cause is that the push_task pointer is used without being initialized on the out_lock path. CPU hotplug in/out probably makes this easier to hit.
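To illustrate the failure mode, here is a minimal user-space sketch. It is not the kernel code and not necessarily what the fix does: struct task, put_task(), balance_stop() and the label name are hypothetical stand-ins. The idea is only that the pointer is set to NULL before any early exit, so the cleanup at the label can test it safely.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted task; not the kernel's task_struct. */
struct task {
    int usage;
};

/* Drop one reference; models put_task_struct() in spirit only. */
static void put_task(struct task *t)
{
    t->usage--;
}

/*
 * Models a stopper callback with an early-exit label.
 * 'bail_early' stands for any condition that jumps to the cleanup label
 * before a task has been picked. Returns 1 if a reference was dropped.
 */
static int balance_stop(struct task *candidate, int bail_early)
{
    struct task *push_task = NULL;  /* initialise before any goto */
    int dropped = 0;

    if (bail_early)
        goto out;

    push_task = candidate;          /* reference handed over by the caller */

out:
    if (push_task) {                /* meaningful only because of the NULL init */
        put_task(push_task);
        dropped = 1;
    }
    return dropped;
}
```

Without the NULL initialisation, the early-exit path would test and dereference whatever happened to be on the stack, which would match a put on a garbage task pointer.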
The crash I mentioned in the first email also seems to be gone after applying the initialization patch.
The Gerrit change is below:
https://android-review.googlesource.com/c/kernel/common/+/586107
Please review, and thanks again.
On 10 January 2018 at 19:39, Quentin Perret quentin.perret@arm.com wrote:
On Wednesday 10 Jan 2018 at 09:07:49 (+0800), Leo Yan wrote:
Hi Ke, Quentin,
On Tue, Jan 09, 2018 at 12:38:42PM +0000, Quentin Perret wrote:
Hi Ke,
Thank you very much for your feedback!
After a quick investigation I noticed that select_energy_cpu_brute() can actually return an offline CPU in some corner cases (if prev_cpu is offline for example). That was not an issue when select_energy_cpu_brute() was called only from select_task_rq_fair() as select_task_rq() would safely call select_fallback_rq() in that case. However, since:
9e293db sched: EAS: upmigrate misfit current task
select_energy_cpu_brute is now called outside of the wakeup path and an active load balance is triggered unconditionally on the CPU that was selected, which might be offline. I'm not an expert in load balance but I suspect this isn't the right thing to do. I'll investigate a little bit more and try to come up with a fix if this is confirmed to be the root cause.
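A minimal user-space sketch of the kind of guard this suggests, under stated assumptions: online_mask, cpu_is_online(), pick_cpu() and pick_push_target() are hypothetical names, and pick_cpu() only models a selector that may hand back an offline prev_cpu. It is not the scheduler code.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical online mask: bit n set means CPU n is online. */
static unsigned int online_mask = 0xF;

static bool cpu_is_online(int cpu)
{
    return cpu >= 0 && cpu < NR_CPUS && (online_mask & (1u << cpu)) != 0;
}

/*
 * Stand-in for a selector that may return prev_cpu unchanged,
 * even if prev_cpu has gone offline in the meantime.
 */
static int pick_cpu(int prev_cpu)
{
    return prev_cpu;
}

/* Returns the CPU to push to, or -1 when no active balance should start. */
static int pick_push_target(int prev_cpu)
{
    int cpu = pick_cpu(prev_cpu);

    if (!cpu_is_online(cpu))
        return -1;  /* skip the balance instead of kicking an offline CPU */
    return cpu;
}
```

In the wakeup path the fallback selection plays this role; outside of it, the caller has to do the check itself.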
Just two reminders:
- Ke mentioned 'leaving EAS disabled'. In that case, shouldn't select_energy_cpu_brute() never be executed?
It is not executed in the wakeup path but the current implementation of the upmigrate thing doesn't check energy_aware() before calling select_energy_cpu_brute(). This is probably a bug as well ...
For this, I also pushed a patch:
https://android-review.googlesource.com/c/kernel/common/+/586110
Could you please help review it? Thanks.
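A minimal user-space sketch of such a gate, for illustration only: eas_enabled, energy_aware_stub(), select_cpu_stub() and upmigrate_target() are hypothetical names, and this is not the content of the patch itself. The selector is simply never called while the switch is off.

```c
#include <stdbool.h>

/* Hypothetical runtime switch; stands in for whatever energy_aware() reads. */
static bool eas_enabled = false;

/* Counts calls to the selector, for illustration. */
static int brute_calls = 0;

static bool energy_aware_stub(void)
{
    return eas_enabled;
}

/* Stand-in for the energy-aware CPU selector. */
static int select_cpu_stub(int prev_cpu)
{
    brute_calls++;
    return prev_cpu;
}

/* Returns the chosen CPU, or -1 if upmigration is skipped because EAS is off. */
static int upmigrate_target(int prev_cpu)
{
    if (!energy_aware_stub())
        return -1;
    return select_cpu_stub(prev_cpu);
}
```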
- The 'migration/X' threads should be FIFO threads, so they are not in the CFS class.
Could this issue be in select_task_rq_rt()?
Possibly yes. That said, the 3 patches Ke mentioned don't touch rt IIRC ...
Thanks, Leo Yan
On Tuesday 09 Jan 2018 at 11:15:03 (+0800), Ke Wang wrote:
Hi Joonwoo, Chris,
When porting EAS1.4 to our platform, which is SMP (4*A7, k4.4), we frequently encountered kernel panics after applying the following patches:
- | 9e293db sched: EAS: upmigrate misfit current task
- | dc626b2 sched: avoid pushing tasks to an offline CPU
- | 2da014c sched: Extend active balance to accept 'push_task' argument
With these three patches applied and EAS left disabled, a stability test that includes random CPU plug-in/plug-out sometimes triggers a kernel panic, always with the same stack as below:
[ 214.742695] c1 ------------[ cut here ]------------
[ 214.742709] c1 kernel BUG at /space/builder/repo/sprdroid8.1_trunk/kernel/kernel/smpboot.c:136!
[ 214.742718] c1 Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 214.748750] c0 Modules linked in: mtty marlin2_fm mali(O)
[ 214.748785] c1 CPU: 1 PID: 18 Comm: migration/2 Tainted: G W O 4.4.83-00912-g370f62c #1
[ 214.748795] c1 Hardware name: Generic DT based system
[ 214.748805] c1 task: ef2d9680 task.stack: ee862000
[ 214.748821] c1 PC is at smpboot_thread_fn+0x168/0x270
[ 214.748832] c1 LR is at smpboot_thread_fn+0xe4/0x270
[ 214.748843] c1 pc : [<c014d71c>] lr : [<c014d698>] psr: 200e0113
sp : ee863f38 ip : ee863f38 fp : ee863f5c
[ 214.748854] c1 r10: 00000000 r9 : 00000000 r8 : 00000000
[ 214.748862] c1 r7 : 00000001 r6 : c111a814 r5 : ee846140 r4 : ee862000
[ 214.748871] c1 r3 : 00000001 r2 : ee863f28 r1 : 00000000 r0 : 00000002
[ 214.748881] c1 Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 214.748890] c1 Control: 10c5387d Table: 9b9e406a DAC: 00000051
...
[ 214.821339] c1 [<c014d71c>] (smpboot_thread_fn) from [<c0149ee4>] (kthread+0x118/0x12c)
[ 214.821363] c1 [<c0149ee4>] (kthread) from [<c0108310>] (ret_from_fork+0x14/0x24)
[ 214.821378] c1 Code: e5950000 e5943010 e1500003 0a000000 (e7f001f2)
kernel/kernel/smpboot.c:136: BUG_ON(td->cpu != smp_processor_id());
It seems that OOPS was caused by migration/2 actually running on cpu1.
Do you have any suggestions for this? Thanks in advance.
_______________________________________________
eas-dev mailing list
eas-dev@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/eas-dev