Hello all, I am working on an Android tablet using a TI OMAP 4470 ES1 SoC, which has two ARM Cortex-A9 cores. The system is running Android 4.2.2 with Linux kernel 3.4.48. My kernel configuration has these debugging options enabled:
CONFIG_CC_STACKPROTECTOR=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
# CONFIG_DEBUG_LOCK_ALLOC is not set
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y
# and FTRACE
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
The system randomly resets due to memory corruption, and in at least one case it looks like the stack is restored incorrectly. One of the crashes is as follows:
[484677.808807] Unable to handle kernel paging request at virtual address 0040049c
[484677.817077] pgd = d5ee8000
[484677.820220] [0040049c] *pgd=00000000
[484677.824615] Process UEventObserver (pid: 764, stack limit = 0xd5b2c2f8)
[484677.832000] Internal error: Oops: 805 [#1] PREEMPT SMP ARM
[484677.838287] Modules linked in: wlcore_sdio(O) wl18xx(O) wlcore(O) mac80211(O) cfg80211(O) compat(O) pvrsrvkm_sgx544_112(O)
[484677.851806] CPU: 0 Tainted: G W O (3.4.48-dirty #1)
[484677.858459] PC is at lock_release+0x9c/0x134
[484677.863433] LR is at _raw_spin_lock_irqsave+0x64/0x70
[484677.869140] pc : [<c00a48f0>] lr : [<c06b7850>] psr: 60000193
[484677.869140] sp : d5b2db20 ip : d5b2daf0 fp : d5b2db54
[484677.882141] r10: 00000000 r9 : d5b2dbf4 r8 : 00000000
[484677.888122] r7 : d5a8bec0 r6 : d5b2c000 r5 : d58a8040 r4 : c00a3710
[484677.895416] r3 : 00400040 r2 : 00000000 r1 : 5bbd5bbc r0 : 60000113
Here is the stack dump with my decoding annotated inline:
[484678.126312] SP: 0xd5b2daa0:
[484678.131408] daa0 00000080 00000000 c0070cf8 00000000 c06b6cb4 c06b64c8 c00a48f0 60000193
[484678.141998] dac0 ffffffff d5b2db0c d5b2db54 d5b2dad8 c06b85d8(lr=__dabt_svc) c000839c(pc=do_DataAbort()) 60000113(r0) 5bbd5bbc(r1)
[484678.152587] dae0 00000000(r2) 00400040(r3) c00a3710(pc=__lock_acquire) push {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc} d58a8040(4) d5b2c000(5) d5a8bec0(6) 00000000(7) d5b2dbf4(8)
[484678.163055] db00 00000000(9) d5b2db54(sl) d5b2daf0(fp) d5b2db20(ip) c06b7850(lr=_raw_spin_lock_irqsave) c00a48f0(pc=lock_release) 60000193 ffffffff push {r3, r4, r5, r6, fp, ip, lr, pc}
[484678.173522] db20 d5b2dd10(r3) d5b2dcf4(r4) d5a86338(r5) 00000000(r6) d5b2db5c(fp) d5b2db40(ip) c0144974(lr=__pollwait) c0070cd4(pc=add_wait_queue)
[484678.184082] db40 d5abb400 d5b2dbfc d5a86338 d5b2dc04 d5b2db74 d5b2db60 c050e470 c01448f0(pc=pollwake)
[484678.194702] db60 d5b2dcf4 d5b2dbfc d5b2db84 d5b2db78 c0500d84 c050e440 d5b2dbe4 d5b2db88
[484678.205200] db80 c0144cc4 c0500d64 d5b2dbac d5b2c000 00000000 00000000 00000000 d5b2dbf4
I found that the UEventObserver userspace process is calling the select() system call, and the kernel code path is:
__pollwait()
  --> add_wait_queue()
    --> _raw_spin_lock_irqsave()
      --> lock_release()
1. The first issue is that I do not understand why _raw_spin_lock_irqsave() calls lock_release() right away. It should call lock_acquire().
2. It looks like the registers popped from the stack are not correct:
It looks like the stack frame for the add_wait_queue() function is correct:
60000193(cpsr) ffffffff
push {r3, r4, r5, r6, fp, ip, lr, pc}
[484678.173522] db20 d5b2dd10(r3) d5b2dcf4(r4) d5a86338(r5) 00000000(r6) d5b2db5c(fp) d5b2db40(ip) c0144974(lr=__pollwait) c0070cd4(pc=add_wait_queue)
The cpsr=60000193 means IRQs are disabled (the I bit is set), so the scheduler cannot preempt this thread either, which means no other task on the same core can affect the current thread of execution.
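To double-check that reading of the PSR, here is a minimal sketch in C that decodes the relevant CPSR bits. The bit positions are the ARMv7-A architectural ones; the helper names (cpsr_irq_masked and so on) are made up for illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-A CPSR fields (architectural bit positions). */
#define CPSR_MODE_MASK 0x1fu      /* bits [4:0]: processor mode */
#define CPSR_F_BIT     (1u << 6)  /* FIQ mask */
#define CPSR_I_BIT     (1u << 7)  /* IRQ mask */
#define CPSR_MODE_SVC  0x13u      /* supervisor mode */

/* Hypothetical helper: true when IRQs are masked in this PSR value. */
static inline bool cpsr_irq_masked(uint32_t cpsr)
{
    return (cpsr & CPSR_I_BIT) != 0;
}

/* Hypothetical helper: true when FIQs are masked in this PSR value. */
static inline bool cpsr_fiq_masked(uint32_t cpsr)
{
    return (cpsr & CPSR_F_BIT) != 0;
}

/* Hypothetical helper: extract the processor mode field. */
static inline uint32_t cpsr_mode(uint32_t cpsr)
{
    return cpsr & CPSR_MODE_MASK;
}
```

With the values from the oops, cpsr_irq_masked(0x60000193) is true, cpsr_fiq_masked(0x60000193) is false, and the mode is SVC. For r0=60000113 the I bit is clear, which would fit r0 holding the flags saved by _raw_spin_lock_irqsave() before it masked IRQs, though that is only my guess.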
The stack frame for the call to lock_release() seems correct too:
push {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}
d58a8040(4) d5b2c000(5) d5a8bec0(6) 00000000(7) d5b2dbf4(8)
[484678.163055] db00 00000000(9) d5b2db54(sl) d5b2daf0(fp) d5b2db20(ip) c06b7850(lr=_raw_spin_lock_irqsave) c00a48f0(pc=lock_release)
I found that:
R5=d5b2c000 points to the thread_info
R4=d58a8040 points to the current task_struct
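The thread_info address can be cross-checked against the stack pointer. The sketch below assumes 8 KiB kernel stacks with thread_info at the lowest address of the stack, which is how I understand 32-bit ARM kernels of this generation to be laid out; THREAD_SIZE_ASSUMED and both helper names are illustrative only.

```c
#include <stdint.h>

/* Assumption: 8 KiB kernel stack, thread_info at the bottom of it. */
#define THREAD_SIZE_ASSUMED 8192u

/* Hypothetical helper: base of the stack (and thread_info) for a given sp. */
static inline uint32_t thread_info_base(uint32_t sp)
{
    return sp & ~(THREAD_SIZE_ASSUMED - 1u);
}

/* Hypothetical helper: bytes of stack in use, counted from the top down to sp. */
static inline uint32_t stack_bytes_used(uint32_t sp)
{
    return thread_info_base(sp) + THREAD_SIZE_ASSUMED - sp;
}
```

For sp=d5b2db20 this gives a base of d5b2c000, matching the value above, and 0x4e0 bytes in use. The sp is also well above the "stack limit = 0xd5b2c2f8" printed in the oops, so under these assumptions this does not look like a plain stack overflow.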
But it looks like popping from the stack went wrong, and then the system crashed:
r6 : d5b2c000 r5 : d58a8040 r4 : c00a3710
Now R6=d5b2c000 (which is the R5 slot on the stack), R5=d58a8040 (which is the R4 slot on the stack), and R4=c00a3710 (this seems to be the LR of another stack frame).
Has anyone faced this issue? I suspect that enabling profiling when building the kernel causes it, because of the code generated for add_wait_queue():
void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
{
c0070cc8: e1a0c00d mov ip, sp
c0070ccc: e92dd878 push {r3, r4, r5, r6, fp, ip, lr, pc}
c0070cd0: e24cb004 sub fp, ip, #4
c0070cd4: e92d4000 push {lr}
c0070cd8: ebfe8c62 *bl c0013e68 <__gnu_mcount_nc>*
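To make sure I read the two push encodings correctly, here is a small C sketch that decodes the register list of an ARM "push" (STMDB sp!, {...}) instruction. The encoding check and the register-list bitmask follow the ARM instruction set; the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static const char *const reg_names[16] = {
    "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
    "r8", "r9", "sl", "fp", "ip", "sp", "lr", "pc"
};

/* True for "stmdb sp!, {...}" with condition AL, i.e. a multi-register push. */
static inline int is_push_multiple(uint32_t insn)
{
    return (insn & 0xffff0000u) == 0xe92d0000u;
}

/*
 * Hypothetical helper: write the register list of a push instruction into
 * buf as "r3, r4, ..." and return how many registers were written.
 * Bit n of the low 16 bits selects register n.
 */
static inline int format_push_list(uint32_t insn, char *buf, size_t len)
{
    int count = 0;
    size_t used = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (int r = 0; r < 16; r++) {
        if (!(insn & (1u << r)))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? ", " : "", reg_names[r]);
        if (n < 0 || (size_t)n >= len - used)
            break; /* buffer too small: stop, output is truncated */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For e92dd878 this yields "r3, r4, r5, r6, fp, ip, lr, pc" (8 registers, 32 bytes), and for e92d4000 it yields "lr" (4 more bytes), which matches the disassembly. As far as I understand the EABI mcount convention, __gnu_mcount_nc is expected to pop that extra lr itself before returning, so the stack pointer should be balanced again after the call, but I have not verified this on my build.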
The branch to __gnu_mcount_nc() is generated by GCC when the -pg option is enabled. Could that somehow cause a problem with the stack setup? Many thanks for reading this long mail. Any ideas are very much appreciated.
Thanks again.
linaro-android@lists.linaro.org