We have been seeing kernel panics on certain systems with the 4.9.x stable kernel, where x >= 98. We captured the following panic message: [ 6.252000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000102 [ 6.261222] IP: [<ffffffff810a8ea2>] __queue_work+0x32/0x430 [ 6.267826] PGD 0 [ 6.269951] [ 6.271887] Oops: 0000 [#1] SMP [ 6.275576] Modules linked in: [ 6.279264] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.100 #1 [ 6.296760] task: ffffffff81c114c0 task.stack: ffffffff81c00000 [ 6.303564] RIP: 0010:[<ffffffff810a8ea2>] [<ffffffff810a8ea2>] __queue_work+0x32/0x430 [ 6.312997] RSP: 0000:ffff897a40403d98 EFLAGS: 00010046 [ 6.319120] RAX: 0000000000000082 RBX: 0000000000000046 RCX: 0000000000000000 [ 6.327282] RDX: ffffffff81d02300 RSI: 0000000000000000 RDI: 0000000000002000 [ 6.335443] RBP: ffff897a40403dd0 R08: 00000000d431dd4b R09: 0000000000000000 [ 6.343606] R10: ffff897a40403e08 R11: ffffffff8227615c R12: ffffffff81d02300 [ 6.351766] R13: 0000000000002000 R14: 0000000000000000 R15: ffffffff81a88ffd [ 6.359930] FS: 0000000000000000(0000) GS:ffff897a40400000(0000) knlGS:0000000000000000 [ 6.369274] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6.375883] CR2: 0000000000000102 CR3: 0000000001c08000 CR4: 0000000000040630 [ 6.384045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6.392206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6.400366] Stack: [ 6.402801] 0000001000000000 0000200040403e08 0000000000000046 0000000000000002 [ 6.411723] ffffffff82275f40 ffff897a40403e08 ffffffff81a88ffd ffff897a40403de8 [ 6.420638] ffffffff810a9677 ffffffff82275f88 ffff897a40403e78 ffffffff8148dff4 [ 6.429554] Call Trace: [ 6.432473] <IRQ> [ 6.434698] [<ffffffff810a9677>] queue_work_on+0x27/0x40 [ 6.441121] [<ffffffff8148dff4>] crng_reseed+0x154/0x260 [ 6.447343] [<ffffffff8148e334>] credit_entropy_bits+0x234/0x2a0 [ 6.454343] [<ffffffff8148e77b>] ? add_interrupt_randomness+0x1bb/0x220 [ 6.462020] [<ffffffff8148e77b>] add_interrupt_randomness+0x1bb/0x220 [ 6.469506] [<ffffffff810ef8f0>] handle_irq_event_percpu+0x40/0x80 [ 6.476698] [<ffffffff810ef96b>] handle_irq_event+0x3b/0x60 [ 6.483211] [<ffffffff810f2e7f>] handle_level_irq+0x8f/0x110 [ 6.489823] [<ffffffff81030fd5>] handle_irq+0xb5/0x140 [ 6.495854] [<ffffffff81096651>] ? _local_bh_enable+0x21/0x50 [ 6.502562] [<ffffffff81038ab5>] ? __exit_idle+0x5/0x30 [ 6.508689] [<ffffffff817250cd>] do_IRQ+0x4d/0xe0 [ 6.514232] [<ffffffff81722ba0>] common_interrupt+0xa0/0xa0 [ 6.520744] <EOI> [ 6.522968] [<ffffffff81ddeb08>] ? vfs_caches_init+0xe1/0xe3 [ 6.529779] [<ffffffff81da212e>] start_kernel+0x451/0x4c8 [ 6.536096] [<ffffffff81da1acb>] ? set_init_arg+0x55/0x55 [ 6.542415] [<ffffffff81da1120>] ? early_idt_handler_array+0x120/0x120 [ 6.549995] [<ffffffff81da15d6>] x86_64_start_reservations+0x2a/0x2c [ 6.557381] [<ffffffff81da1714>] x86_64_start_kernel+0x13c/0x15f [ 6.564378] Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 53 48 83 ec 10 89 7d d4 ff 14 25 80 86 c3 81 f6 c4 02 0f 85 1f 03 00 00 <41> f6 86 02 01 00 00 01 0f 85 b2 02 00 00 49 c7 c7 78 69 01 00 [ 6.591500] RIP [<ffffffff810a8ea2>] __queue_work+0x32/0x430 [ 6.598197] RSP <ffff897a40403d98> [ 6.602281] CR2: 0000000000000102
The NULL pointer dereference happens because the second argument to __queue_work is 0. The second argument is a struct workqueue_struct, specifically system_wq. This shows that an interrupt happened before the system_wq was initialized. We believe the problem was introduced with this commit:
$ git show c3ff2da5cef05676d490fa9057b2dceb5e48cdb9 commit c3ff2da5cef05676d490fa9057b2dceb5e48cdb9 Author: Theodore Ts'o tytso@mit.edu Date: Mon Apr 23 18:51:28 2018 -0400
random: fix possible sleeping allocation from irq context
commit 6c1e851c4edc13a43adb3ea4044e3fc8f43ccf7d upstream.
We can do a sleeping allocation from an irq context when CONFIG_NUMA is enabled. Fix this by initializing the NUMA crng instances in a workqueue.
Reported-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Reported-by: syzbot+9de458f6a5e713ee8c1a@syzkaller.appspotmail.com Fixes: 8ef35c866f8862df ("random: set up the NUMA crng instances...") Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
diff --git a/drivers/char/random.c b/drivers/char/random.c index b979173..dbfb3e69 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -820,7 +820,7 @@ static int crng_fast_load(const char *cp, size_t len) }
#ifdef CONFIG_NUMA -static void numa_crng_init(void) +static void do_numa_crng_init(struct work_struct *work) { int i; struct crng_state *crng; @@ -841,6 +841,13 @@ static void numa_crng_init(void) kfree(pool); } } + +static DECLARE_WORK(numa_crng_init_work, do_numa_crng_init); + +static void numa_crng_init(void) +{ + schedule_work(&numa_crng_init_work); +} #else static void numa_crng_init(void) {} #endif
Apparently we can't count on system_wq being initialized when schedule_work is called in numa_crng_init from an early interrupt.
I don't understand the underlying code enough to propose a fix right away.
thanks, Max
On Wed, Jun 06, 2018 at 11:26:46PM +0000, Max Asbock wrote:
We have been seeing kernel panics on certain systems with the 4.9.x stable kernel, where x >= 98. We captured the following panic message: [ 6.252000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000102 [ 6.261222] IP: [<ffffffff810a8ea2>] __queue_work+0x32/0x430 [ 6.267826] PGD 0 [ 6.269951] [ 6.271887] Oops: 0000 [#1] SMP [ 6.275576] Modules linked in: [ 6.279264] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.100 #1 [ 6.296760] task: ffffffff81c114c0 task.stack: ffffffff81c00000 [ 6.303564] RIP: 0010:[<ffffffff810a8ea2>] [<ffffffff810a8ea2>] __queue_work+0x32/0x430 [ 6.312997] RSP: 0000:ffff897a40403d98 EFLAGS: 00010046 [ 6.319120] RAX: 0000000000000082 RBX: 0000000000000046 RCX: 0000000000000000 [ 6.327282] RDX: ffffffff81d02300 RSI: 0000000000000000 RDI: 0000000000002000 [ 6.335443] RBP: ffff897a40403dd0 R08: 00000000d431dd4b R09: 0000000000000000 [ 6.343606] R10: ffff897a40403e08 R11: ffffffff8227615c R12: ffffffff81d02300 [ 6.351766] R13: 0000000000002000 R14: 0000000000000000 R15: ffffffff81a88ffd [ 6.359930] FS: 0000000000000000(0000) GS:ffff897a40400000(0000) knlGS:0000000000000000 [ 6.369274] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6.375883] CR2: 0000000000000102 CR3: 0000000001c08000 CR4: 0000000000040630 [ 6.384045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6.392206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6.400366] Stack: [ 6.402801] 0000001000000000 0000200040403e08 0000000000000046 0000000000000002 [ 6.411723] ffffffff82275f40 ffff897a40403e08 ffffffff81a88ffd ffff897a40403de8 [ 6.420638] ffffffff810a9677 ffffffff82275f88 ffff897a40403e78 ffffffff8148dff4 [ 6.429554] Call Trace: [ 6.432473] <IRQ> [ 6.434698] [<ffffffff810a9677>] queue_work_on+0x27/0x40 [ 6.441121] [<ffffffff8148dff4>] crng_reseed+0x154/0x260 [ 6.447343] [<ffffffff8148e334>] credit_entropy_bits+0x234/0x2a0 [ 6.454343] [<ffffffff8148e77b>] ? add_interrupt_randomness+0x1bb/0x220 [ 6.462020] [<ffffffff8148e77b>] add_interrupt_randomness+0x1bb/0x220 [ 6.469506] [<ffffffff810ef8f0>] handle_irq_event_percpu+0x40/0x80 [ 6.476698] [<ffffffff810ef96b>] handle_irq_event+0x3b/0x60 [ 6.483211] [<ffffffff810f2e7f>] handle_level_irq+0x8f/0x110 [ 6.489823] [<ffffffff81030fd5>] handle_irq+0xb5/0x140 [ 6.495854] [<ffffffff81096651>] ? _local_bh_enable+0x21/0x50 [ 6.502562] [<ffffffff81038ab5>] ? __exit_idle+0x5/0x30 [ 6.508689] [<ffffffff817250cd>] do_IRQ+0x4d/0xe0 [ 6.514232] [<ffffffff81722ba0>] common_interrupt+0xa0/0xa0 [ 6.520744] <EOI> [ 6.522968] [<ffffffff81ddeb08>] ? vfs_caches_init+0xe1/0xe3 [ 6.529779] [<ffffffff81da212e>] start_kernel+0x451/0x4c8 [ 6.536096] [<ffffffff81da1acb>] ? set_init_arg+0x55/0x55 [ 6.542415] [<ffffffff81da1120>] ? early_idt_handler_array+0x120/0x120 [ 6.549995] [<ffffffff81da15d6>] x86_64_start_reservations+0x2a/0x2c [ 6.557381] [<ffffffff81da1714>] x86_64_start_kernel+0x13c/0x15f [ 6.564378] Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 53 48 83 ec 10 89 7d d4 ff 14 25 80 86 c3 81 f6 c4 02 0f 85 1f 03 00 00 <41> f6 86 02 01 00 00 01 0f 85 b2 02 00 00 49 c7 c7 78 69 01 00 [ 6.591500] RIP [<ffffffff810a8ea2>] __queue_work+0x32/0x430 [ 6.598197] RSP <ffff897a40403d98> [ 6.602281] CR2: 0000000000000102
The NULL pointer dereference happens because the second argument to __queue_work is 0. The second argument is a struct workqueue_struct, specifically system_wq. This shows that an interrupt happened before the system_wq was initialized. We believe the problem was introduced with this commit:
$ git show c3ff2da5cef05676d490fa9057b2dceb5e48cdb9 commit c3ff2da5cef05676d490fa9057b2dceb5e48cdb9 Author: Theodore Ts'o tytso@mit.edu Date: Mon Apr 23 18:51:28 2018 -0400
random: fix possible sleeping allocation from irq context
commit 6c1e851c4edc13a43adb3ea4044e3fc8f43ccf7d upstream. We can do a sleeping allocation from an irq context when CONFIG_NUMA is enabled. Fix this by initializing the NUMA crng instances in a workqueue. Reported-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Reported-by: syzbot+9de458f6a5e713ee8c1a@syzkaller.appspotmail.com Fixes: 8ef35c866f8862df ("random: set up the NUMA crng instances...") Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
diff --git a/drivers/char/random.c b/drivers/char/random.c index b979173..dbfb3e69 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -820,7 +820,7 @@ static int crng_fast_load(const char *cp, size_t len) } #ifdef CONFIG_NUMA -static void numa_crng_init(void) +static void do_numa_crng_init(struct work_struct *work) { int i; struct crng_state *crng; @@ -841,6 +841,13 @@ static void numa_crng_init(void) kfree(pool); } }
+static DECLARE_WORK(numa_crng_init_work, do_numa_crng_init);
+static void numa_crng_init(void) +{
schedule_work(&numa_crng_init_work);
+} #else static void numa_crng_init(void) {} #endif
Apparently we can't count on system_wq being initialized when schedule_work is called in numa_crng_init from an early interrupt.
I don't understand the underlying code enough to propose a fix right away.
Ick :(
I'm guessing you also see these problems on 4.17? Can you test there to be sure of that?
thanks,
greg k-h
________________________________________ From: Greg KH [greg@kroah.com] Sent: Thursday, June 07, 2018 1:37 AM To: Max Asbock Cc: stable@vger.kernel.org; tytso@mit.edu; Chris McDermott Subject: [External] Re: panic at boot time with kernel >= 4.9.98 - uninitialized system_wq in early interrupt
Ick :(
I'm guessing you also see these problems on 4.17? Can you test there to be sure of that?
We haven't had a chance to test 4.17 on the system where this happens. I am suspecting this won't be a problem on 4.17 as workqueue init has been split up and there is now a workqueue_init_early() in start_kernel(): /* * Allow workqueue creation and work item queueing/cancelling * early. Work item execution depends on kthreads and starts after * workqueue_init(). */ workqueue_init_early();
So far we have only seen this with 4.9.x. Also, this only happens when lots of memory is installed (10TB). i am guessing the large memory size changes the timing of the initialization steps and brings out the problem. When we get access to the system again we can attempt to boot the latest main-line kernel to verify that the work_init_early indeed fixes the issue there.
Max
On Thu, Jun 07, 2018 at 05:54:56PM +0000, Max Asbock wrote:
From: Greg KH [greg@kroah.com] Sent: Thursday, June 07, 2018 1:37 AM To: Max Asbock Cc: stable@vger.kernel.org; tytso@mit.edu; Chris McDermott Subject: [External] Re: panic at boot time with kernel >= 4.9.98 - uninitialized system_wq in early interrupt
Ick :(
I'm guessing you also see these problems on 4.17? Can you test there to be sure of that?
We haven't had a chance to test 4.17 on the system where this happens. I am suspecting this won't be a problem on 4.17 as workqueue init has been split up and there is now a workqueue_init_early() in start_kernel(): /* * Allow workqueue creation and work item queueing/cancelling * early. Work item execution depends on kthreads and starts after * workqueue_init(). */ workqueue_init_early();
So far we have only seen this with 4.9.x. Also, this only happens when lots of memory is installed (10TB). i am guessing the large memory size changes the timing of the initialization steps and brings out the problem. When we get access to the system again we can attempt to boot the latest main-line kernel to verify that the work_init_early indeed fixes the issue there.
Ugh, I forgot about the workqueue rewrite.
How about 4.14, does that work for you? If so, just use that, you shouldn't be using the 4.9 kernel tree on x86-based hardware unless you are somehow forced to due to horrible closed source kernel drivers. You want and need the fixes and speedups that are on 4.14.y, it's a measurable difference.
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org