With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the following tracing commands will trigger a crash due to a NULL pointer dereference:
KDIR=/sys/kernel/debug/tracing echo 1 > $KDIR/tracing_on echo 1 > $KDIR/events/nvme/enable echo "Waiting for trace events..." cat $KDIR/trace_pipe
The backtrace looks something like this:
Call Trace: <IRQ> ? __die_body+0x6b/0xb0 ? __die+0x9e/0xb0 ? no_context+0x3eb/0x460 ? ttwu_do_activate+0xf0/0x120 ? __bad_area_nosemaphore+0x157/0x200 ? select_idle_sibling+0x2f/0x410 ? bad_area_nosemaphore+0x13/0x20 ? do_user_addr_fault+0x2ab/0x360 ? exc_page_fault+0x69/0x180 ? asm_exc_page_fault+0x1e/0x30 ? trace_event_raw_event_nvme_complete_rq+0xba/0x170 ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170 nvme_complete_rq+0x168/0x170 nvme_pci_complete_rq+0x16c/0x1f0 nvme_handle_cqe+0xde/0x190 nvme_irq+0x78/0x100 __handle_irq_event_percpu+0x77/0x1e0 handle_irq_event+0x54/0xb0 handle_edge_irq+0xdf/0x230 asm_call_irq_on_stack+0xf/0x20 </IRQ> common_interrupt+0x9e/0x150 asm_common_interrupt+0x1e/0x40
It looks to me like these two upstream commits were backported to 5.10:
679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
But they depend on this upstream commit to initialize the 'cmd' field in some cases:
f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
Does it sound like I'm on the right track? The 5.15LTS and later seems to be okay.
On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the following tracing commands will trigger a crash due to a NULL pointer dereference:
KDIR=/sys/kernel/debug/tracing echo 1 > $KDIR/tracing_on echo 1 > $KDIR/events/nvme/enable echo "Waiting for trace events..." cat $KDIR/trace_pipe
The backtrace looks something like this:
Call Trace:
<IRQ> ? __die_body+0x6b/0xb0 ? __die+0x9e/0xb0 ? no_context+0x3eb/0x460 ? ttwu_do_activate+0xf0/0x120 ? __bad_area_nosemaphore+0x157/0x200 ? select_idle_sibling+0x2f/0x410 ? bad_area_nosemaphore+0x13/0x20 ? do_user_addr_fault+0x2ab/0x360 ? exc_page_fault+0x69/0x180 ? asm_exc_page_fault+0x1e/0x30 ? trace_event_raw_event_nvme_complete_rq+0xba/0x170 ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170 nvme_complete_rq+0x168/0x170 nvme_pci_complete_rq+0x16c/0x1f0 nvme_handle_cqe+0xde/0x190 nvme_irq+0x78/0x100 __handle_irq_event_percpu+0x77/0x1e0 handle_irq_event+0x54/0xb0 handle_edge_irq+0xdf/0x230 asm_call_irq_on_stack+0xf/0x20 </IRQ> common_interrupt+0x9e/0x150 asm_common_interrupt+0x1e/0x40
It looks to me like these two upstream commits were backported to 5.10:
679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
But they depend on this upstream commit to initialize the 'cmd' field in some cases:
f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
Does it sound like I'm on the right track? The 5.15LTS and later seems to be okay.
If you apply that commit, does it solve the issue for you?
thanks,
greg k-h
On Thu, Jan 11, 2024 at 1:46 AM Greg KH gregkh@linuxfoundation.org wrote:
On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the following tracing commands will trigger a crash due to a NULL pointer dereference:
KDIR=/sys/kernel/debug/tracing echo 1 > $KDIR/tracing_on echo 1 > $KDIR/events/nvme/enable echo "Waiting for trace events..." cat $KDIR/trace_pipe
The backtrace looks something like this:
Call Trace:
<IRQ> ? __die_body+0x6b/0xb0 ? __die+0x9e/0xb0 ? no_context+0x3eb/0x460 ? ttwu_do_activate+0xf0/0x120 ? __bad_area_nosemaphore+0x157/0x200 ? select_idle_sibling+0x2f/0x410 ? bad_area_nosemaphore+0x13/0x20 ? do_user_addr_fault+0x2ab/0x360 ? exc_page_fault+0x69/0x180 ? asm_exc_page_fault+0x1e/0x30 ? trace_event_raw_event_nvme_complete_rq+0xba/0x170 ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170 nvme_complete_rq+0x168/0x170 nvme_pci_complete_rq+0x16c/0x1f0 nvme_handle_cqe+0xde/0x190 nvme_irq+0x78/0x100 __handle_irq_event_percpu+0x77/0x1e0 handle_irq_event+0x54/0xb0 handle_edge_irq+0xdf/0x230 asm_call_irq_on_stack+0xf/0x20 </IRQ> common_interrupt+0x9e/0x150 asm_common_interrupt+0x1e/0x40
It looks to me like these two upstream commits were backported to 5.10:
679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
But they depend on this upstream commit to initialize the 'cmd' field in some cases:
f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
Does it sound like I'm on the right track? The 5.15LTS and later seems to be okay.
If you apply that commit, does it solve the issue for you?
thanks,
greg k-h
The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough") upstream commit doesn't apply cleanly to 5.10LTS. If I adjust it to fit, then the crash no longer occurs for me.
A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
My leaning would be for a revert from 5.10LTS, but I think the maintainers would have better insight then me. It's also possible that this isn't serious enough to worry about in general. I don't really know.
On 1/11/24 10:00 AM, John Sperbeck wrote:
On Thu, Jan 11, 2024 at 1:46?AM Greg KH gregkh@linuxfoundation.org wrote:
On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the following tracing commands will trigger a crash due to a NULL pointer dereference:
KDIR=/sys/kernel/debug/tracing echo 1 > $KDIR/tracing_on echo 1 > $KDIR/events/nvme/enable echo "Waiting for trace events..." cat $KDIR/trace_pipe
The backtrace looks something like this:
Call Trace:
<IRQ> ? __die_body+0x6b/0xb0 ? __die+0x9e/0xb0 ? no_context+0x3eb/0x460 ? ttwu_do_activate+0xf0/0x120 ? __bad_area_nosemaphore+0x157/0x200 ? select_idle_sibling+0x2f/0x410 ? bad_area_nosemaphore+0x13/0x20 ? do_user_addr_fault+0x2ab/0x360 ? exc_page_fault+0x69/0x180 ? asm_exc_page_fault+0x1e/0x30 ? trace_event_raw_event_nvme_complete_rq+0xba/0x170 ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170 nvme_complete_rq+0x168/0x170 nvme_pci_complete_rq+0x16c/0x1f0 nvme_handle_cqe+0xde/0x190 nvme_irq+0x78/0x100 __handle_irq_event_percpu+0x77/0x1e0 handle_irq_event+0x54/0xb0 handle_edge_irq+0xdf/0x230 asm_call_irq_on_stack+0xf/0x20 </IRQ> common_interrupt+0x9e/0x150 asm_common_interrupt+0x1e/0x40
It looks to me like these two upstream commits were backported to 5.10:
679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
But they depend on this upstream commit to initialize the 'cmd' field in some cases:
f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
Does it sound like I'm on the right track? The 5.15LTS and later seems to be okay.
If you apply that commit, does it solve the issue for you?
thanks,
greg k-h
The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough") upstream commit doesn't apply cleanly to 5.10LTS. If I adjust it to fit, then the crash no longer occurs for me.
A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
My leaning would be for a revert from 5.10LTS, but I think the maintainers would have better insight then me. It's also possible that this isn't serious enough to worry about in general. I don't really know.
Either solution is fine with me, doesn't really matter. I was wondering how this ended up in stable, and it looks like it was one of those auto-selections... Those seem particularly dangerous the further back you go.
On Thu, Jan 11, 2024 at 12:38:03PM -0700, Jens Axboe wrote:
On 1/11/24 10:00 AM, John Sperbeck wrote:
On Thu, Jan 11, 2024 at 1:46?AM Greg KH gregkh@linuxfoundation.org wrote:
On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the following tracing commands will trigger a crash due to a NULL pointer dereference:
KDIR=/sys/kernel/debug/tracing echo 1 > $KDIR/tracing_on echo 1 > $KDIR/events/nvme/enable echo "Waiting for trace events..." cat $KDIR/trace_pipe
The backtrace looks something like this:
Call Trace:
<IRQ> ? __die_body+0x6b/0xb0 ? __die+0x9e/0xb0 ? no_context+0x3eb/0x460 ? ttwu_do_activate+0xf0/0x120 ? __bad_area_nosemaphore+0x157/0x200 ? select_idle_sibling+0x2f/0x410 ? bad_area_nosemaphore+0x13/0x20 ? do_user_addr_fault+0x2ab/0x360 ? exc_page_fault+0x69/0x180 ? asm_exc_page_fault+0x1e/0x30 ? trace_event_raw_event_nvme_complete_rq+0xba/0x170 ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170 nvme_complete_rq+0x168/0x170 nvme_pci_complete_rq+0x16c/0x1f0 nvme_handle_cqe+0xde/0x190 nvme_irq+0x78/0x100 __handle_irq_event_percpu+0x77/0x1e0 handle_irq_event+0x54/0xb0 handle_edge_irq+0xdf/0x230 asm_call_irq_on_stack+0xf/0x20 </IRQ> common_interrupt+0x9e/0x150 asm_common_interrupt+0x1e/0x40
It looks to me like these two upstream commits were backported to 5.10:
679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
But they depend on this upstream commit to initialize the 'cmd' field in some cases:
f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
Does it sound like I'm on the right track? The 5.15LTS and later seems to be okay.
If you apply that commit, does it solve the issue for you?
thanks,
greg k-h
The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough") upstream commit doesn't apply cleanly to 5.10LTS. If I adjust it to fit, then the crash no longer occurs for me.
A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
My leaning would be for a revert from 5.10LTS, but I think the maintainers would have better insight then me. It's also possible that this isn't serious enough to worry about in general. I don't really know.
Either solution is fine with me, doesn't really matter. I was wondering how this ended up in stable, and it looks like it was one of those auto-selections... Those seem particularly dangerous the further back you go.
Now reverted, thanks. But note, that commit does say it fixes an issue this far back, which is why it was applied.
greg k-h
On 1/13/24 2:34 AM, Greg KH wrote:
On Thu, Jan 11, 2024 at 12:38:03PM -0700, Jens Axboe wrote:
On 1/11/24 10:00 AM, John Sperbeck wrote:
On Thu, Jan 11, 2024 at 1:46?AM Greg KH gregkh@linuxfoundation.org wrote:
On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the following tracing commands will trigger a crash due to a NULL pointer dereference:
KDIR=/sys/kernel/debug/tracing echo 1 > $KDIR/tracing_on echo 1 > $KDIR/events/nvme/enable echo "Waiting for trace events..." cat $KDIR/trace_pipe
The backtrace looks something like this:
Call Trace:
<IRQ> ? __die_body+0x6b/0xb0 ? __die+0x9e/0xb0 ? no_context+0x3eb/0x460 ? ttwu_do_activate+0xf0/0x120 ? __bad_area_nosemaphore+0x157/0x200 ? select_idle_sibling+0x2f/0x410 ? bad_area_nosemaphore+0x13/0x20 ? do_user_addr_fault+0x2ab/0x360 ? exc_page_fault+0x69/0x180 ? asm_exc_page_fault+0x1e/0x30 ? trace_event_raw_event_nvme_complete_rq+0xba/0x170 ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170 nvme_complete_rq+0x168/0x170 nvme_pci_complete_rq+0x16c/0x1f0 nvme_handle_cqe+0xde/0x190 nvme_irq+0x78/0x100 __handle_irq_event_percpu+0x77/0x1e0 handle_irq_event+0x54/0xb0 handle_edge_irq+0xdf/0x230 asm_call_irq_on_stack+0xf/0x20 </IRQ> common_interrupt+0x9e/0x150 asm_common_interrupt+0x1e/0x40
It looks to me like these two upstream commits were backported to 5.10:
679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
But they depend on this upstream commit to initialize the 'cmd' field in some cases:
f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
Does it sound like I'm on the right track? The 5.15LTS and later seems to be okay.
If you apply that commit, does it solve the issue for you?
thanks,
greg k-h
The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough") upstream commit doesn't apply cleanly to 5.10LTS. If I adjust it to fit, then the crash no longer occurs for me.
A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
My leaning would be for a revert from 5.10LTS, but I think the maintainers would have better insight then me. It's also possible that this isn't serious enough to worry about in general. I don't really know.
Either solution is fine with me, doesn't really matter. I was wondering how this ended up in stable, and it looks like it was one of those auto-selections... Those seem particularly dangerous the further back you go.
Now reverted, thanks. But note, that commit does say it fixes an issue this far back, which is why it was applied.
Yeah, looking into it, I can see how this was applied. And hard to know not to, it's one of those boundary conditions that may happen for stable backports and that only the ones intimate with the nvme code will spot. Just one of those things, I think.
linux-stable-mirror@lists.linaro.org