The patch usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB https://patches.linaro.org/project/linux-usb/patch/20240429140245.3955523-11... causes The Linux kernel 6.1.98 https://lkml.org/lkml/2024/7/9/645 to crash when plugging in a USB Seagate drive. https://www.seagate.com/ca/en/products/gaming-drives/pc-gaming/firecuda-gami... This is a regression.
Behavior of 6.1.98: ============================================================================== scsi host1: uas_eh_device_reset_handler start rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 0-...!: (1 GPs behind) idle=686c/0/0x1 softirq=1841/1841 fqs=610 (detected by 4, t=5253 jiffies, g=2269, q=225 ncpus=6) Task dump for CPU 0: task:swapper/0 state:R running task stack:0 pid:0 ppid:0 flags:0x0000000 8 Call trace: __switch_to+0xe4/0x160 0xd7f8f808 rcu: rcu_preempt kthread timer wakeup didn't happen for 4037 jiffies! g2269 f0x0 RCU_GP_WAIT_FQS (5) ->state=0x402 rcu: Possible timer handling issue on cpu=5 timer-softirq=1141 rcu: rcu_preempt kthread starved for 4043 jiffies! g2269 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 -
cpu=5
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:I stack:0 pid:14 ppid:2 flags:0x00000008 Call trace: __switch_to+0xe4/0x160 __schedule+0x28c/0x710 schedule+0x5c/0xd0 schedule_timeout+0x8c/0x100 rcu_gp_fqs_loop+0x140/0x4a0 rcu_gp_kthread+0x13c/0x170 kthread+0x108/0x10c ret_from_fork+0x10/0x20 rcu: Stack dump where RCU GP kthread last ran: Task dump for CPU 5: task:kworker/5:1 state:R running task stack:0 pid:89 ppid:2 flags:0x0000000 8 Workqueue: events xhci_handle_command_timeout Call trace: __switch_to+0xe4/0x160 0x0
Behavior of 6.1.97 (or 6.1.978 with the patch reverted): ============================================================================== scsi host1: uas_eh_device_reset_handler start usb 2-1.4.2.1: reset SuperSpeed USB device number 6 using xhci-hcd scsi host1: uas_eh_device_reset_handler success sd 1:0:0:0:31251759103 512-byte logical blocks: (16.0 TB/14.6 TiB) sd 1:0:0:0:Write Protect is off sd 1:0:0:0:Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 1:0:0:0:Preferred minimum I/O size 512 bytes sd 1:0:0:0:Optimal transfer size 33553920 bytes sdb: sdb1 sd 1:0:0:0:Attached SCSI disk sd 0:0:0:0:tag#6 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN sd 0:0:0:0:tag#6 CDB: opcode=0x9e, sa=0x10 9e 10 00 00 00 00 00 00 00 00 00 00 00 20 00 0 0 scsi host0: uas_eh_device_reset_handler start usb 2-1.4.1: reset SuperSpeed USB device number 5 using xhci-hcd scsi host0: uas_eh_device_reset_handler success sd 0:0:0:0:31251759103 512-byte logical blocks: (16.0 TB/14.6 TiB) sd 0:0:0:0:Write Protect is off sd 0:0:0:0:Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 0:0:0:0:Preferred minimum I/O size 512 bytes sd 0:0:0:0:Optimal transfer size 33553920 bytes sda: sda1 sd 0:0:0:0:Attached SCSI disk
On Sat, Jul 13, 2024 at 01:52:52PM -0400, Tim Lewis wrote:
The patch usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB https://patches.linaro.org/project/linux-usb/patch/20240429140245.3955523-11... causes The Linux kernel 6.1.98 https://lkml.org/lkml/2024/7/9/645 to crash when plugging in a USB Seagate drive. https://www.seagate.com/ca/en/products/gaming-drives/pc-gaming/firecuda-gami... This is a regression.
Ick, is this also a problem with the latest 6.6 and/or the latest 6.9 and/or Linus's tree?
thanks,
greg k-h
On Sun, Jul 14, 2024 at 2:30 AM Greg KH gregkh@linuxfoundation.org wrote:
On Sat, Jul 13, 2024 at 01:52:52PM -0400, Tim Lewis wrote:
usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB
Ick, is this also a problem with the latest 6.6 and/or the latest 6.9 and/or Linus's tree?
The problem did not occur on 6.9.9 I'll test 6.6.y next.
On Sun, Jul 14, 2024 at 8:51 AM Tim Lewis elatllat@gmail.com wrote:
On Sun, Jul 14, 2024 at 2:30 AM Greg KH gregkh@linuxfoundation.org wrote:
On Sat, Jul 13, 2024 at 01:52:52PM -0400, Tim Lewis wrote:
usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB
Ick, is this also a problem with the latest 6.6 and/or the latest 6.9 and/or Linus's tree?
The problem did not occur on 6.9.9 I'll test 6.6.y next.
The problem did occur on 6.6.y I'm going to re-test 6.9.y ...
This looks like bug 219039, please see if my suggested solution works.
The upstream commit is correct, because the call to inc_deq() has been moved outside handle_tx_event() so there is no longer this critical difference between doing 'goto cleanup' and 'return 0'. The intended change of this commit also makes sense to me.
This refactor is already present in v6.9 so I don't think the commit will have any effect besides fixing the isochronous bug which it is meant to fix.
But it is not present in v6.6 and v6.1, so they break/crash/hang/etc. Symptoms may vary, but I believe the root cause is the same because the code is visibly wrong.
I would like to use this opportunity to point out that the xhci driver is currenty undergoing (much needed IMO) cleanups and refactors and this is not the first time when a naive, verbatim backport is attempted of a patch which works fine on upstream, but causes problems on earlier kernels. These things need special scrutiny, beyond just "CC:stable".
Regards, Michal
Hi,
On Sun, Jul 14, 2024 at 05:32:39PM +0200, Michał Pecio wrote:
This looks like bug 219039, please see if my suggested solution works.
The upstream commit is correct, because the call to inc_deq() has been moved outside handle_tx_event() so there is no longer this critical difference between doing 'goto cleanup' and 'return 0'. The intended change of this commit also makes sense to me.
This refactor is already present in v6.9 so I don't think the commit will have any effect besides fixing the isochronous bug which it is meant to fix.
But it is not present in v6.6 and v6.1, so they break/crash/hang/etc. Symptoms may vary, but I believe the root cause is the same because the code is visibly wrong.
I would like to use this opportunity to point out that the xhci driver is currenty undergoing (much needed IMO) cleanups and refactors and this is not the first time when a naive, verbatim backport is attempted of a patch which works fine on upstream, but causes problems on earlier kernels. These things need special scrutiny, beyond just "CC:stable".
For tracking I guess this should go as well to the regressions list?
#regzbot introduced: 948554f1bb16e15b90006c109c3a558c66d4c4ac #regzbot title: freezes on plugging USB connector due to 948554f1bb16 ("usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB") #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=219039
Thorsten I hope I got the most bits correctly, how would one inform regzbot about the regresssion for 6.1.98 and 6.6.39 but not happening in the upper versions?
Regards, Salvatore
On Sun, Jul 14, 2024 at 11:32 AM Michał Pecio michal.pecio@gmail.com wrote:
see if my suggested solution works.
On Sun, Jul 14, 2024 at 12:05 PM Salvatore Bonaccorso carnil@debian.org wrote:
Not on 6.1.y: drivers/usb/host/xhci-ring.c:2644:31: error: ~@~Xir~@~Y undeclared (first use in this fuunction); did you mean ~@~Xidr~@~Y? 2644 | inc_deq(xhci, ir->event_ring);
On Sun, Jul 14, 2024 at 2:30 AM Greg KH gregkh@linuxfoundation.org wrote:
Ick
We now have 3 USB HDD Vendors that are reported to crash the kernel because of this patch. I think it best we take time to sort out the long standing minor issue this patch is for, after we quickly revert the major issues it is causing. Can we get it removed from 6.6.y and older branches for the hopefully soon next release?
Not on 6.1.y:
drivers/usb/host/xhci-ring.c:2644:31: error: ~@~Xir~@~Y undeclared
(first use in this fuunction); did you mean ~@~Xidr~@~Y? 2644 | inc_deq(xhci, ir->event_ring);
Sorry, I didn't notice that v6.1 is different. You need this instead:
inc_deq(xhci, xhci->event_ring);
Another refactor interfering with backporting :(
On Sun, Jul 14, 2024 at 06:05:25PM +0200, Salvatore Bonaccorso wrote:
Hi,
On Sun, Jul 14, 2024 at 05:32:39PM +0200, Michał Pecio wrote:
This looks like bug 219039, please see if my suggested solution works.
The upstream commit is correct, because the call to inc_deq() has been moved outside handle_tx_event() so there is no longer this critical difference between doing 'goto cleanup' and 'return 0'. The intended change of this commit also makes sense to me.
This refactor is already present in v6.9 so I don't think the commit will have any effect besides fixing the isochronous bug which it is meant to fix.
But it is not present in v6.6 and v6.1, so they break/crash/hang/etc. Symptoms may vary, but I believe the root cause is the same because the code is visibly wrong.
I would like to use this opportunity to point out that the xhci driver is currenty undergoing (much needed IMO) cleanups and refactors and this is not the first time when a naive, verbatim backport is attempted of a patch which works fine on upstream, but causes problems on earlier kernels. These things need special scrutiny, beyond just "CC:stable".
For tracking I guess this should go as well to the regressions list?
#regzbot introduced: 948554f1bb16e15b90006c109c3a558c66d4c4ac #regzbot title: freezes on plugging USB connector due to 948554f1bb16 ("usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB") #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=219039
Thorsten I hope I got the most bits correctly, how would one inform regzbot about the regresssion for 6.1.98 and 6.6.39 but not happening in the upper versions?
I'll handle this and go release new kernels with just this reverted in it. Let my morning coffee kick in first...
thanks,
greg k-h
On Mon, Jul 15, 2024 at 07:45:07AM +0200, Greg KH wrote:
On Sun, Jul 14, 2024 at 06:05:25PM +0200, Salvatore Bonaccorso wrote:
Hi,
On Sun, Jul 14, 2024 at 05:32:39PM +0200, Michał Pecio wrote:
This looks like bug 219039, please see if my suggested solution works.
The upstream commit is correct, because the call to inc_deq() has been moved outside handle_tx_event() so there is no longer this critical difference between doing 'goto cleanup' and 'return 0'. The intended change of this commit also makes sense to me.
This refactor is already present in v6.9 so I don't think the commit will have any effect besides fixing the isochronous bug which it is meant to fix.
But it is not present in v6.6 and v6.1, so they break/crash/hang/etc. Symptoms may vary, but I believe the root cause is the same because the code is visibly wrong.
I would like to use this opportunity to point out that the xhci driver is currenty undergoing (much needed IMO) cleanups and refactors and this is not the first time when a naive, verbatim backport is attempted of a patch which works fine on upstream, but causes problems on earlier kernels. These things need special scrutiny, beyond just "CC:stable".
For tracking I guess this should go as well to the regressions list?
#regzbot introduced: 948554f1bb16e15b90006c109c3a558c66d4c4ac #regzbot title: freezes on plugging USB connector due to 948554f1bb16 ("usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB") #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=219039
Thorsten I hope I got the most bits correctly, how would one inform regzbot about the regresssion for 6.1.98 and 6.6.39 but not happening in the upper versions?
I'll handle this and go release new kernels with just this reverted in it. Let my morning coffee kick in first...
Should all now be fixed in the 6.6.40 and 6.1.99 releases. If not, please let me know.
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org