On Mon, 26 Mar 2018, Jonathan Liu wrote:
I am experiencing a USB function call hang from userspace with OCHI (full speed USB device) after updating from Linux 4.14.15 to 4.14.24 and noticed this commit.
Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w and amended with addr2line): [<c05cbd64>] (__schedule) from [<c05cc148>] (schedule+0x50/0xb4) kernel/sched/core.c:2792 [<c05cc148>] (schedule) from [<c042789c>] (usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59 [<c042789c>] (usb_kill_urb.part.3) from [<c0433a3c>] (usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690 [<c0433a3c>] (usbdev_ioctl) from [<c0217db8>] (do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835 [<c0217db8>] (do_vfs_ioctl) from [<c021863c>] (SyS_ioctl+0x34/0x5c) fs/ioctl.c:47 [<c021863c>] (SyS_ioctl) from [<c0107820>] (ret_fast_syscall+0x0/0x54) include/linux/file.h:39
Afterwards the kernel is unresponsive to disconnect/connect of the full speed USB device but I can connect/disconnect a high speed USB device to the same port and communicate to it without problem since it uses EHCI (OHCI is companion controller). If I try to connect the full speed USB device again it is still unresponsive. The userspace application is still hanging after all this.
Could this commit be causing the issue?
It sounds like the URB gets cancelled but is never given back. But although the commit you mentioned does affect this logic, I don't see the problem could arise.
There does seem to be something wrong with this commit. I did some more testing with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime. I connected a sun7i-a20-pcduino3b running Linux 4.15.12 that has OTG port with gadget zero kernel module (g_zero) loaded to the USB port and did some testing with usbtest from tools/usb/testusb.c in kernel repository.
I had to do the following change to avoid getting flooded with "usbtest 1-1:3.0: unlink retry" kernel messages after test 11 of testusb starts: --- a/drivers/usb/misc/usbtest.c +++ b/drivers/usb/misc/usbtest.c @@ -1473,7 +1473,7 @@ static int unlink1(struct usbtest_dev *dev, int pipe, int size, int async) * resubmission, but since we're testing unlink * paths, we can't. */
ERROR(dev, "unlink retry\n");
/*ERROR(dev, "unlink retry\n");*/
That's a bizarre message in any case. The condition under which it gets printed is _not_ an error, just a race. I would expect it to show up every now and then just based on random chance. But if it gets printed over and over again then something is definitely wrong.
If I revert the commit, I can run "./testusb -a" to completion without any issues with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime. I don't get a kernel stall detected when testing on x86_64 but I do still get at least one "unlink retry" error message from kernel in dmesg which can be reproduced using "./testusb -a" or more frequently with "./testusb -a -t 11 -c 5000".
A few of those "unlink retry" messages during a test with 5000 iterations does not mean anything is wrong; don't worry about it.
It would be great if you could debug this farther. I'm not sure what to suggest, however. Clearly we need to know what's going on in ohci-q.c's finish_unlinks() routine; unfortunately this routine probably gets called thousands of times before the problem shows up.
Anyway, here's a patch you can try out on the sun7i-a20-olinuxino-lime. I have no idea whether it will help pinpoint the source of the problem.
Alan Stern
Index: usb-4.x/drivers/usb/host/ohci-hcd.c =================================================================== --- usb-4.x.orig/drivers/usb/host/ohci-hcd.c +++ usb-4.x/drivers/usb/host/ohci-hcd.c @@ -327,14 +327,21 @@ static int ohci_urb_dequeue(struct usb_h * some upcoming INTR_SF to call finish_unlinks() */ urb_priv = urb->hcpriv; - if (urb_priv->ed->state == ED_OPER) + if (urb_priv->ed->state == ED_OPER) { + ohci_info(ohci, "start dequeue URB %p ed %p\n", urb, + urb_priv->ed); start_ed_unlink(ohci, urb_priv->ed); - + } else + ohci_info(ohci, "idle dequeue URB %p ed %p\n", urb, + urb_priv->ed); if (ohci->rh_state != OHCI_RH_RUNNING) { /* With HC dead, we can clean up right away */ ohci_work(ohci); } } + else + ohci_info(ohci, "failed dequeue URB %p\n", urb); + spin_unlock_irqrestore (&ohci->lock, flags); return rc; } Index: usb-4.x/drivers/usb/host/ohci-q.c =================================================================== --- usb-4.x.orig/drivers/usb/host/ohci-q.c +++ usb-4.x/drivers/usb/host/ohci-q.c @@ -1022,6 +1022,7 @@ skip_ed: *last = ed->ed_next; ed->ed_next = NULL; modified = 0; + ohci_info(ohci, "scanning unlinked ed %p\n", ed);
/* unlink urbs as requested, but rescan the list after * we call a completion since it might have unlinked @@ -1072,6 +1073,8 @@ rescan_this: /* if URB is done, clean up */ if (urb_priv->td_cnt >= urb_priv->length) { modified = completed = 1; + ohci_info(ohci, "finish dequeue URB %p ed %p\n", + urb, ed); finish_urb(ohci, urb, 0); } }