This reverts commit ed194d1367698a0872a2b75bbe06b3932ce9df3a.
In contrast to the original patch description, apparently not all handlers were audited properly. E.g. my RT5370 based USB WIFI adapter (driver in drivers/net/wireless/ralink/rt2x00) hangs after a while under heavy load. This revert fixes this.
Also revert the follow-up patch d6142b91e9cc249b3aa22c90fade67e2e2d52cdb ("usb: core: remove flags variable in __usb_hcd_giveback_urb()"), since now we need the flags variable again.
Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de Cc: Alan Stern stern@rowland.harvard.edu Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: linux-usb@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # 4.20+ Signed-off-by: Soeren Moch smoch@web.de --- drivers/usb/core/hcd.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c index 94d22551fc1b..08d25fcf8b8e 100644 --- a/drivers/usb/core/hcd.c +++ b/drivers/usb/core/hcd.c @@ -1739,6 +1739,7 @@ static void __usb_hcd_giveback_urb(struct urb *urb) struct usb_hcd *hcd = bus_to_hcd(urb->dev->bus); struct usb_anchor *anchor = urb->anchor; int status = urb->unlinked; + unsigned long flags;
urb->hcpriv = NULL; if (unlikely((urb->transfer_flags & URB_SHORT_NOT_OK) && @@ -1755,7 +1756,20 @@ static void __usb_hcd_giveback_urb(struct urb *urb)
/* pass ownership to the completion handler */ urb->status = status; + + /* + * We disable local IRQs here avoid possible deadlock because + * drivers may call spin_lock() to hold lock which might be + * acquired in one hard interrupt handler. + * + * The local_irq_save()/local_irq_restore() around complete() + * will be removed if current USB drivers have been cleaned up + * and no one may trigger the above deadlock situation when + * running complete() in tasklet. + */ + local_irq_save(flags); urb->complete(urb); + local_irq_restore(flags);
usb_anchor_resume_wakeups(anchor); atomic_dec(&urb->use_count); -- 2.17.1
On Fri, May 31, 2019 at 11:53:40PM +0200, Soeren Moch wrote:
This reverts commit ed194d1367698a0872a2b75bbe06b3932ce9df3a.
In contrast to the original patch description, apparently not all handlers were audited properly. E.g. my RT5370 based USB WIFI adapter (driver in drivers/net/wireless/ralink/rt2x00) hangs after a while under heavy load. This revert fixes this.
Why not just fix that driver? Wouldn't that be easier?
thanks,
greg k-h
On 01.06.19 00:05, Greg Kroah-Hartman wrote:
On Fri, May 31, 2019 at 11:53:40PM +0200, Soeren Moch wrote:
This reverts commit ed194d1367698a0872a2b75bbe06b3932ce9df3a.
In contrast to the original patch description, apparently not all handlers were audited properly. E.g. my RT5370 based USB WIFI adapter (driver in drivers/net/wireless/ralink/rt2x00) hangs after a while under heavy load. This revert fixes this.
Why not just fix that driver? Wouldn't that be easier?
I suspect there are more drivers to fix. I only tested WIFI sticks so far, RTL8188 drivers also seem to suffer from this. I'm not sure how to fix all this properly, maybe Sebastian as original patch author can help here. This patch is mostly for -stable, to get an acceptable solution quickly. It was really annoying to get such unstable WIFI connection over the last three kernel releases to my development board. Since my internet service provider forcefully updated my router box 3 weeks ago, I unfortunately see the same symptoms on my primary internet access. That's even worse, I need to reset this router box every few days. I'm not sure, however, that this is caused by the same problem, but it feels like this. So can we please fix this regression quickly and workout a proper fix later? In the original patch there is no reason given, why this patch is necessary. With this revert I at least see a stable connection.
Thanks, Soeren
On 2019-06-01 01:02:37 [+0200], Soeren Moch wrote:
Why not just fix that driver? Wouldn't that be easier?
I suspect there are more drivers to fix. I only tested WIFI sticks so far, RTL8188 drivers also seem to suffer from this. I'm not sure how to fix all this properly, maybe Sebastian as original patch author can help here.
Suspecting isn't helping here.
This patch is mostly for -stable, to get an acceptable solution quickly. It was really annoying to get such unstable WIFI connection over the last three kernel releases to my development board. Since my internet service provider forcefully updated my router box 3 weeks ago, I unfortunately see the same symptoms on my primary internet access. That's even worse, I need to reset this router box every few days. I'm not sure, however, that this is caused by the same problem, but it feels like this. So can we please fix this regression quickly and workout a proper fix later? In the original patch there is no reason given, why this patch is necessary. With this revert I at least see a stable connection.
I will look into this. This patch got in in v4.20-rc1 and the final kernel was released by the end of 2018. This is the first report I am aware of over half year later…
Thanks, Soeren
Sebastian
On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
I will look into this.
nothing obvious. If there is really blocken lock, could you please enable lockdep |CONFIG_LOCK_DEBUGGING_SUPPORT=y |CONFIG_PROVE_LOCKING=y |# CONFIG_LOCK_STAT is not set |CONFIG_DEBUG_RT_MUTEXES=y |CONFIG_DEBUG_SPINLOCK=y |CONFIG_DEBUG_MUTEXES=y |CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y |CONFIG_DEBUG_RWSEMS=y |CONFIG_DEBUG_LOCK_ALLOC=y |CONFIG_LOCKDEP=y |# CONFIG_DEBUG_LOCKDEP is not set |CONFIG_DEBUG_ATOMIC_SLEEP=y
and send me the splat that lockdep will report?
Thanks, Soeren
Sebastian
On 01.06.19 13:02, Sebastian Andrzej Siewior wrote:
On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
I will look into this.
nothing obvious. If there is really blocken lock, could you please enable lockdep |CONFIG_LOCK_DEBUGGING_SUPPORT=y |CONFIG_PROVE_LOCKING=y |# CONFIG_LOCK_STAT is not set |CONFIG_DEBUG_RT_MUTEXES=y |CONFIG_DEBUG_SPINLOCK=y |CONFIG_DEBUG_MUTEXES=y |CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y |CONFIG_DEBUG_RWSEMS=y |CONFIG_DEBUG_LOCK_ALLOC=y |CONFIG_LOCKDEP=y |# CONFIG_DEBUG_LOCKDEP is not set |CONFIG_DEBUG_ATOMIC_SLEEP=y
and send me the splat that lockdep will report?
I will do so. I cannot promise, however, that this will happen within the next few days.
Thanks for your suggestions, Soeren
On 01.06.19 13:02, Sebastian Andrzej Siewior wrote:
On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
I will look into this.
nothing obvious. If there is really blocken lock, could you please enable lockdep |CONFIG_LOCK_DEBUGGING_SUPPORT=y |CONFIG_PROVE_LOCKING=y |# CONFIG_LOCK_STAT is not set |CONFIG_DEBUG_RT_MUTEXES=y |CONFIG_DEBUG_SPINLOCK=y |CONFIG_DEBUG_MUTEXES=y |CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y |CONFIG_DEBUG_RWSEMS=y |CONFIG_DEBUG_LOCK_ALLOC=y |CONFIG_LOCKDEP=y |# CONFIG_DEBUG_LOCKDEP is not set |CONFIG_DEBUG_ATOMIC_SLEEP=y
and send me the splat that lockdep will report?
Nothing interesting:
[ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 5.1.0 (root@matrix) (gcc version 7.4.0 (Debian 7.4.0-6)) #6 SMP PREEMPT Wed Jun 12 11:28:41 CEST 2019 [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache [ 0.000000] OF: fdt: Machine model: TBS2910 Matrix ARM mini PC ... [ 0.000000] rcu: Preemptible hierarchical RCU implementation. [ 0.000000] rcu: RCU lockdep checking is enabled. ... [ 0.003546] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar [ 0.003657] ... MAX_LOCKDEP_SUBCLASSES: 8 [ 0.003713] ... MAX_LOCK_DEPTH: 48 [ 0.003767] ... MAX_LOCKDEP_KEYS: 8191 [ 0.003821] ... CLASSHASH_SIZE: 4096 [ 0.003876] ... MAX_LOCKDEP_ENTRIES: 32768 [ 0.003931] ... MAX_LOCKDEP_CHAINS: 65536 [ 0.003986] ... CHAINHASH_SIZE: 32768 [ 0.004042] memory used by lock dependency info: 5243 kB
Nothing else.
When stopping hostapd after it hangs: [ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush
Soeren
On Wed, 12 Jun 2019, Soeren Moch wrote:
On 01.06.19 13:02, Sebastian Andrzej Siewior wrote:
On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
I will look into this.
nothing obvious. If there is really blocken lock, could you please enable lockdep |CONFIG_LOCK_DEBUGGING_SUPPORT=y |CONFIG_PROVE_LOCKING=y |# CONFIG_LOCK_STAT is not set |CONFIG_DEBUG_RT_MUTEXES=y |CONFIG_DEBUG_SPINLOCK=y |CONFIG_DEBUG_MUTEXES=y |CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y |CONFIG_DEBUG_RWSEMS=y |CONFIG_DEBUG_LOCK_ALLOC=y |CONFIG_LOCKDEP=y |# CONFIG_DEBUG_LOCKDEP is not set |CONFIG_DEBUG_ATOMIC_SLEEP=y
and send me the splat that lockdep will report?
Nothing interesting:
[ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 5.1.0 (root@matrix) (gcc version 7.4.0 (Debian 7.4.0-6)) #6 SMP PREEMPT Wed Jun 12 11:28:41 CEST 2019 [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache [ 0.000000] OF: fdt: Machine model: TBS2910 Matrix ARM mini PC ... [ 0.000000] rcu: Preemptible hierarchical RCU implementation. [ 0.000000] rcu: RCU lockdep checking is enabled. ... [ 0.003546] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar [ 0.003657] ... MAX_LOCKDEP_SUBCLASSES: 8 [ 0.003713] ... MAX_LOCK_DEPTH: 48 [ 0.003767] ... MAX_LOCKDEP_KEYS: 8191 [ 0.003821] ... CLASSHASH_SIZE: 4096 [ 0.003876] ... MAX_LOCKDEP_ENTRIES: 32768 [ 0.003931] ... MAX_LOCKDEP_CHAINS: 65536 [ 0.003986] ... CHAINHASH_SIZE: 32768 [ 0.004042] memory used by lock dependency info: 5243 kB
Nothing else.
When stopping hostapd after it hangs: [ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush
Instead of reverting the original commit, can you prevent the problem by adding local_irq_save() and local_irq_restore() to the URB completion routines in that wireless driver?
Probably people who aren't already pretty familiar with the driver code won't easily be able to locate the race. Still, a little overkill may be an acceptable solution.
Alan Stern
On 2019-06-12 10:38:11 [-0400], Alan Stern wrote:
When stopping hostapd after it hangs: [ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush
Instead of reverting the original commit, can you prevent the problem by adding local_irq_save() and local_irq_restore() to the URB completion routines in that wireless driver?
Probably people who aren't already pretty familiar with the driver code won't easily be able to locate the race. Still, a little overkill may be an acceptable solution.
Not from RT point of view because you do local_irq_save() -> spin_lock()
but it might be worth checking if the local_irq_save() within that wireless driver avoids the race or not.
Alan Stern
Sebastian
On 12.06.19 16:47, Sebastian Andrzej Siewior wrote:
On 2019-06-12 10:38:11 [-0400], Alan Stern wrote:
When stopping hostapd after it hangs: [ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush
Instead of reverting the original commit, can you prevent the problem by adding local_irq_save() and local_irq_restore() to the URB completion routines in that wireless driver?
Probably people who aren't already pretty familiar with the driver code won't easily be able to locate the race. Still, a little overkill may be an acceptable solution.
Not from RT point of view because you do local_irq_save() -> spin_lock()
but it might be worth checking if the local_irq_save() within that wireless driver avoids the race or not.
I just sent a patch for the underlying problem in the rt2x00 driver, see [1]. So we can drop this patch here and keep RT folks happy. I really hope that no other usb drivers are affected by similar problems.
Zyxel support just sent me (some?) source code for the firmware of my internet router (my real problem I mentioned before). Need to look into this...
Thanks for your help, Soeren
[1] https://lkml.org/lkml/2019/6/17/238
Alan Stern
Sebastian
On 01.06.19 12:50, Sebastian Andrzej Siewior wrote:
On 2019-06-01 01:02:37 [+0200], Soeren Moch wrote:
Why not just fix that driver? Wouldn't that be easier?
I suspect there are more drivers to fix. I only tested WIFI sticks so far, RTL8188 drivers also seem to suffer from this. I'm not sure how to fix all this properly, maybe Sebastian as original patch author can help here.
Suspecting isn't helping here.
Yes, you are right. When I encountered this problem half a year ago, I tried some other type of wifi stick and immediately ran into trouble with this, too.
Now I did a short test with a RTL8188CUS stick, I could not reproduce this bug with this stick so far.
This patch is mostly for -stable, to get an acceptable solution quickly. It was really annoying to get such unstable WIFI connection over the last three kernel releases to my development board. Since my internet service provider forcefully updated my router box 3 weeks ago, I unfortunately see the same symptoms on my primary internet access. That's even worse, I need to reset this router box every few days. I'm not sure, however, that this is caused by the same problem, but it feels like this. So can we please fix this regression quickly and workout a proper fix later? In the original patch there is no reason given, why this patch is necessary. With this revert I at least see a stable connection.
I will look into this. This patch got in in v4.20-rc1 and the final kernel was released by the end of 2018. This is the first report I am aware of over half year later…
It is not easy to reproduce this bug reliably within a reasonable period of time. It took days to bisect this. And usb core code was for sure not my first candidate to look at.
Reverting this patch solves the problem, but of course disabling interrupts also can hide a bug elsewhere.
Soeren
linux-stable-mirror@lists.linaro.org