The commit in the Fixes tag breaks my laptop (found by git bisect). My home RJ45 LAN cable cannot connect after that commit.
The call to netif_carrier_on() should be done when netif_carrier_ok() is false. Not when it's true. Because calling netif_carrier_on() when __LINK_STATE_NOCARRIER is not set actually does nothing.
Cc: Armando Budianto sprite@gnuweeb.org
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/netdev/0752dee6-43d6-4e1f-81d2-4248142cccd2@gnuweeb....
Fixes: 0d9cfc9b8cb1 ("net: usbnet: Avoid potential RCU stall on LINK_CHANGE event")
Signed-off-by: Ammar Faizi ammarfaizi2@gnuweeb.org
---
v2:
  - Rebase on top of the latest netdev/net tree. The previous patch was
    based on 0d9cfc9b8cb1. Line numbers have changed since then.

 drivers/net/usb/usbnet.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index a38ffbf4b3f0..a1827684b92c 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1114,31 +1114,31 @@ static const struct ethtool_ops usbnet_ethtool_ops = {
 };

 /*-------------------------------------------------------------------------*/

 static void __handle_link_change(struct usbnet *dev)
 {
 	if (!test_bit(EVENT_DEV_OPEN, &dev->flags))
 		return;

 	if (!netif_carrier_ok(dev->net)) {
+		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
+			netif_carrier_on(dev->net);
+
 		/* kill URBs for reading packets to save bus bandwidth */
 		unlink_urbs(dev, &dev->rxq);

 		/*
 		 * tx_timeout will unlink URBs for sending packets and
 		 * tx queue is stopped by netcore after link becomes off
 		 */
 	} else {
-		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
-			netif_carrier_on(dev->net);
-
 		/* submitting URBs for reading packets */
 		queue_work(system_bh_wq, &dev->bh_work);
 	}

 	/* hard_mtu or rx_urb_size may change during link change */
 	usbnet_update_max_qlen(dev);

 	clear_bit(EVENT_LINK_CHANGE, &dev->flags);
 }
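The reasoning in the commit message can be checked with a small userspace model. This is only a sketch: the struct, the helper names, and the counters are hypothetical stand-ins, not kernel code. It assumes, as the commit message states, that netif_carrier_on() does nothing unless the no-carrier flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state touched by __handle_link_change(). */
struct model_dev {
	bool nocarrier;        /* stands in for __LINK_STATE_NOCARRIER */
	bool carrier_on_event; /* stands in for EVENT_LINK_CARRIER_ON */
	int rx_unlinks;        /* times the unlink_urbs() path ran */
	int rx_submits;        /* times the RX work was queued */
};

static bool model_carrier_ok(const struct model_dev *dev)
{
	assert(dev != NULL);
	return !dev->nocarrier;
}

/* Models netif_carrier_on(): a no-op unless the no-carrier flag is set. */
static void model_carrier_on(struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->nocarrier)
		dev->nocarrier = false;
}

/* Models test_and_clear_bit() for a single flag. */
static bool model_test_and_clear(bool *flag)
{
	bool old = *flag;

	*flag = false;
	return old;
}

/* Ordering removed by this patch: carrier-on sits in the "already on" branch. */
static void link_change_before_fix(struct model_dev *dev)
{
	if (!model_carrier_ok(dev)) {
		dev->rx_unlinks++;
	} else {
		if (model_test_and_clear(&dev->carrier_on_event))
			model_carrier_on(dev);
		dev->rx_submits++;
	}
}

/* Ordering added by this patch (v2): carrier-on sits in the "off" branch. */
static void link_change_v2(struct model_dev *dev)
{
	if (!model_carrier_ok(dev)) {
		if (model_test_and_clear(&dev->carrier_on_event))
			model_carrier_on(dev);
		dev->rx_unlinks++;
	} else {
		dev->rx_submits++;
	}
}
```

Starting from "carrier off, carrier-on event pending", the old ordering never reaches model_carrier_on(), so the link stays down and the event stays pending. The v2 ordering brings the link up, although it still runs the unlink step in the same pass.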
+ John Ernberg
On Sat, Aug 02, 2025 at 02:03:10AM +0700, Ammar Faizi wrote:
> The commit in the Fixes tag breaks my laptop (found by git bisect). My home RJ45 LAN cable cannot connect after that commit.
>
> The call to netif_carrier_on() should be done when netif_carrier_ok() is false. Not when it's true. Because calling netif_carrier_on() when __LINK_STATE_NOCARRIER is not set actually does nothing.
>
> Cc: Armando Budianto sprite@gnuweeb.org
> Cc: stable@vger.kernel.org
> Closes: https://lore.kernel.org/netdev/0752dee6-43d6-4e1f-81d2-4248142cccd2@gnuweeb....
> Fixes: 0d9cfc9b8cb1 ("net: usbnet: Avoid potential RCU stall on LINK_CHANGE event")
> Signed-off-by: Ammar Faizi ammarfaizi2@gnuweeb.org
>
> v2:
>   - Rebase on top of the latest netdev/net tree. The previous patch was based on 0d9cfc9b8cb1. Line numbers have changed since then.
>
>  drivers/net/usb/usbnet.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
+ Linus
On Mon, Aug 04, 2025 at 11:00:50AM +0100, Simon Horman wrote:
> + John Ernberg
>
> On Sat, Aug 02, 2025 at 02:03:10AM +0700, Ammar Faizi wrote:
> > The commit in the Fixes tag breaks my laptop (found by git bisect). My home RJ45 LAN cable cannot connect after that commit.
> >
> > The call to netif_carrier_on() should be done when netif_carrier_ok() is false. Not when it's true. Because calling netif_carrier_on() when __LINK_STATE_NOCARRIER is not set actually does nothing.
> >
> > Cc: Armando Budianto sprite@gnuweeb.org
> > Cc: stable@vger.kernel.org
> > Closes: https://lore.kernel.org/netdev/0752dee6-43d6-4e1f-81d2-4248142cccd2@gnuweeb....
> > Fixes: 0d9cfc9b8cb1 ("net: usbnet: Avoid potential RCU stall on LINK_CHANGE event")
> > Signed-off-by: Ammar Faizi ammarfaizi2@gnuweeb.org
> >
> > v2:
> >   - Rebase on top of the latest netdev/net tree. The previous patch was based on 0d9cfc9b8cb1. Line numbers have changed since then.
> >
> >  drivers/net/usb/usbnet.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
It seems this has escalated a bit as it broke things for Linus while he was travelling. He tested this patch and it resolved the problem. Which I think counts for something.
https://lore.kernel.org/netdev/CAHk-=wgkvNuGCDUMMs9bW9Mz5o=LcMhcDK_b2ThO6_T7...
I have looked over the patch and it appears to me that it addresses a straightforward logic error: a check was added to turn the carrier on only if it is already on. Which seems a bit nonsensical. And presumably the intention was to add the check for the opposite case.
This patch addresses that problem.
So let me try and nudge this on a bit by providing a tag.
Reviewed-by: Simon Horman horms@kernel.org
On Tue, Aug 05, 2025 at 09:28:48PM +0100, Simon Horman wrote:
> It seems this has escalated a bit as it broke things for Linus while he was travelling. He tested this patch and it resolved the problem. Which I think counts for something.
>
> https://lore.kernel.org/netdev/CAHk-=wgkvNuGCDUMMs9bW9Mz5o=LcMhcDK_b2ThO6_T7...
>
> I have looked over the patch and it appears to me that it addresses a straightforward logic error: a check was added to turn the carrier on only if it is already on. Which seems a bit nonsensical. And presumably the intention was to add the check for the opposite case.
>
> This patch addresses that problem.
>
> So let me try and nudge this on a bit by providing a tag.
>
> Reviewed-by: Simon Horman horms@kernel.org
Hi Linus,
Given that Reviewed-by tag and the simplicity of the patch, it would be great if you could take this patch into your tree soon. The fix is critical for network connectivity, especially for laptop users.
https://lore.kernel.org/all/20250801190310.58443-1-ammarfaizi2@gnuweeb.org/
On Tue, 5 Aug 2025 at 23:28, Simon Horman horms@kernel.org wrote:
> I have looked over the patch and it appears to me that it addresses a straightforward logic error: a check was added to turn the carrier on only if it is already on. Which seems a bit nonsensical. And presumably the intention was to add the check for the opposite case.
>
> This patch addresses that problem.
So I agree that there was a logic error.
I'm not 100% sure about the "straightforward" part.
In particular, the whole *rest* of the code in that

	if (!netif_carrier_ok(dev->net)) {

no longer makes sense after we've turned the link on with that

		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
			netif_carrier_on(dev->net);

sequence.

Put another way - once we've turned the carrier on, now that whole

		/* kill URBs for reading packets to save bus bandwidth */
		unlink_urbs(dev, &dev->rxq);

		/*
		 * tx_timeout will unlink URBs for sending packets and
		 * tx queue is stopped by netcore after link becomes off
		 */

thing makes no sense.

So my gut feel is that the

		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
			netif_carrier_on(dev->net);

should actually be done outside that if-statement entirely, because it literally ends up changing the thing that if-statement is testing.
And no, I didn't actually test that version, because I was hoping that somebody who actually knows this code better would pipe up.
Linus
On Wed, 6 Aug 2025 01:40:37 +0300 Linus Torvalds wrote:
> So my gut feel is that the
>
>		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
>			netif_carrier_on(dev->net);
>
> should actually be done outside that if-statement entirely, because it literally ends up changing the thing that if-statement is testing.
Right. I think it should be before the if (!netif_carrier_ok(dev->net))
Ammar, could you retest and repost that, since we haven't heard from John?
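The ordering suggested above can be sketched with a small userspace model. The struct and function names are hypothetical stand-ins rather than kernel code, and the sketch assumes netif_carrier_on() simply clears the no-carrier state. It consumes a pending carrier-on event first and only then tests the carrier state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state touched by __handle_link_change(). */
struct link_model {
	bool nocarrier;        /* stands in for __LINK_STATE_NOCARRIER */
	bool carrier_on_event; /* stands in for EVENT_LINK_CARRIER_ON */
	int rx_unlinks;        /* times the unlink_urbs() path ran */
	int rx_submits;        /* times the RX work was queued */
};

/* Suggested ordering: handle the carrier-on event before the carrier test. */
static void link_change_reordered(struct link_model *dev)
{
	assert(dev != NULL);

	/* Models test_and_clear_bit() followed by netif_carrier_on(). */
	if (dev->carrier_on_event) {
		dev->carrier_on_event = false;
		dev->nocarrier = false;
	}

	if (dev->nocarrier)
		dev->rx_unlinks++;  /* link is down: stop reading packets */
	else
		dev->rx_submits++;  /* link is up: queue RX work */
}
```

In this model a link that is coming up goes straight to the RX submit path and never runs the unlink step, while a link that is really down (no pending event) still takes the unlink path.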
On Tue, Aug 05, 2025 at 04:47:47PM -0700, Jakub Kicinski wrote:
> On Wed, 6 Aug 2025 01:40:37 +0300 Linus Torvalds wrote:
> > So my gut feel is that the
> >
> >		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
> >			netif_carrier_on(dev->net);
> >
> > should actually be done outside that if-statement entirely, because it literally ends up changing the thing that if-statement is testing.
>
> Right. I think it should be before the if (!netif_carrier_ok(dev->net))
>
> Ammar, could you retest and repost that, since we haven't heard from John?
OK, I'll send a v3 shortly.
Hi Jakub, Linus, Ammar,
(sorry for the delay, on vacation, wasn't paying attention to the internet)
On Tue, Aug 05, 2025 at 04:47:47PM -0700, Jakub Kicinski wrote:
> On Wed, 6 Aug 2025 01:40:37 +0300 Linus Torvalds wrote:
> > So my gut feel is that the
> >
> >		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
> >			netif_carrier_on(dev->net);
> >
> > should actually be done outside that if-statement entirely, because it literally ends up changing the thing that if-statement is testing.
>
> Right. I think it should be before the if (!netif_carrier_ok(dev->net))
>
> Ammar, could you retest and repost that, since we haven't heard from John?
I can't verify the suggested change until sometime in September, after I return to office, but it feels correct.
However... I'm almost inclined to suggest a full revert of my patch as the testing was clearly royally botched. Booting it on the boards I have would have shown the failure immediately.
(I did see v3 of this patch being applied)
Apologies for the mess // John Ernberg
On Wed, Aug 06, 2025 at 01:40:37AM +0300, Linus Torvalds wrote:
> In particular, the whole *rest* of the code in that
>
>	if (!netif_carrier_ok(dev->net)) {
>
> no longer makes sense after we've turned the link on with that
>
>		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
>			netif_carrier_on(dev->net);
>
> sequence.
>
> Put another way - once we've turned the carrier on, now that whole
>
>		/* kill URBs for reading packets to save bus bandwidth */
>		unlink_urbs(dev, &dev->rxq);
>
>		/*
>		 * tx_timeout will unlink URBs for sending packets and
>		 * tx queue is stopped by netcore after link becomes off
>		 */
>
> thing makes no sense.
After taking a look further, I agree with you. I git-blamed the unlink_urbs()'s line and it's indeed expected to be called after link becomes off. So yes, it makes no sense to call that when we're turning the link on.
  commit 4b49f58fff00e6e9b24eaa31d4c6324393d76b0a
  Author: Ming Lei ming.lei@canonical.com
  Date:   Thu Apr 11 04:40:40 2013 +0000

      usbnet: handle link change

      The link change is detected via the interrupt pipe, and bulk
      pipes are responsible for transfering packets, so it is reasonable
      to stop bulk transfer after link is reported as off.

Even though my patch works on my machine, something may go wrong.
> So my gut feel is that the
>
>		if (test_and_clear_bit(EVENT_LINK_CARRIER_ON, &dev->flags))
>			netif_carrier_on(dev->net);
>
> should actually be done outside that if-statement entirely, because it literally ends up changing the thing that if-statement is testing.
Apart from moving it outside that if-statement, unlink_urbs() call should probably also be guarded as we agreed it makes no sense to call it when we're turning the link on.
On Wed, Aug 06, 2025 at 06:56:20AM +0700, Ammar Faizi wrote:
> Apart from moving it outside that if-statement, unlink_urbs() call should probably also be guarded as we agreed it makes no sense to call it when we're turning the link on.
Oh, no.
I just realized, it doesn't need to be guarded, because if netif_carrier_on() is placed before the if (!netif_carrier_ok(dev->net)), it already clears __LINK_STATE_NOCARRIER, so the branch that calls unlink_urbs() is not taken.
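The point about the guard can be checked by brute force over a tiny userspace model. The names and the extra turned_on guard are hypothetical, and the model assumes netif_carrier_on() clears the no-carrier state. It compares the reordered handler with and without an explicit guard on the unlink step, for every starting state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: only the state relevant to the unlink decision. */
struct guard_model {
	bool nocarrier;        /* stands in for __LINK_STATE_NOCARRIER */
	bool carrier_on_event; /* stands in for EVENT_LINK_CARRIER_ON */
	int rx_unlinks;        /* times the unlink_urbs() path ran */
};

/* Reordered handler with no extra guard around the unlink step. */
static void reordered_plain(struct guard_model *dev)
{
	assert(dev != NULL);
	if (dev->carrier_on_event) {
		dev->carrier_on_event = false;
		dev->nocarrier = false;
	}
	if (dev->nocarrier)
		dev->rx_unlinks++;
}

/* Same handler with a hypothetical explicit "did we just turn on?" guard. */
static void reordered_guarded(struct guard_model *dev)
{
	bool turned_on = false;

	assert(dev != NULL);
	if (dev->carrier_on_event) {
		dev->carrier_on_event = false;
		dev->nocarrier = false;
		turned_on = true;
	}
	if (dev->nocarrier && !turned_on)
		dev->rx_unlinks++;
}

/* Returns true if both variants end in the same state for all four inputs. */
static bool guard_is_redundant(void)
{
	for (int nc = 0; nc <= 1; nc++) {
		for (int ev = 0; ev <= 1; ev++) {
			struct guard_model a = { .nocarrier = nc, .carrier_on_event = ev };
			struct guard_model b = a;

			reordered_plain(&a);
			reordered_guarded(&b);
			if (a.rx_unlinks != b.rx_unlinks || a.nocarrier != b.nocarrier)
				return false;
		}
	}
	return true;
}
```

Within this model the two variants never diverge: whenever the event fires, the no-carrier state is already cleared by the time it is tested, so the guard has nothing left to prevent.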
On Wed, 6 Aug 2025 at 01:40, Linus Torvalds torvalds@linux-foundation.org wrote:
> And no, I didn't actually test that version, because I was hoping that somebody who actually knows this code better would pipe up.
Bah. Since I'm obviously horribly jetlagged, I decided to just test to make sure I understand the code.
And yeah, the attached patch also fixes the problem for me and makes more sense to me.
But again, it would be good to get comments from people who *actually* know the code.
Linus
On Wed, 6 Aug 2025 at 04:11, Linus Torvalds torvalds@linux-foundation.org wrote:
> And yeah, the attached patch also fixes the problem for me and makes more sense to me.
Ok, crossed emails because I was reading things in odd orders and going back to bed trying to get over jetlag.
Anyway, I've applied Ammar's v3 that ended up the same patch that I also tested,
Linus
On Wed, Aug 06, 2025 at 04:54:36AM +0300, Linus Torvalds wrote:
> Anyway, I've applied Ammar's v3 that ended up the same patch that I also tested,
Yesterday, I synced with your tree, but couldn't boot. Crashed with this call trace:
https://gist.githubusercontent.com/ammarfaizi2/3ba41f13517be4bae70cde869347d...
This morning, I synced with your tree again, still the same result.
I'll try to bisect it and report to the appropriate subsystem once I get the first bad commit. I suspect it's related to pci or nvme (based on that call trace).
linux-stable-mirror@lists.linaro.org