Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000 CAN controller reception: the Rx buffer is only 5 messages long, so when the bus loaded (eg. a message every 50us), overrun may easily happen. Upon an overrun situation, due to a possible internal crosstalk situation, the controller enters a frozen state which only can be unlocked with a soft reset (experimentally). The solution was to offload a call to sja1000_start() in a threaded handler. This needs to happen in process context as this operation requires to sleep. sja1000_start() basically enters "reset mode", performs a proper software reset and returns back into "normal mode".
Since this fix was introduced, we no longer observe any stalls in reception. However it was sporadically observed that the transmit path would now freeze. Further investigation blamed the fix mentioned above, and especially the reset operation. Reproducing the reset in a loop helped identifying what could possibly go wrong. The sja1000 is a single Tx queue device, which leverages the netdev helpers to process one Tx message at a time. The logic is: the queue is stopped, the message sent to the transceiver, once properly transmitted the controller sets a status bit which triggers an interrupt, in the interrupt handler the transmission status is checked and the queue woken up. Unfortunately, if an overrun happens, we might perform the soft reset precisely between the transmission of the buffer to the transceiver and the advent of the transmission status bit. We would then stop the transmission operation without re-enabling the queue, leading to all further transmissions to be ignored.
The reset interrupt can only happen while the device is "open", and after a reset we anyway want to resume normal operations, no matter if a packet to transmit got dropped in the process, so we shall wake up the queue. Restarting the device and waking-up the queue is exactly what sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about the queue state, we must acquire a lock both in the reset handler and in the transmit path to ensure serialization of both operations. As the reset handler might still be called after the transmission of a frame to the transceiver but before it actually gets transmitted, we must ensure we don't leak the skb, so we free it (the behavior is consistent, no matter if there was an skb on the stack or not).
Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs") Cc: stable@vger.kernel.org Signed-off-by: Miquel Raynal miquel.raynal@bootlin.com ---
This patch was written and tested on a slightly older stable kernel as this is the kernel that runs on the boards which shown the problem, but there should be no difference with upstream kernels.
drivers/net/can/sja1000/sja1000.c | 13 ++++++++++++- drivers/net/can/sja1000/sja1000.h | 1 + drivers/net/can/sja1000/sja1000_platform.c | 2 ++ 3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c index ae47fc72aa96..fe1d818f5d51 100644 --- a/drivers/net/can/sja1000/sja1000.c +++ b/drivers/net/can/sja1000/sja1000.c @@ -297,6 +297,8 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb, if (can_dropped_invalid_skb(dev, skb)) return NETDEV_TX_OK;
+ spin_lock(&priv->tx_lock); + netif_stop_queue(dev);
fi = dlc = cf->can_dlc; @@ -335,6 +337,8 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
sja1000_write_cmdreg(priv, cmd_reg_val);
+ spin_unlock(&priv->tx_lock); + return NETDEV_TX_OK; }
@@ -394,9 +398,16 @@ static void sja1000_rx(struct net_device *dev) static irqreturn_t sja1000_reset_interrupt(int irq, void *dev_id) { struct net_device *dev = (struct net_device *)dev_id; + struct sja1000_priv *priv = netdev_priv(dev);
netdev_dbg(dev, "performing a soft reset upon overrun\n"); - sja1000_start(dev); + + spin_lock(&priv->tx_lock); + + can_free_echo_skb(dev, 0); + sja1000_set_mode(dev, CAN_MODE_START); + + spin_unlock(&priv->tx_lock);
return IRQ_HANDLED; } diff --git a/drivers/net/can/sja1000/sja1000.h b/drivers/net/can/sja1000/sja1000.h index 9f041d027dcc..85def6329edb 100644 --- a/drivers/net/can/sja1000/sja1000.h +++ b/drivers/net/can/sja1000/sja1000.h @@ -166,6 +166,7 @@ struct sja1000_priv { void __iomem *reg_base; /* ioremap'ed address to registers */ unsigned long irq_flags; /* for request_irq() */ spinlock_t cmdreg_lock; /* lock for concurrent cmd register writes */ + spinlock_t tx_lock; /* lock for serializing transmissions and soft resets */
u16 flags; /* custom mode flags */ u8 ocr; /* output control register */ diff --git a/drivers/net/can/sja1000/sja1000_platform.c b/drivers/net/can/sja1000/sja1000_platform.c index f33bad164813..ca3bcab461d8 100644 --- a/drivers/net/can/sja1000/sja1000_platform.c +++ b/drivers/net/can/sja1000/sja1000_platform.c @@ -305,6 +305,8 @@ static int sp_probe(struct platform_device *pdev) priv->can.clock.freq = clk_get_rate(clk) / 2; }
+ spin_lock_init(&priv->tx_lock); + platform_set_drvdata(pdev, dev); SET_NETDEV_DEV(dev, &pdev->dev);
On 22.09.2023 17:47:27, Miquel Raynal wrote:
Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000 CAN controller reception: the Rx buffer is only 5 messages long, so when the bus loaded (eg. a message every 50us), overrun may easily happen. Upon an overrun situation, due to a possible internal crosstalk situation, the controller enters a frozen state which only can be unlocked with a soft reset (experimentally). The solution was to offload a call to sja1000_start() in a threaded handler. This needs to happen in process context as this operation requires to sleep. sja1000_start() basically enters "reset mode", performs a proper software reset and returns back into "normal mode".
Since this fix was introduced, we no longer observe any stalls in reception. However it was sporadically observed that the transmit path would now freeze. Further investigation blamed the fix mentioned above, and especially the reset operation. Reproducing the reset in a loop helped identifying what could possibly go wrong. The sja1000 is a single Tx queue device, which leverages the netdev helpers to process one Tx message at a time. The logic is: the queue is stopped, the message sent to the transceiver, once properly transmitted the controller sets a status bit which triggers an interrupt, in the interrupt handler the transmission status is checked and the queue woken up. Unfortunately, if an overrun happens, we might perform the soft reset precisely between the transmission of the buffer to the transceiver and the advent of the transmission status bit. We would then stop the transmission operation without re-enabling the queue, leading to all further transmissions to be ignored.
The reset interrupt can only happen while the device is "open", and after a reset we anyway want to resume normal operations, no matter if a packet to transmit got dropped in the process, so we shall wake up the queue. Restarting the device and waking-up the queue is exactly what sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about the queue state, we must acquire a lock both in the reset handler and in the transmit path to ensure serialization of both operations. As the reset handler might still be called after the transmission of a frame to the transceiver but before it actually gets transmitted, we must ensure we don't leak the skb, so we free it (the behavior is consistent, no matter if there was an skb on the stack or not).
Can you make use of netif_tx_disable() and netif_wake_queue() in sja1000_reset_interrupt() instead of the lock?
Marc
On 27.09.2023 11:30:16, Marc Kleine-Budde wrote:
On 22.09.2023 17:47:27, Miquel Raynal wrote:
Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000 CAN controller reception: the Rx buffer is only 5 messages long, so when the bus loaded (eg. a message every 50us), overrun may easily happen. Upon an overrun situation, due to a possible internal crosstalk situation, the controller enters a frozen state which only can be unlocked with a soft reset (experimentally). The solution was to offload a call to sja1000_start() in a threaded handler. This needs to happen in process context as this operation requires to sleep. sja1000_start() basically enters "reset mode", performs a proper software reset and returns back into "normal mode".
Since this fix was introduced, we no longer observe any stalls in reception. However it was sporadically observed that the transmit path would now freeze. Further investigation blamed the fix mentioned above, and especially the reset operation. Reproducing the reset in a loop helped identifying what could possibly go wrong. The sja1000 is a single Tx queue device, which leverages the netdev helpers to process one Tx message at a time. The logic is: the queue is stopped, the message sent to the transceiver, once properly transmitted the controller sets a status bit which triggers an interrupt, in the interrupt handler the transmission status is checked and the queue woken up. Unfortunately, if an overrun happens, we might perform the soft reset precisely between the transmission of the buffer to the transceiver and the advent of the transmission status bit. We would then stop the transmission operation without re-enabling the queue, leading to all further transmissions to be ignored.
The reset interrupt can only happen while the device is "open", and after a reset we anyway want to resume normal operations, no matter if a packet to transmit got dropped in the process, so we shall wake up the queue. Restarting the device and waking-up the queue is exactly what sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about the queue state, we must acquire a lock both in the reset handler and in the transmit path to ensure serialization of both operations. As the reset handler might still be called after the transmission of a frame to the transceiver but before it actually gets transmitted, we must ensure we don't leak the skb, so we free it (the behavior is consistent, no matter if there was an skb on the stack or not).
Can you make use of netif_tx_disable() and netif_wake_queue() in sja1000_reset_interrupt() instead of the lock?
...or netif_tx_lock()/netif_tx_unlock().
Marc
Hi Marc,
mkl@pengutronix.de wrote on Wed, 27 Sep 2023 11:33:32 +0200:
On 27.09.2023 11:30:16, Marc Kleine-Budde wrote:
On 22.09.2023 17:47:27, Miquel Raynal wrote:
Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000 CAN controller reception: the Rx buffer is only 5 messages long, so when the bus loaded (eg. a message every 50us), overrun may easily happen. Upon an overrun situation, due to a possible internal crosstalk situation, the controller enters a frozen state which only can be unlocked with a soft reset (experimentally). The solution was to offload a call to sja1000_start() in a threaded handler. This needs to happen in process context as this operation requires to sleep. sja1000_start() basically enters "reset mode", performs a proper software reset and returns back into "normal mode".
Since this fix was introduced, we no longer observe any stalls in reception. However it was sporadically observed that the transmit path would now freeze. Further investigation blamed the fix mentioned above, and especially the reset operation. Reproducing the reset in a loop helped identifying what could possibly go wrong. The sja1000 is a single Tx queue device, which leverages the netdev helpers to process one Tx message at a time. The logic is: the queue is stopped, the message sent to the transceiver, once properly transmitted the controller sets a status bit which triggers an interrupt, in the interrupt handler the transmission status is checked and the queue woken up. Unfortunately, if an overrun happens, we might perform the soft reset precisely between the transmission of the buffer to the transceiver and the advent of the transmission status bit. We would then stop the transmission operation without re-enabling the queue, leading to all further transmissions to be ignored.
The reset interrupt can only happen while the device is "open", and after a reset we anyway want to resume normal operations, no matter if a packet to transmit got dropped in the process, so we shall wake up the queue. Restarting the device and waking-up the queue is exactly what sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about the queue state, we must acquire a lock both in the reset handler and in the transmit path to ensure serialization of both operations. As the reset handler might still be called after the transmission of a frame to the transceiver but before it actually gets transmitted, we must ensure we don't leak the skb, so we free it (the behavior is consistent, no matter if there was an skb on the stack or not).
Can you make use of netif_tx_disable() and netif_wake_queue() in sja1000_reset_interrupt() instead of the lock?
...or netif_tx_lock()/netif_tx_unlock().
As that's also a spinlock behind I guess it would fit. A quick look does not seem to show any specific constraint in using it, so let's go for it.
Thanks, Miquèl
linux-stable-mirror@lists.linaro.org