-----Original Message----- From: Alexander Lobakin aleksander.lobakin@intel.com Sent: Monday, July 3, 2023 10:18 PM To: souradeep chakrabarti schakrabarti@linux.microsoft.com Cc: KY Srinivasan kys@microsoft.com; Haiyang Zhang haiyangz@microsoft.com; wei.liu@kernel.org; Dexuan Cui decui@microsoft.com; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Long Li longli@microsoft.com; Ajay Sharma sharmaajay@microsoft.com; leon@kernel.org; cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; stable@vger.kernel.org; Souradeep Chakrabarti schakrabarti@microsoft.com Subject: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Souradeep Chakrabarti schakrabarti@linux.microsoft.com Date: Mon, 3 Jul 2023 01:49:31 -0700
From: Souradeep Chakrabarti schakrabarti@linux.microsoft.com
Please sync your Git name and Git mail account settings, so that your own patches won't have "From:" when sending. From what I see, you need to correct first letters of name and surname to capital in the Git email settings block.
Thank you for pointing, I will fix it.
When unloading the MANA driver, mana_dealloc_queues() waits for the MANA hardware to complete any inflight packets and set the pending send count to zero. But if the hardware has failed, mana_dealloc_queues() could wait forever.
Fix this by adding a timeout to the wait. Set the timeout to 120 seconds, which is a somewhat arbitrary value that is more than long enough for functional hardware to complete any sends.
Signed-off-by: Souradeep Chakrabarti schakrabarti@linux.microsoft.com
Where's "Fixes:" tagging the blamed commit?
This is present from the day zero of the mana driver code. It has not been introduced in the code by any commit.
V3 -> V4:
- Fixed the commit message to describe the context.
- Removed the vf_unload_timeout, as it is not required.
drivers/net/ethernet/microsoft/mana/mana_en.c | 26 ++++++++++++++++--- 1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index a499e460594b..d26f1da70411 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -2346,7 +2346,10 @@ static int mana_dealloc_queues(struct net_device *ndev) { struct mana_port_context *apc = netdev_priv(ndev); struct gdma_dev *gd = apc->ac->gdma_dev;
unsigned long timeout; struct mana_txq *txq;
struct sk_buff *skb;
struct mana_cq *cq; int i, err;
if (apc->port_is_up)
@@ -2363,15 +2366,32 @@ static int mana_dealloc_queues(struct
net_device *ndev)
* to false, but it doesn't matter since mana_start_xmit() drops any * new packets due to apc->port_is_up being false. *
* Drain all the in-flight TX packets
* Drain all the in-flight TX packets.
* A timeout of 120 seconds for all the queues is used.
* This will break the while loop when h/w is not responding.
* This value of 120 has been decided here considering max
*/* number of queues.
- timeout = jiffies + 120 * HZ;
Why not initialize it right when declaring?
I will fix it in next version.
for (i = 0; i < apc->num_queues; i++) { txq = &apc->tx_qp[i].txq;
while (atomic_read(&txq->pending_sends) > 0)
while (atomic_read(&txq->pending_sends) > 0 &&
}time_before(jiffies, timeout)) { usleep_range(1000, 2000);> + }
120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 iterations. I know usleep_range() often is much less precise, but still. Do you really need that much time? Has this been measured during the tests that it can take up to 120 seconds or is it just some random value that "should be enough"? If you really need 120 seconds, I'd suggest using a timer / delayed work instead of wasting resources.
Here the intent is not waiting for 120 seconds, rather than avoid continue checking the pending_sends of each tx queues for an indefinite time, before freeing sk_buffs. The pending_sends can only get decreased for a tx queue, if mana_poll_tx_cq() gets called for a completion notification and increased by xmit.
In this particular bug, apc->port_is_up is not set to false, causing xmit to keep increasing the pending_sends for the queue and mana_poll_tx_cq() not getting called for the queue.
If we see the comment in the function mana_dealloc_queues(), it mentions it :
2346 /* No packet can be transmitted now since apc->port_is_up is false. 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable 2348 * a txq because it may not timely see apc->port_is_up being cleared 2349 * to false, but it doesn't matter since mana_start_xmit() drops any 2350 * new packets due to apc->port_is_up being false.
The value 120 seconds has been decided here based on maximum number of queues are allowed in this specific hardware, it is a safe assumption.
- for (i = 0; i < apc->num_queues; i++) {
txq = &apc->tx_qp[i].txq;
cq = &apc->tx_qp[i].tx_cq;
cq can be just &txq->tx_cq.
mana_txq structure does not have a pointer to mana_cq.
while (atomic_read(&txq->pending_sends)) {
skb = skb_dequeue(&txq->pending_skbs);
mana_unmap_skb(skb, apc);
napi_consume_skb(skb, cq->budget);
(you already have comment about this one)
atomic_sub(1, &txq->pending_sends);
}
- } /* We're 100% sure the queues can no longer be woken up, because
*/
- we're sure now mana_poll_tx_cq() can't be running.
Thanks, Olek