On Sat, May 18, 2024 at 11:46 AM David Wei dw@davidwei.uk wrote:
On 2024-05-10 16:21, Mina Almasry wrote:
+void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) +{
struct netdev_rx_queue *rxq;
unsigned long xa_idx;
unsigned int rxq_idx;
if (!binding)
return;
if (binding->list.next)
list_del(&binding->list);
xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
if (rxq->mp_params.mp_priv == binding) {
/* We hold the rtnl_lock while binding/unbinding
* dma-buf, so we can't race with another thread that
* is also modifying this value. However, the page_pool
* may read this config while it's creating its
* rx-queues. WRITE_ONCE() here to match the
* READ_ONCE() in the page_pool.
*/
WRITE_ONCE(rxq->mp_params.mp_ops, NULL);
WRITE_ONCE(rxq->mp_params.mp_priv, NULL);
rxq_idx = get_netdev_rx_queue_index(rxq);
netdev_rx_queue_restart(binding->dev, rxq_idx);
What if netdev_rx_queue_restart() fails? Depending on where it failed, a queue might still be filled from struct net_devmem_dmabuf_binding. This is one downside of the current situation with netdev_rx_queue_restart() needing to do allocations each time.
Perhaps a full reset if individual queue restart fails?
Sorry for the late reply, I've been out on vacation for a few days and caught up to some other work.
Yes, netdev_rx_queue_restart() can fail, but I'm not sure how to recover. Full reset would be an option, but it may be way too big of a hammer to do a full reset on this failure. Also, last I discussed with Jakub, AFAIU, there is no way for core to reset the driver? I had suggested to Jakub to use ndo_stop/ndo_open to reset the driver on queue binding/unbinding, but he rejected that as it could cause the driver to fail to come back up, which would leave the machine stranded from the network. This is why we implemented the queue API, as a way to do the binding/unbinding without risking the machine stranding via a full reset. This is the previous convo from months back[1].
So, all in all, I don't see anything amazing we can do here to recover. How about just log? I will add a warning in the next iteration.
(I applied most of the rest of your suggestions btw).
[1] https://patchwork.kernel.org/project/netdevbpf/patch/20231106024413.2801438-...