Re: [PATCH net] net: Lock lower level devices when updating features

8 May 2025


      On Wed, 2025-05-07 at 14:20 -0700, Stanislav Fomichev wrote:
...
On 05/07, Cosmin Ratiu wrote:
...
On Wed, 2025-05-07 at 15:13 +0000, Cosmin Ratiu wrote:
...
...
In any case, please hold off with picking this patch up, it
seems
there's a possibility of a real deadlock. Here's the scenario:
Hmm, are you sure you're calling __netdev_update_features on the
upper?
I don't see how the lower would be locked in that case. From my POW,
this is what happens:

your dev (lower) has a vlan on it (upper)
you call lro=off on the _lower_
this triggers FEAT_CHANGE notifier and vlan_device_event catches

it
4. since the lower has a vlan device (dev->vlan_info != NULL), it
goes
   over every other vlan in the group and triggers
netdev_update_features
   for the upper (netdev_update_features vlandev)
5. the upper tries to sync the features into the lower (including the
   one that triggered FEAT_CHANGE) and that's where the deadlock
happens
But I think (5) should be largely a no-op for the device triggering
the
notification, because the features have been already applied in
ethnl_set_features.
I'd move the lock into netdev_sync_lower_features, and only after
checking
the features (and making sure that we are going to change them). The
feature
check might be racy, but I think it should still work?
You are right, if I restrict the lower dev critical section to only the
call to __netdev_update_features for the lower dev there's no deadlock
any more, because the device with the lock held already had its
features updated.
I will send a new version of this patch soon after the full regression
suite finishes and I convince myself there are no more issues related
to this that we can encounter.
...
Can you also share the bonding stacktrace as well to confirm it's the
same issue?
Sure, here it is, it's the same scenario. bond_netdev_event gets called
on a slave dev, it recomputes features and updates all slaves
(bond_compute_features), and then the same lock is reacquired.
But this is also fixed with your suggestion above.
============================================
 WARNING: possible recursive locking detected
devlink/14341 is trying to acquire lock:
 ffff88810ebd8c80 (&dev_instance_lock_key#9){+.+.}-{4:4}, at:
__netdev_update_features+0x31e/0xe20
but task is already holding lock:
 ffff88810ebd8c80 (&dev_instance_lock_key#9){+.+.}-{4:4}, at:
mlx5e_attach_netdev+0x31f/0x360 [mlx5_core]
 and the lock comparison function returns 0:
other info that might help us debug this: 
  Possible unsafe locking scenario:
CPU0
        ----
   lock(&dev_instance_lock_key#9);
   lock(&dev_instance_lock_key#9);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by devlink/14341:
  #0: ffffffff830e5a50 (cb_lock){++++}-{4:4}, at: genl_rcv+0x15/0x40
  #1: ffff888164a5c250 (&devlink->lock_key){+.+.}-{4:4}, at:
devlink_get_from_attrs_lock+0xbc/0x180
  #2: ffffffff830cf708 (rtnl_mutex){+.+.}-{4:4}, at:
mlx5e_attach_netdev+0x30d/0x360 [mlx5_core]
  #3: ffff88810ebd8c80 (&dev_instance_lock_key#9){+.+.}-{4:4}, at:
mlx5e_attach_netdev+0x31f/0x360 [mlx5_core]
Call Trace:
  <TASK>
  dump_stack_lvl+0x69/0xa0
  print_deadlock_bug.cold+0xbd/0xca
  __lock_acquire+0x163c/0x2f00
  lock_acquire+0xd3/0x2e0
  __mutex_lock+0x98/0xf10
  __netdev_update_features+0x31e/0xe20
  netdev_change_features+0x1f/0x60
  bond_compute_features+0x24e/0x300 [bonding]
  bond_netdev_event+0x2e0/0x400 [bonding]
  notifier_call_chain+0x3d/0x100
  netdev_update_features+0x52/0x60
  mlx5e_attach_netdev+0x32f/0x360 [mlx5_core]
  mlx5e_netdev_attach_profile+0x48/0x90 [mlx5_core]
  mlx5e_netdev_change_profile+0x90/0xf0 [mlx5_core]
  mlx5e_vport_rep_load+0x414/0x490 [mlx5_core]
  __esw_offloads_load_rep+0x87/0xd0 [mlx5_core]
  mlx5_esw_offloads_rep_load+0x45/0xe0 [mlx5_core]
  esw_offloads_enable+0xb7b/0xca0 [mlx5_core] 
  mlx5_eswitch_enable_locked+0x293/0x430 [mlx5_core]
  mlx5_devlink_eswitch_mode_set+0x229/0x620 [mlx5_core]
  devlink_nl_eswitch_set_doit+0x60/0xd0
  genl_family_rcv_msg_doit+0xe0/0x130
  genl_rcv_msg+0x188/0x290
  netlink_rcv_skb+0x4b/0xf0
  genl_rcv+0x24/0x40
  netlink_unicast+0x1e1/0x2c0
  netlink_sendmsg+0x210/0x450
Cosmin.

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH net] net: Lock lower level devices when updating features