On Wed, 2025-05-07 at 15:13 +0000, Cosmin Ratiu wrote:
In any case, please hold off with picking this patch up, it seems there's a possibility of a real deadlock. Here's the scenario:
============================================ WARNING: possible recursive locking detected
ethtool/44150 is trying to acquire lock: ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at: __netdev_update_features+0x31e/0xe20
but task is already holding lock: ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at: ethnl_set_features+0xbc/0x4b0 and the lock comparison function returns 0:
other info that might help us debug this: Possible unsafe locking scenario:
CPU0 ---- lock(&dev_instance_lock_key#7); lock(&dev_instance_lock_key#7);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by ethtool/44150: #0: ffffffff830e5a50 (cb_lock){++++}-{4:4}, at: genl_rcv+0x15/0x40 #1: ffffffff830cf708 (rtnl_mutex){+.+.}-{4:4}, at: ethnl_set_features+0x88/0x4b0 #2: ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at: ethnl_set_features+0xbc/0x4b0
stack backtrace: Call Trace: <TASK> dump_stack_lvl+0x69/0xa0 print_deadlock_bug.cold+0xbd/0xca __lock_acquire+0x163c/0x2f00 lock_acquire+0xd3/0x2e0 __mutex_lock+0x98/0xf10 __netdev_update_features+0x31e/0xe20 netdev_update_features+0x1f/0x60 vlan_device_event+0x57d/0x930 [8021q] notifier_call_chain+0x3d/0x100 netdev_features_change+0x32/0x50 ethnl_set_features+0x17e/0x4b0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x188/0x290 [...]
I'm not sure how to solve this yet... Cosmin.
If it's not clear, the problem is that:
- the lower device is already ops locked
- netdev_feature_change gets called.
- __netdev_update_features gets called for the vlan (upper) dev.
- It tries to acquire the same lock instance as 1 (this patch).
- Deadlock.
One solution I can think of would be to run device notifiers for changing features outside the lock, it doesn't seem like the netdev lock has anything to do with what other devices might do with this information.
This can be triggered from many scenarios, I have another similar stack involving bonding.
What do you think?
All I could think of was to drop the lock during the netdev_features_changed notifier calls, like in the following hunk. I'm running this through regressions, let's see if it's a good idea or not.
diff --git a/net/core/dev.c b/net/core/dev.c index 1be7cb73a602..817fd5bc21b1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1514,7 +1514,12 @@ int dev_get_alias(const struct net_device *dev, char *name, size_t len) */ void netdev_features_change(struct net_device *dev) { + /* Drop the lock to avoid potential deadlocks from e.g. upper dev + * notifiers altering features of 'dev' and acquiring dev lock again. + */ + netdev_unlock_ops(dev); call_netdevice_notifiers(NETDEV_FEAT_CHANGE, dev); + netdev_lock_ops(dev); } EXPORT_SYMBOL(netdev_features_change);