Did you audit whether it is safe to traverse pernet_list without holding pernet_ops_rwsem? When I reviewed this area for this issue several months ago, it appeared that pernet_ops_rwsem must be held while traversing pernet_list.
You also need to fix your mail client so that it sends text-only patches.
From: 张广辉 <zhang.guanghui@cestc.cn>
Sent: Sunday, April 24, 2022 2:02 AM
To: 张广辉 <zhang.guanghui@cestc.cn>; Roi Dayan <roid@nvidia.com>; Saeed Mahameed <saeedm@nvidia.com>; Parav Pandit <parav@nvidia.com>; Jason Gunthorpe <jgg@nvidia.com>; gregkh <gregkh@linuxfoundation.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>; stable <stable@vger.kernel.org>
Subject: Fix a devlink AB-BA deadlock on net namespace deletion
Hi all
Deleting a netns holds pernet_ops_rwsem and then takes devlink_mutex. At the same time, changing the eswitch mode to switchdev holds devlink_mutex, unregisters a netdevice notifier, and then takes pernet_ops_rwsem. So an AB-BA deadlock can happen. I have made a patch to fix the deadlock, and it works well. Please help review it. Thanks.
Example sequence is:
$ ip netns add foo
$ ip netns del foo &
$ devlink dev eswitch set pci/0000:af:00.1 mode switchdev
Process A:
cleanup_net()
  down_read(&pernet_ops_rwsem);      <- first sem acquired
  ops_pre_exit_list()
    pre_exit()
      devlink_pernet_pre_exit()
        mutex_lock(&devlink_mutex);  <- blocks, devlink_mutex is held by B

Process B:
genl_family_rcv_msg_doit
  pre_doit
    devlink_nl_pre_doit
      mutex_lock(&devlink_mutex);    <- first devlink_mutex acquired
  devlink_nl_cmd_eswitch_set_doit
    mlx5_devlink_eswitch_mode_set
      mlx5_lag_disable_change
        mlx5_disable_lag
          mlx5_rescan_drivers_locked
            device_del
              ...
                unregister_netdevice_notifier
                  down_write(&pernet_ops_rwsem);  <- blocks, sem is held by A
deleting netns trace:
[ 248.061947] INFO: task kworker/u160:3:1179 blocked for more than 122 seconds.
[ 248.061953] Not tainted 5.15.13-0.el9.x86_64 #1
[ 248.061955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.061956] task:kworker/u160:3 state:D stack: 0 pid: 1179 ppid: 2 flags:0x00004000
[ 248.061962] Workqueue: netns cleanup_net
[ 248.061970] Call Trace:
[ 248.061972] <TASK>
[ 248.061975] __schedule+0x200/0x540
[ 248.061982] schedule+0x44/0xa0
[ 248.061984] schedule_preempt_disabled+0xa/0x10
[ 248.061986] __mutex_lock.constprop.0+0x212/0x400
[ 248.061989] devlink_pernet_pre_exit+0x2a/0x140
[ 248.061994] cleanup_net+0x1d2/0x3a0
[ 248.061997] process_one_work+0x1e8/0x390
[ 248.062003] worker_thread+0x53/0x3c0
[ 248.062005] ? process_one_work+0x390/0x390
[ 248.062007] kthread+0x10c/0x130
[ 248.062011] ? set_kthread_struct+0x40/0x40
[ 248.062014] ret_from_fork+0x1f/0x30
[ 248.062020] </TASK>
changing mode to switchdev trace:
[ 248.062078] task:devlink state:D stack: 0 pid: 8546 ppid: 8542 flags:0x00004000
[ 248.062081] Call Trace:
[ 248.062082] <TASK>
[ 248.062083] __schedule+0x200/0x540
[ 248.062087] ? free_msg+0x3f/0xb0 [mlx5_core]
[ 248.062156] schedule+0x44/0xa0
[ 248.062158] rwsem_down_write_slowpath+0x19c/0x3c0
[ 248.062165] unregister_netdevice_notifier+0x1c/0xb0
[ 248.062168] mlx5_ib_roce_cleanup+0x8a/0x110 [mlx5_ib]
[ 248.062184] mlx5r_remove+0x36/0x60 [mlx5_ib]
[ 248.062196] auxiliary_bus_remove+0x18/0x30
[ 248.062200] __device_release_driver+0x177/0x240
[ 248.062203] device_release_driver+0x24/0x30
[ 248.062205] bus_remove_device+0xd8/0x140
[ 248.062210] device_del+0x18b/0x400
[ 248.062213] mlx5_rescan_drivers_locked.part.0+0x7e/0x150 [mlx5_core]
[ 248.062267] mlx5_disable_lag+0x149/0x160 [mlx5_core]
[ 248.062318] mlx5_lag_disable_change+0x60/0xa0 [mlx5_core]
[ 248.062369] mlx5_devlink_eswitch_mode_set+0x4b/0x1a0 [mlx5_core]
[ 248.062436] devlink_nl_cmd_eswitch_set_doit+0xc1/0x150
[ 248.062440] genl_family_rcv_msg_doit+0xe7/0x150
[ 248.062445] genl_rcv_msg+0xdc/0x1e0
[ 248.062448] ? __devlink_port_phys_port_name_get+0x1e0/0x1e0
[ 248.062451] ? genl_get_cmd+0xd0/0xd0
[ 248.062454] netlink_rcv_skb+0x4e/0xf0
[ 248.062457] genl_rcv+0x24/0x40
[ 248.062460] netlink_unicast+0x1fe/0x2d0
[ 248.062463] netlink_sendmsg+0x24f/0x4b0
[ 248.062466] sock_sendmsg+0x5b/0x60
[ 248.062469] __sys_sendto+0xf0/0x160
[ 248.062473] ? handle_mm_fault+0xbf/0x280
[ 248.062478] ? do_user_addr_fault+0x1d0/0x670
[ 248.062482] __x64_sys_sendto+0x20/0x30
[ 248.062484] do_syscall_64+0x38/0x90
[ 248.062487] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 248.062492] RIP: 0033:0x7ff8cc469c3a
[ 248.062494] RSP: 002b:00007ffe06025e08 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 248.062497] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007ff8cc469c3a
[ 248.062499] RDX: 0000000000000038 RSI: 000055c261bf7440 RDI: 0000000000000003
[ 248.062501] RBP: 0000000000000000 R08: 00007ff8cc52d200 R09: 000000000000000c
[ 248.062502] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 248.062503] R13: 000055c261bf72a0 R14: 000055c260a01d5c R15: 000055c261bf7440
[ 248.062505] </TASK>
the patch details:
diff --git a/linux/net/core/net_namespace.c b/linux/net/core/net_namespace.c
index 202fa5eac..5c872db1f 100644
--- a/linux/net/core/net_namespace.c
+++ b/linux/net/core/net_namespace.c
@@ -576,6 +576,7 @@ static void cleanup_net(struct work_struct *work)
 		list_add_tail(&net->exit_list, &net_exit_list);
 	}
 
+	up_read(&pernet_ops_rwsem);
 	/* Run all of the network namespace pre_exit methods */
 	list_for_each_entry_reverse(ops, &pernet_list, list)
 		ops_pre_exit_list(ops, &net_exit_list);
@@ -596,7 +597,6 @@ static void cleanup_net(struct work_struct *work)
 	list_for_each_entry_reverse(ops, &pernet_list, list)
 		ops_free_list(ops, &net_exit_list);
 
-	up_read(&pernet_ops_rwsem);
 
 	/* Ensure there are no outstanding rcu callbacks using this
 	 * network namespace.