When a devlink instance is unregistered the following happens (among other things):
t0 - The instance is marked with 'DEVLINK_UNREGISTERING'.
t1 - Blocking until an RCU grace period passes.
t2 - The 'DEVLINK_UNREGISTERING' mark is cleared from the instance.
When iterating over devlink instances (e.g., when requesting a dump of available instances) and encountering an instance that is currently being unregistered, the current code will loop around until the 'DEVLINK_UNREGISTERING' mark is cleared.
The iteration over devlink instances happens in an RCU critical section, so if the instance that is currently being unregistered was encountered between t0 and t1, the system will deadlock and RCU stalls will be reported [1]. The task unregistering the instance will forever wait for an RCU grace period to pass and the task iterating over the instances will forever wait for the mark to be cleared.
The issue can be reliably reproduced by increasing the time window between t0 and t1 (I used a 60-second sleep) and running the following reproducer [2].
Fix by skipping over instances that are currently being unregistered.
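The retry behavior described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel code: 'struct instance', find_registered(), lookup() and MAX_RETRIES are hypothetical names, a plain array stands in for the xarray, and the retry cap exists only so that the pre-patch ordering terminates instead of spinning forever. Reference counting and the netns check are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a devlink instance stored in the xarray. */
struct instance {
	bool registered;	/* models the DEVLINK_REGISTERED mark */
	bool unregistering;	/* models the DEVLINK_UNREGISTERING mark */
};

#define NO_INSTANCE ((size_t)-1)
#define MAX_RETRIES 1000	/* hypothetical cap; the real loop has none */

/* First registered slot at 'start' or later (like xa_find()), or strictly
 * after 'start' (like xa_find_after()). NO_INSTANCE if there is none.
 */
size_t find_registered(const struct instance *tbl, size_t n,
		       size_t start, bool after)
{
	for (size_t i = after ? start + 1 : start; i < n; i++)
		if (tbl[i].registered)
			return i;
	return NO_INSTANCE;
}

/* Model of the lookup loop. The first search is inclusive.
 * advance_before_check == false: the switch to "find after" is not reached
 * before the UNREGISTERING check, so a retry hits the same slot again
 * (pre-patch order).
 * advance_before_check == true: the switch happens before the check, so a
 * retry moves past the slot (post-patch order).
 * '*spun' is set when the retry cap is hit, i.e. no progress was made.
 */
size_t lookup(const struct instance *tbl, size_t n, size_t start,
	      bool advance_before_check, bool *spun)
{
	bool after = false;
	size_t idx = start;

	assert(tbl != NULL && spun != NULL);
	*spun = false;

	for (int tries = 0; tries < MAX_RETRIES; tries++) {
		idx = find_registered(tbl, n, idx, after);
		if (idx == NO_INSTANCE)
			return NO_INSTANCE;

		if (advance_before_check)
			after = true;

		if (tbl[idx].unregistering)
			continue;	/* models "goto retry" */

		return idx;
	}

	*spun = true;
	return NO_INSTANCE;
}
```

With a two-slot table where slot 0 is registered and unregistering and slot 1 is registered only, a lookup starting at 0 with the pre-patch order keeps finding slot 0 until the cap is hit, while the post-patch order skips slot 0 and returns slot 1. In the kernel the spinning happens inside an RCU read-side critical section, which is why the grace period at t1 can never complete.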
[1]
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P344
 (detected by 4, t=26002 jiffies, g=5773, q=12 ncpus=8)
task:devlink state:R running task stack:25568 pid:344 ppid:260 flags:0x00004002
[...]
Call Trace:
 xa_get_mark+0x184/0x3e0
 devlinks_xa_find_get.constprop.0+0xc6/0x2e0
 devlink_nl_cmd_get_dumpit+0x105/0x3f0
 netlink_dump+0x568/0xff0
 __netlink_dump_start+0x651/0x900
 genl_family_rcv_msg_dumpit+0x201/0x340
 genl_rcv_msg+0x573/0x780
 netlink_rcv_skb+0x15f/0x430
 genl_rcv+0x29/0x40
 netlink_unicast+0x546/0x800
 netlink_sendmsg+0x958/0xe60
 __sys_sendto+0x3a2/0x480
 __x64_sys_sendto+0xe1/0x1b0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x68/0xd2
[2]
# echo 10 > /sys/bus/netdevsim/new_device
# echo 10 > /sys/bus/netdevsim/del_device &
# devlink dev
Fixes: c2368b19807a ("net: devlink: introduce "unregistering" mark and use it during devlinks iteration")
Reported-by: Vivek Reddy Karri <vkarri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
I read the stable rules and I am not providing an "upstream commit ID" since the code in upstream has been reworked, making this fix irrelevant. The only affected stable kernel is 6.1.y.
---
 net/devlink/leftover.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/devlink/leftover.c b/net/devlink/leftover.c
index 032c7af065cd..c6f781a08d06 100644
--- a/net/devlink/leftover.c
+++ b/net/devlink/leftover.c
@@ -301,6 +301,9 @@ devlinks_xa_find_get(struct net *net, unsigned long *indexp, xa_mark_t filter,
 	if (!devlink)
 		goto unlock;
 
+	/* For a possible retry, the xa_find_after() should be always used */
+	xa_find_fn = xa_find_after;
+
 	/* In case devlink_unregister() was already called and "unregistering"
 	 * mark was set, do not allow to get a devlink reference here.
 	 * This prevents live-lock of devlink_unregister() wait for completion.
@@ -308,8 +311,6 @@ devlinks_xa_find_get(struct net *net, unsigned long *indexp, xa_mark_t filter,
 	if (xa_get_mark(&devlinks, *indexp, DEVLINK_UNREGISTERING))
 		goto retry;
 
-	/* For a possible retry, the xa_find_after() should be always used */
-	xa_find_fn = xa_find_after;
 	if (!devlink_try_get(devlink))
 		goto retry;
 	if (!net_eq(devlink_net(devlink), net)) {
On Tue, Oct 01, 2024 at 02:20:35PM +0300, Ido Schimmel wrote:
> I read the stable rules and I am not providing an "upstream commit ID" since the code in upstream has been reworked, making this fix irrelevant. The only affected stable kernel is 6.1.y.
You need to document the heck out of why this is only relevant for this one specific kernel branch IN the changelog text, so that we understand what is going on, AND you need to get acks from the relevant maintainers of this area of the kernel to accept something that is not in Linus's tree.
But first off, why? Why not just take the upstream commits instead?
thanks,
greg k-h
On Tue, Oct 01, 2024 at 02:11:27PM +0200, Greg KH wrote:
> On Tue, Oct 01, 2024 at 02:20:35PM +0300, Ido Schimmel wrote:
> > I read the stable rules and I am not providing an "upstream commit ID" since the code in upstream has been reworked, making this fix irrelevant. The only affected stable kernel is 6.1.y.
> You need to document the heck out of why this is only relevant for this one specific kernel branch IN the changelog text, so that we understand what is going on, AND you need to get acks from the relevant maintainers of this area of the kernel to accept something that is not in Linus's tree.
> But first off, why? Why not just take the upstream commits instead?
There were a lot of changes as part of the 6.3 cycle to completely rework the semantics of the devlink instance reference count. As part of these changes, commit d77278196441 ("devlink: bump the instance index directly when iterating") inadvertently fixed the bug mentioned in this patch. This commit cannot be applied to 6.1.y as-is because a prior commit (also in 6.3) moved the code to a different file (leftover.c -> core.c). There might be more dependencies that I'm currently unaware of.
The alternative, proposed in this patch, is to provide a minimal and contained fix for the bug introduced in upstream commit c2368b19807a ("net: devlink: introduce "unregistering" mark and use it during devlinks iteration") as part of the 6.0 cycle.
The above explains why the patch is only relevant to 6.1.y.
Jakub / Jiri, what is your preference here? This patch or cherry picking a lot of code from 6.3?
Hi Ido,
On Tue, 1 Oct 2024 16:38:39 +0300, Ido Schimmel wrote:
> There were a lot of changes as part of the 6.3 cycle to completely rework the semantics of the devlink instance reference count. As part of these changes, commit d77278196441 ("devlink: bump the instance index directly when iterating") inadvertently fixed the bug mentioned in this patch. This commit cannot be applied to 6.1.y as-is because a prior commit (also in 6.3) moved the code to a different file (leftover.c -> core.c). There might be more dependencies that I'm currently unaware of.
> The alternative, proposed in this patch, is to provide a minimal and contained fix for the bug introduced in upstream commit c2368b19807a ("net: devlink: introduce "unregistering" mark and use it during devlinks iteration") as part of the 6.0 cycle.
> The above explains why the patch is only relevant to 6.1.y.
Thanks for bringing up this topic!
For what it's worth, syzbot would also greatly benefit from your fix: https://github.com/google/syzkaller/issues/5328
I've built a kernel locally with your changes, run syzkaller against it, and I can confirm that the kernel no longer crashes due to devlink.
On Tue, Oct 01, 2024 at 06:47:59PM +0200, Aleksandr Nogikh wrote:
> Thanks for bringing up this topic!
> For what it's worth, syzbot would also greatly benefit from your fix: https://github.com/google/syzkaller/issues/5328
> I've built a kernel locally with your changes, run syzkaller against it, and I can confirm that the kernel no longer crashes due to devlink.
Good to know :)
I hope this patch can be accepted as-is instead of a much larger patch. Will copy you on the next version.
Thanks for testing!
On Tue, 1 Oct 2024 16:38:39 +0300 Ido Schimmel wrote:
> There were a lot of changes as part of the 6.3 cycle to completely rework the semantics of the devlink instance reference count. As part of these changes, commit d77278196441 ("devlink: bump the instance index directly when iterating") inadvertently fixed the bug mentioned in this patch. This commit cannot be applied to 6.1.y as-is because a prior commit (also in 6.3) moved the code to a different file (leftover.c -> core.c). There might be more dependencies that I'm currently unaware of.
> The alternative, proposed in this patch, is to provide a minimal and contained fix for the bug introduced in upstream commit c2368b19807a ("net: devlink: introduce "unregistering" mark and use it during devlinks iteration") as part of the 6.0 cycle.
> The above explains why the patch is only relevant to 6.1.y.
> Jakub / Jiri, what is your preference here? This patch or cherry picking a lot of code from 6.3?
No preference here. The fix as posted looks correct. The backport of the upstream commit should be correct too (I don't see any incompatibilities) but as you said the code has moved and got exposed via a header, so the diff will look quite different.
I think Greg would still prefer to use the bastardized upstream commit in such cases.
On Tue, Oct 01, 2024 at 03:39:53PM -0700, Jakub Kicinski wrote:
> No preference here. The fix as posted looks correct. The backport of the upstream commit should be correct too (I don't see any incompatibilities) but as you said the code has moved and got exposed via a header, so the diff will look quite different.
> I think Greg would still prefer to use the bastardized upstream commit in such cases.
Greg, if I augment the commit message with the necessary information, would you be willing to take this patch instead of a much larger patch?
On Sun, Oct 06, 2024 at 11:44:42AM +0300, Ido Schimmel wrote:
> Greg, if I augment the commit message with the necessary information, would you be willing to take this patch instead of a much larger patch?
I almost always want to take whatever is in Linus's tree to ensure that it will be easier to maintain over time due to other changes needing to happen in the same area over the next 5+ years. ALSO, almost always, without fail, whenever we take code that is NOT in Linus's tree, it's wrong, and it needs to be fixed up again as it's going outside of our normal development and review and testing processes.
So unless there is some MAJOR reason why we can't just take what is in Linus's tree, I strongly prefer that. It is trivial for us to take 10's and even 100's of patches of backports over a one-off change that is going to end up costing us more work and effort over time.
thanks,
greg k-h
On Tue, Oct 01, 2024 at 02:20:35PM +0300, Ido Schimmel wrote:
> I read the stable rules and I am not providing an "upstream commit ID" since the code in upstream has been reworked, making this fix irrelevant. The only affected stable kernel is 6.1.y.
Could you please add some information about *how* the code has changed so backporting became infeasible? In our experience, custom backports tend to cause more pain than what they fix, so at least our future-selves could use some pointers around what happened here.