The first patch fixes incorrect lock usage in the bonding driver. The second patch fixes the xfrm offload feature setup in active-backup mode. The third patch adds an IPsec offload selftest.
v5: use list_for_each_entry_safe() when deleting items from the list (Nikolay Aleksandrov)
    do not hold spin_lock_bh() while calling xdo_dev_state_free, which may sleep (Nikolay Aleksandrov)
    set xso.real_dev = NULL in case __xfrm_state_delete() is called in parallel (Cosmin Ratiu)
    remove the spin lock in bond_ipsec_add_sa_all() as it doesn't resolve the race condition
v4: hold xs->lock for bond_ipsec_{del, add}_sa_all (Cosmin Ratiu)
    use the defer helpers in lib.sh for the selftest (Petr Machata)
v3: move the ipsec deletion to bond_ipsec_free_sa (Cosmin Ratiu)
v2: do not turn carrier on if bond change link failed (Nikolay Aleksandrov)
    move the mutex lock to a work queue (Cosmin Ratiu)
Hangbin Liu (3):
  bonding: fix calling sleeping function in spin lock and some race conditions
  bonding: fix xfrm offload feature setup on active-backup mode
  selftests: bonding: add ipsec offload test

 drivers/net/bonding/bond_main.c               |  71 +++++---
 drivers/net/bonding/bond_netlink.c            |  16 +-
 include/net/bonding.h                         |   1 +
 .../selftests/drivers/net/bonding/Makefile    |   3 +-
 .../drivers/net/bonding/bond_ipsec_offload.sh | 154 ++++++++++++++++++
 .../selftests/drivers/net/bonding/config      |   4 +
 6 files changed, 222 insertions(+), 27 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_ipsec_offload.sh
The fixed commit placed mutex_lock() inside spin_lock_bh(), which triggers a warning:
BUG: sleeping function called from invalid context at...
Fix this by moving the IPsec deletion operation to bond_ipsec_free_sa, which is not called under spin_lock_bh().
Additionally, there are some race conditions, as bond_ipsec_del_sa_all() and __xfrm_state_delete() can run in parallel without any lock, e.g.
  bond_ipsec_del_sa_all()            __xfrm_state_delete()
    - .xdo_dev_state_delete            - bond_ipsec_del_sa()
    - .xdo_dev_state_free                - .xdo_dev_state_delete()
                                       - bond_ipsec_free_sa()
  bond active_slave changes              - .xdo_dev_state_free()

  bond_ipsec_add_sa_all()
    - ipsec->xs->xso.real_dev = real_dev;
    - xdo_dev_state_add
To fix this, take xs->lock in bond_ipsec_del_sa_all(), and delete the IPsec list entry when the XFRM state is DEAD, which prevents xdo_dev_state_free() from being triggered again in bond_ipsec_free_sa().
In bond_ipsec_add_sa(), if .xdo_dev_state_add() fails, xso.real_dev stays set and is never cleaned up, which causes trouble if __xfrm_state_delete() is called at the same time. Reset xso.real_dev to NULL if the state add fails.
Despite the above fixes, there are still races in bond_ipsec_add_sa() and bond_ipsec_add_sa_all(). They trigger if __xfrm_state_delete() is called immediately after we set xso.real_dev and before .xdo_dev_state_add() has finished, like

  ipsec->xs->xso.real_dev = real_dev;
                                     __xfrm_state_delete
                                       - bond_ipsec_del_sa()
                                         - .xdo_dev_state_delete()
                                       - bond_ipsec_free_sa()
                                         - .xdo_dev_state_free()
  .xdo_dev_state_add()

There is no good solution for this yet, so I have only added a FIXME note here and hope we can fix it in the future.
Fixes: 2aeeef906d5a ("bonding: change ipsec_lock from spin lock to mutex")
Reported-by: Jakub Kicinski kuba@kernel.org
Closes: https://lore.kernel.org/netdev/20241212062734.182a0164@kernel.org
Suggested-by: Cosmin Ratiu cratiu@nvidia.com
Signed-off-by: Hangbin Liu liuhangbin@gmail.com
---
 drivers/net/bonding/bond_main.c | 69 ++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 18 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e45bba240cbc..dd3d0d41d98f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -506,6 +506,7 @@ static int bond_ipsec_add_sa(struct xfrm_state *xs,
 		list_add(&ipsec->list, &bond->ipsec_list);
 		mutex_unlock(&bond->ipsec_lock);
 	} else {
+		xs->xso.real_dev = NULL;
 		kfree(ipsec);
 	}
 out:
@@ -541,7 +542,15 @@ static void bond_ipsec_add_sa_all(struct bonding *bond)
 		if (ipsec->xs->xso.real_dev == real_dev)
 			continue;
 
+		/* Skip dead xfrm states, they'll be freed later. */
+		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
+			continue;
+
 		ipsec->xs->xso.real_dev = real_dev;
+		/* FIXME: there is a race that before .xdo_dev_state_add()
+		 * is called, the __xfrm_state_delete() is called in parallel,
+		 * which will call .xdo_dev_state_delete() and xdo_dev_state_free()
+		 */
 		if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) {
 			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
 			ipsec->xs->xso.real_dev = NULL;
@@ -560,7 +569,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs)
 	struct net_device *bond_dev = xs->xso.dev;
 	struct net_device *real_dev;
 	netdevice_tracker tracker;
-	struct bond_ipsec *ipsec;
 	struct bonding *bond;
 	struct slave *slave;
 
@@ -592,22 +600,13 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs)
 	real_dev->xfrmdev_ops->xdo_dev_state_delete(xs);
 out:
 	netdev_put(real_dev, &tracker);
-	mutex_lock(&bond->ipsec_lock);
-	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
-		if (ipsec->xs == xs) {
-			list_del(&ipsec->list);
-			kfree(ipsec);
-			break;
-		}
-	}
-	mutex_unlock(&bond->ipsec_lock);
 }
 
 static void bond_ipsec_del_sa_all(struct bonding *bond)
 {
 	struct net_device *bond_dev = bond->dev;
+	struct bond_ipsec *ipsec, *tmp_ipsec;
 	struct net_device *real_dev;
-	struct bond_ipsec *ipsec;
 	struct slave *slave;
 
 	slave = rtnl_dereference(bond->curr_active_slave);
@@ -616,9 +615,22 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
 		return;
 
 	mutex_lock(&bond->ipsec_lock);
-	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
-		if (!ipsec->xs->xso.real_dev)
+	list_for_each_entry_safe(ipsec, tmp_ipsec, &bond->ipsec_list, list) {
+		spin_lock_bh(&ipsec->xs->lock);
+		if (!ipsec->xs->xso.real_dev) {
+			spin_unlock_bh(&ipsec->xs->lock);
 			continue;
+		}
+
+		if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
+			list_del(&ipsec->list);
+			kfree(ipsec);
+			/* Need to free device here, or the xs->xso.real_dev
+			 * may changed in bond_ipsec_add_sa_all and free
+			 * on old device will never be called.
+			 */
+			goto next;
+		}
 
 		if (!real_dev->xfrmdev_ops ||
 		    !real_dev->xfrmdev_ops->xdo_dev_state_delete ||
@@ -626,11 +638,20 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
 			slave_warn(bond_dev, real_dev,
 				   "%s: no slave xdo_dev_state_delete\n",
 				   __func__);
-		} else {
-			real_dev->xfrmdev_ops->xdo_dev_state_delete(ipsec->xs);
-			if (real_dev->xfrmdev_ops->xdo_dev_state_free)
-				real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
+			spin_unlock_bh(&ipsec->xs->lock);
+			continue;
 		}
+
+		real_dev->xfrmdev_ops->xdo_dev_state_delete(ipsec->xs);
+
+next:
+		/* set real_dev to NULL in case __xfrm_state_delete() is called in parallel */
+		ipsec->xs->xso.real_dev = NULL;
+
+		/* Unlock before freeing device state, it could sleep. */
+		spin_unlock_bh(&ipsec->xs->lock);
+		if (real_dev->xfrmdev_ops->xdo_dev_state_free)
+			real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
 	}
 	mutex_unlock(&bond->ipsec_lock);
 }
@@ -638,6 +659,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
 static void bond_ipsec_free_sa(struct xfrm_state *xs)
 {
 	struct net_device *bond_dev = xs->xso.dev;
+	struct bond_ipsec *ipsec, *tmp_ipsec;
 	struct net_device *real_dev;
 	netdevice_tracker tracker;
 	struct bonding *bond;
@@ -659,13 +681,24 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs)
 	if (!xs->xso.real_dev)
 		goto out;
 
-	WARN_ON(xs->xso.real_dev != real_dev);
+	if (xs->xso.real_dev != real_dev)
+		goto out;
 
 	if (real_dev && real_dev->xfrmdev_ops &&
 	    real_dev->xfrmdev_ops->xdo_dev_state_free)
 		real_dev->xfrmdev_ops->xdo_dev_state_free(xs);
 out:
 	netdev_put(real_dev, &tracker);
+
+	mutex_lock(&bond->ipsec_lock);
+	list_for_each_entry_safe(ipsec, tmp_ipsec, &bond->ipsec_list, list) {
+		if (ipsec->xs == xs) {
+			list_del(&ipsec->list);
+			kfree(ipsec);
+			break;
+		}
+	}
+	mutex_unlock(&bond->ipsec_lock);
 }
 
 /**
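For readers unfamiliar with why the patch switches to list_for_each_entry_safe(), the idea can be sketched in user space: when entries may be freed during the walk, the next pointer has to be saved before the current entry goes away. This is only a minimal illustration on a hand-rolled singly linked list; struct entry and the entry_*() helpers are hypothetical names, not the kernel list API or struct bond_ipsec.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tracked entry; not the kernel's struct bond_ipsec. */
struct entry {
	int key;
	struct entry *next;
};

/* Push a new entry at the head of the list and return the new head. */
static struct entry *entry_push(struct entry *head, int key)
{
	struct entry *e = malloc(sizeof(*e));

	assert(e);
	e->key = key;
	e->next = head;
	return e;
}

/* Free every entry whose key matches and return the new head.
 * The successor is saved in tmp before the current entry may be freed,
 * which is the same idea as the tmp cursor of a "safe" list walk.
 */
static struct entry *entry_remove_all(struct entry *head, int key)
{
	struct entry **link = &head;
	struct entry *cur, *tmp;

	for (cur = head; cur; cur = tmp) {
		tmp = cur->next;	/* saved before a possible free() */
		if (cur->key == key) {
			*link = tmp;	/* unlink, then free */
			free(cur);
		} else {
			link = &cur->next;
		}
	}
	return head;
}

/* Count the entries currently on the list. */
static int entry_count(const struct entry *head)
{
	int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

Note that a safe walk only protects the iteration itself. It does not make it valid to touch the freed entry afterwards, and it does not replace the locking discussed in this thread.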
On 3/7/25 05:19, Hangbin Liu wrote:
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index e45bba240cbc..dd3d0d41d98f 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -506,6 +506,7 @@ static int bond_ipsec_add_sa(struct xfrm_state *xs,
>  		list_add(&ipsec->list, &bond->ipsec_list);
>  		mutex_unlock(&bond->ipsec_lock);
>  	} else {
> +		xs->xso.real_dev = NULL;
>  		kfree(ipsec);
>  	}
>  out:
> @@ -541,7 +542,15 @@ static void bond_ipsec_add_sa_all(struct bonding *bond)
>  		if (ipsec->xs->xso.real_dev == real_dev)
>  			continue;
>  
> +		/* Skip dead xfrm states, they'll be freed later. */
> +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> +			continue;

As we commented earlier, reading this state without x->lock is wrong.

> +
>  		ipsec->xs->xso.real_dev = real_dev;
> +		/* FIXME: there is a race that before .xdo_dev_state_add()
> +		 * is called, the __xfrm_state_delete() is called in parallel,
> +		 * which will call .xdo_dev_state_delete() and xdo_dev_state_free()
> +		 */
>  		if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) {
>  			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
>  			ipsec->xs->xso.real_dev = NULL;
[snip]
TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.
Cheers, Nik
Hi Nikolay,

On Fri, Mar 07, 2025 at 09:42:49AM +0200, Nikolay Aleksandrov wrote:
> On 3/7/25 05:19, Hangbin Liu wrote:
> > +		/* Skip dead xfrm states, they'll be freed later. */
> > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> > +			continue;
>
> As we commented earlier, reading this state without x->lock is wrong.

But even if we add the lock, like

	spin_lock_bh(&ipsec->xs->lock);
	if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
		spin_unlock_bh(&ipsec->xs->lock);
		continue;
	}

we may still hit the race condition, as the following note says. So I just left it as is. But I can add the spin lock if you insist.

> > +
> >  		ipsec->xs->xso.real_dev = real_dev;
> > +		/* FIXME: there is a race that before .xdo_dev_state_add()
> > +		 * is called, the __xfrm_state_delete() is called in parallel,
> > +		 * which will call .xdo_dev_state_delete() and xdo_dev_state_free()
> > +		 */
> >  		if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) {
> >  			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
> >  			ipsec->xs->xso.real_dev = NULL;
>
> [snip]
>
> TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.

I agree it's not a good experience to let users use an unstable feature. But this is a race condition, although we don't have a good fix yet.

On the other hand, I think we can't remove a feature people are using, can we? What I can do is try my best to fix the issues.

By the way, I started this patch because my patch 2/3 is blocked by the selftest results from patch 3/3...

Thanks
Hangbin
On 3/7/25 10:11, Hangbin Liu wrote:
> Hi Nikolay,
>
> On Fri, Mar 07, 2025 at 09:42:49AM +0200, Nikolay Aleksandrov wrote:
> > On 3/7/25 05:19, Hangbin Liu wrote:
> > > +		/* Skip dead xfrm states, they'll be freed later. */
> > > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> > > +			continue;
> >
> > As we commented earlier, reading this state without x->lock is wrong.
>
> But even if we add the lock, like
>
> 	spin_lock_bh(&ipsec->xs->lock);
> 	if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
> 		spin_unlock_bh(&ipsec->xs->lock);
> 		continue;
> 	}
>
> we may still hit the race condition, as the following note says. So I just left it as is. But I can add the spin lock if you insist.

I don't insist at all, I just pointed out that this is buggy and the value doesn't make sense used like that. Adding more bugs to the existing code wouldn't make it better.

> > > +
> > >  		ipsec->xs->xso.real_dev = real_dev;
> > > +		/* FIXME: there is a race that before .xdo_dev_state_add()
> > > +		 * is called, the __xfrm_state_delete() is called in parallel,
> > > +		 * which will call .xdo_dev_state_delete() and xdo_dev_state_free()
> > > +		 */
> > >  		if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) {
> > >  			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
> > >  			ipsec->xs->xso.real_dev = NULL;
> >
> > [snip]
> >
> > TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.
>
> I agree it's not a good experience to let users use an unstable feature. But this is a race condition, although we don't have a good fix yet.
>
> On the other hand, I think we can't remove a feature people are using, can we? What I can do is try my best to fix the issues.

I do appreciate the hard work you've been doing on this, don't get me wrong, but this is not really uapi, it's an optimization. The path will become slower as it won't be offloaded, but it will still work and will be stable until a proper fix or new design comes in.

Are you suggesting to knowingly leave a race condition that might lead to a number of problems in place with a comment? IMO that is not ok, but ultimately it's up to the maintainers to decide if they can live with it. :)

> By the way, I started this patch because my patch 2/3 is blocked by the selftest results from patch 3/3...
>
> Thanks
> Hangbin
On Fri, Mar 07, 2025 at 10:33:57AM +0200, Nikolay Aleksandrov wrote:
> On 3/7/25 10:11, Hangbin Liu wrote:
> > Hi Nikolay,
> >
> > On Fri, Mar 07, 2025 at 09:42:49AM +0200, Nikolay Aleksandrov wrote:
> > > On 3/7/25 05:19, Hangbin Liu wrote:
> > > > +		/* Skip dead xfrm states, they'll be freed later. */
> > > > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> > > > +			continue;
> > >
> > > As we commented earlier, reading this state without x->lock is wrong.
> >
> > But even if we add the lock, like
> >
> > 	spin_lock_bh(&ipsec->xs->lock);
> > 	if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
> > 		spin_unlock_bh(&ipsec->xs->lock);
> > 		continue;
> > 	}
> >
> > we may still hit the race condition, as the following note says. So I just left it as is. But I can add the spin lock if you insist.
>
> I don't insist at all, I just pointed out that this is buggy and the value doesn't make sense used like that. Adding more bugs to the existing code wouldn't make it better.

I'm a little lost here. Do you mean we should hold the spin lock for all xs reads, or just for km.state? The current bonding code doesn't hold any xs lock for xs reads. And I saw that the xfrm code only holds the RCU read lock when it checks km.state.

> > > > +
> > > >  		ipsec->xs->xso.real_dev = real_dev;
> > > > +		/* FIXME: there is a race that before .xdo_dev_state_add()
> > > > +		 * is called, the __xfrm_state_delete() is called in parallel,
> > > > +		 * which will call .xdo_dev_state_delete() and xdo_dev_state_free()
> > > > +		 */
> > > >  		if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) {
> > > >  			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
> > > >  			ipsec->xs->xso.real_dev = NULL;
> > >
> > > [snip]
> > >
> > > TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.
> >
> > I agree it's not a good experience to let users use an unstable feature. But this is a race condition, although we don't have a good fix yet.
> >
> > On the other hand, I think we can't remove a feature people are using, can we? What I can do is try my best to fix the issues.
>
> I do appreciate the hard work you've been doing on this, don't get me wrong, but this is

I also appreciate your reviews. You helped find a lot of issues :)

> not really uapi, it's an optimization. The path will become slower as it won't be offloaded, but it will still work and will be stable until a proper fix or new design comes in.

I'm afraid it doesn't work correctly even before this patch. The race condition here has existed for a long time. What I do is fix part of the races and leave bond_ipsec_add_sa() and bond_ipsec_add_sa_all() almost unchanged.

The comment in the code is just to notify others that something is wrong there, instead of hiding it so that only we know about it.

Am I missing something?

Thanks
Hangbin

> Are you suggesting to knowingly leave a race condition that might lead to a number of problems in place with a comment? IMO that is not ok, but ultimately it's up to the maintainers to decide if they can live with it. :)
>
> > By the way, I started this patch because my patch 2/3 is blocked by the selftest results from patch 3/3...
> >
> > Thanks
> > Hangbin
On Fri, 7 Mar 2025 09:42:49 +0200 Nikolay Aleksandrov wrote:
> TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.

That's my feeling too, FWIW. I think we knew about this issue for a while now, the longer we wait the more users we may disrupt with the revert.
On Fri, Mar 07, 2025 at 09:03:32AM -0800, Jakub Kicinski wrote:
> On Fri, 7 Mar 2025 09:42:49 +0200 Nikolay Aleksandrov wrote:
> > TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.
>
> That's my feeling too, FWIW. I think we knew about this issue for a while now, the longer we wait the more users we may disrupt with the revert.

Steffen said we can't sleep in xfrm_timer_handler(), which calls __xfrm_state_delete(). So I can't find a way to handle the race between bond_ipsec_add_sa_all() -> xdo_dev_state_add, which may sleep, and __xfrm_state_delete() -> xdo_dev_state_delete, which can't sleep.

Hi Jay, do you have any comments?

Thanks
Hangbin
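The constraint described here (a callback that may sleep must not run while a spin lock is held) can be modelled in user space as a rough sketch. The toy_* helpers below are hypothetical and only count nesting depth; they are not the kernel's spin_lock_bh() or might_sleep(), and they do not solve the race, they only show why the patch drops xs->lock before calling xdo_dev_state_free.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: depth of nested "atomic" sections, loosely like holding a spin lock. */
static int atomic_depth;
/* Number of times a sleeping callback was attempted in atomic context. */
static int violations;

static void toy_spin_lock_bh(void)
{
	atomic_depth++;
}

static void toy_spin_unlock_bh(void)
{
	assert(atomic_depth > 0);
	atomic_depth--;
}

/* Returns true when sleeping would be allowed; records a violation otherwise. */
static bool toy_might_sleep(void)
{
	if (atomic_depth > 0) {
		violations++;
		return false;
	}
	return true;
}

/* Stand-in for a callback that may sleep, e.g. one that takes a mutex. */
static bool toy_sleeping_callback(void)
{
	return toy_might_sleep();
}

/* Problematic ordering: the sleeping callback runs inside the locked section. */
static bool call_inside_lock(void)
{
	bool ok;

	toy_spin_lock_bh();
	ok = toy_sleeping_callback();
	toy_spin_unlock_bh();
	return ok;
}

/* Ordering used for the free callback in the patch: unlock first, then call. */
static bool call_after_unlock(void)
{
	toy_spin_lock_bh();
	/* ... non-sleeping work under the lock ... */
	toy_spin_unlock_bh();
	return toy_sleeping_callback();
}
```

In this model call_inside_lock() reports a violation while call_after_unlock() does not. The open problem in the thread is that the add path needs a sleeping callback while the delete path runs from a context that cannot sleep, which a simple reordering like this does not address.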
On Fri, 2025-03-07 at 09:03 -0800, Jakub Kicinski wrote:
> On Fri, 7 Mar 2025 09:42:49 +0200 Nikolay Aleksandrov wrote:
> > TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.
>
> That's my feeling too, FWIW. I think we knew about this issue for a while now, the longer we wait the more users we may disrupt with the revert.

These are preexisting races between the bond link failover and the user removing the xfrm states. Unless the user wants to intentionally trigger these bugs, chances are nobody has ever encountered them in the wild in normal operation. In steady state, bond link failover works, and adding/removing states works. It's the combination of the two control plane events that may have a chance to double free or leak states.
I would not pull everything out just yet.
Today, I managed to find a solution for these races (I think), based on a patch series I am preparing against ipsec-next with other changes related to real_dev.
Hangbin, do you mind if I take over fixing the locking issue as part of my series? I plan to send it upstream in the following days.
Cosmin.
On Tue, Mar 11, 2025 at 09:08:49PM +0000, Cosmin Ratiu wrote:
> On Fri, 2025-03-07 at 09:03 -0800, Jakub Kicinski wrote:
> > On Fri, 7 Mar 2025 09:42:49 +0200 Nikolay Aleksandrov wrote:
> > > TBH, keeping buggy code with a comment doesn't sound good to me. I'd rather remove this support than tell people "good luck, it might crash". It's better to be safe until a correct design is in place which takes care of these issues.
> >
> > That's my feeling too, FWIW. I think we knew about this issue for a while now, the longer we wait the more users we may disrupt with the revert.
>
> These are preexisting races between the bond link failover and the user removing the xfrm states. Unless the user wants to intentionally trigger these bugs, chances are nobody has ever encountered them in the wild in normal operation. In steady state, bond link failover works, and adding/removing states works. It's the combination of the two control plane events that may have a chance to double free or leak states.
>
> I would not pull everything out just yet.
>
> Today, I managed to find a solution for these races (I think), based on a patch series I am preparing against ipsec-next with other changes related to real_dev.
>
> Hangbin, do you mind if I take over fixing the locking issue as part of my series? I plan to send it upstream in the following days.

No, I don't mind. Please go ahead and fix the locking issue. And thanks a lot for your reviews.

Regards
Hangbin
On Fri, Mar 07, 2025 at 03:19:01AM +0000, Hangbin Liu wrote:

...

> @@ -616,9 +615,22 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
>  		return;
>  
>  	mutex_lock(&bond->ipsec_lock);
> -	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> -		if (!ipsec->xs->xso.real_dev)
> +	list_for_each_entry_safe(ipsec, tmp_ipsec, &bond->ipsec_list, list) {
> +		spin_lock_bh(&ipsec->xs->lock);
> +		if (!ipsec->xs->xso.real_dev) {
> +			spin_unlock_bh(&ipsec->xs->lock);
>  			continue;
> +		}
> +
> +		if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
> +			list_del(&ipsec->list);
> +			kfree(ipsec);

Hi Hangbin,

Apologies if this was covered elsewhere, but ipsec is kfree'd here...

> +			/* Need to free device here, or the xs->xso.real_dev
> +			 * may changed in bond_ipsec_add_sa_all and free
> +			 * on old device will never be called.
> +			 */
> +			goto next;
> +		}
>  
>  		if (!real_dev->xfrmdev_ops ||
>  		    !real_dev->xfrmdev_ops->xdo_dev_state_delete ||
> @@ -626,11 +638,20 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
>  			slave_warn(bond_dev, real_dev,
>  				   "%s: no slave xdo_dev_state_delete\n",
>  				   __func__);
> -		} else {
> -			real_dev->xfrmdev_ops->xdo_dev_state_delete(ipsec->xs);
> -			if (real_dev->xfrmdev_ops->xdo_dev_state_free)
> -				real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
> +			spin_unlock_bh(&ipsec->xs->lock);
> +			continue;
>  		}
> +
> +		real_dev->xfrmdev_ops->xdo_dev_state_delete(ipsec->xs);
> +
> +next:
> +		/* set real_dev to NULL in case __xfrm_state_delete() is called in parallel */
> +		ipsec->xs->xso.real_dev = NULL;

... and then dereferenced here.

Flagged by Smatch.

> +
> +		/* Unlock before freeing device state, it could sleep. */
> +		spin_unlock_bh(&ipsec->xs->lock);
> +		if (real_dev->xfrmdev_ops->xdo_dev_state_free)
> +			real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
>  	}
>  	mutex_unlock(&bond->ipsec_lock);
>  }

...
On Sat, Mar 08, 2025 at 08:54:51AM +0000, Simon Horman wrote:
> On Fri, Mar 07, 2025 at 03:19:01AM +0000, Hangbin Liu wrote:
>
> ...
>
> > @@ -616,9 +615,22 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
> >  		return;
> >  
> >  	mutex_lock(&bond->ipsec_lock);
> > -	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > -		if (!ipsec->xs->xso.real_dev)
> > +	list_for_each_entry_safe(ipsec, tmp_ipsec, &bond->ipsec_list, list) {
> > +		spin_lock_bh(&ipsec->xs->lock);
> > +		if (!ipsec->xs->xso.real_dev) {
> > +			spin_unlock_bh(&ipsec->xs->lock);
> >  			continue;
> > +		}
> > +
> > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
> > +			list_del(&ipsec->list);
> > +			kfree(ipsec);
>
> Hi Hangbin,
>
> Apologies if this was covered elsewhere, but ipsec is kfree'd here...

Oh.. I need to get the xs with xs = ipsec->xs, then hold the xs lock.

Thanks
Hangbin
/* Need to free device here, or the xs->xso.real_dev
* may changed in bond_ipsec_add_sa_all and free
* on old device will never be called.
*/
goto next;
}
if (!real_dev->xfrmdev_ops || !real_dev->xfrmdev_ops->xdo_dev_state_delete || @@ -626,11 +638,20 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) slave_warn(bond_dev, real_dev, "%s: no slave xdo_dev_state_delete\n", __func__);
} else {
real_dev->xfrmdev_ops->xdo_dev_state_delete(ipsec->xs);
if (real_dev->xfrmdev_ops->xdo_dev_state_free)
real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
spin_unlock_bh(&ipsec->xs->lock);
}continue;
real_dev->xfrmdev_ops->xdo_dev_state_delete(ipsec->xs);
+next:
/* set real_dev to NULL in case __xfrm_state_delete() is called in parallel */
ipsec->xs->xso.real_dev = NULL;
... and the dereferenced here.
Flagged by Smatch.
/* Unlock before freeing device state, it could sleep. */
spin_unlock_bh(&ipsec->xs->lock);
if (real_dev->xfrmdev_ops->xdo_dev_state_free)
} mutex_unlock(&bond->ipsec_lock);real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
}
...
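The fix described in the reply above (take xs = ipsec->xs before the list entry can be freed, then use only the cached pointer) can be sketched in userspace C. This is only an illustration of the pattern, not the v6 patch: `struct state`, `struct entry` and `detach_entry()` are hypothetical stand-ins for the xfrm state, the bond ipsec list entry and the loop body, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the xfrm state an entry points to. */
struct state {
	bool dead;      /* models km.state == XFRM_STATE_DEAD */
	void *real_dev; /* models xso.real_dev */
};

/* Hypothetical stand-in for one element of the bond ipsec list. */
struct entry {
	struct state *xs;
};

/*
 * Detach one entry from its device. Returns true if the entry itself
 * was freed. The state pointer is read into a local first, so nothing
 * dereferences the entry after free().
 */
static bool detach_entry(struct entry *e)
{
	struct state *xs;
	bool freed = false;

	assert(e && e->xs);
	xs = e->xs;             /* cache before e may be freed */

	if (!xs->real_dev)
		return false;   /* nothing offloaded, nothing to do */

	if (xs->dead) {
		free(e);        /* e must not be touched from here on */
		freed = true;
	}

	xs->real_dev = NULL;    /* safe: goes through xs, not e->xs */
	return freed;
}
```

The state outlives the entry in this sketch, which mirrors the situation Smatch flagged: only the list entry is freed, so a pointer to the state copied beforehand remains valid.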
The active-backup bonding mode supports XFRM ESP offload. However, when a bond is added using a command like `ip link add bond0 type bond mode 1 miimon 100`, `ethtool -k` shows that XFRM ESP offload is disabled. This occurs because bond_newlink() changes the bond link first and registers the bond device afterwards. The XFRM feature update in bond_option_mode_set() is therefore skipped, as the bond device is not yet registered, and the offload feature is never set.
To resolve this, reorder bond_newlink() so that the bond device is registered before the bond link parameters are changed. The XFRM ESP offload feature is then enabled correctly.
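The ordering problem can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: `struct dev`, `set_mode()`, `newlink_old()` and `newlink_new()` are hypothetical names, and the only behavior taken from the description above is that the feature update is skipped while the device is unregistered.

```c
#include <assert.h>
#include <stdbool.h>

#define MODE_ACTIVE_BACKUP 1

/* Hypothetical model of a bond device; not the kernel's struct bonding. */
struct dev {
	bool registered;
	int mode;
	bool esp_offload;
};

/* Models the mode option handler: features are only recomputed for a
 * registered device. A negative mode models an invalid parameter. */
static int set_mode(struct dev *d, int mode)
{
	assert(d);
	if (mode < 0)
		return -1;
	d->mode = mode;
	if (d->registered)
		d->esp_offload = (mode == MODE_ACTIVE_BACKUP);
	return 0;
}

/* Old order: configure first, register afterwards. */
static int newlink_old(struct dev *d, int mode)
{
	int err = set_mode(d, mode);

	if (err)
		return err;
	d->registered = true;
	return 0;
}

/* New order: register first, configure, and roll back on failure. */
static int newlink_new(struct dev *d, int mode)
{
	int err;

	d->registered = true;
	err = set_mode(d, mode);
	if (err)
		d->registered = false; /* models unregistering the device */
	return err;
}
```

With the old order, a device created in active-backup mode ends up with `esp_offload` still false; with the new order it is true. The price is that a failed configuration step now has to undo the registration.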
Fixes: 007ab5345545 ("bonding: fix feature flag setting at init time")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 drivers/net/bonding/bond_main.c    |  2 +-
 drivers/net/bonding/bond_netlink.c | 16 +++++++++-------
 include/net/bonding.h              |  1 +
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index dd3d0d41d98f..a060960927e9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4422,7 +4422,7 @@ void bond_work_init_all(struct bonding *bond)
 	INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
 }
 
-static void bond_work_cancel_all(struct bonding *bond)
+void bond_work_cancel_all(struct bonding *bond)
 {
 	cancel_delayed_work_sync(&bond->mii_work);
 	cancel_delayed_work_sync(&bond->arp_work);
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 2a6a424806aa..ed16af6db557 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -568,18 +568,20 @@ static int bond_newlink(struct net *src_net, struct net_device *bond_dev,
 			struct nlattr *tb[], struct nlattr *data[],
 			struct netlink_ext_ack *extack)
 {
+	struct bonding *bond = netdev_priv(bond_dev);
 	int err;
 
-	err = bond_changelink(bond_dev, tb, data, extack);
-	if (err < 0)
+	err = register_netdevice(bond_dev);
+	if (err)
 		return err;
 
-	err = register_netdevice(bond_dev);
-	if (!err) {
-		struct bonding *bond = netdev_priv(bond_dev);
+	netif_carrier_off(bond_dev);
+	bond_work_init_all(bond);
 
-		netif_carrier_off(bond_dev);
-		bond_work_init_all(bond);
+	err = bond_changelink(bond_dev, tb, data, extack);
+	if (err) {
+		bond_work_cancel_all(bond);
+		unregister_netdevice(bond_dev);
 	}
 
 	return err;
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 8bb5f016969f..e5e005cd2e17 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -707,6 +707,7 @@ struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
 int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
 void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay);
 void bond_work_init_all(struct bonding *bond);
+void bond_work_cancel_all(struct bonding *bond);
 
 #ifdef CONFIG_PROC_FS
 void bond_create_proc_entry(struct bonding *bond);

This introduces a test for IPSec offload over bonding, utilizing netdevsim for the testing process, as veth interfaces do not support IPSec offload. The test will ensure that the IPSec offload functionality remains operational even after a failover event occurs in the bonding configuration.
Here is the test result:
TEST: bond_ipsec_offload (active_slave eth0)                        [ OK ]
TEST: bond_ipsec_offload (active_slave eth1)                        [ OK ]
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 .../selftests/drivers/net/bonding/Makefile    |   3 +-
 .../drivers/net/bonding/bond_ipsec_offload.sh | 154 ++++++++++++++++++
 .../selftests/drivers/net/bonding/config      |   4 +
 3 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_ipsec_offload.sh

diff --git a/tools/testing/selftests/drivers/net/bonding/Makefile b/tools/testing/selftests/drivers/net/bonding/Makefile
index 2b10854e4b1e..d5a7de16d33a 100644
--- a/tools/testing/selftests/drivers/net/bonding/Makefile
+++ b/tools/testing/selftests/drivers/net/bonding/Makefile
@@ -10,7 +10,8 @@ TEST_PROGS := \
 	mode-2-recovery-updelay.sh \
 	bond_options.sh \
 	bond-eth-type-change.sh \
-	bond_macvlan_ipvlan.sh
+	bond_macvlan_ipvlan.sh \
+	bond_ipsec_offload.sh
 
 TEST_FILES := \
 	lag_lib.sh \
diff --git a/tools/testing/selftests/drivers/net/bonding/bond_ipsec_offload.sh b/tools/testing/selftests/drivers/net/bonding/bond_ipsec_offload.sh
new file mode 100755
index 000000000000..4b19949a4c33
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/bonding/bond_ipsec_offload.sh
@@ -0,0 +1,154 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# IPsec over bonding offload test:
+#
+#  +----------------+
+#  |     bond0      |
+#  |       |        |
+#  |  eth0    eth1  |
+#  +---+-------+----+
+#
+# We use netdevsim instead of physical interfaces
+#-------------------------------------------------------------------
+# Example commands
+#   ip x s add proto esp src 192.0.2.1 dst 192.0.2.2 \
+#            spi 0x07 mode transport reqid 0x07 replay-window 32 \
+#            aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \
+#            sel src 192.0.2.1/24 dst 192.0.2.2/24
+#            offload dev bond0 dir out
+#   ip x p add dir out src 192.0.2.1/24 dst 192.0.2.2/24 \
+#            tmpl proto esp src 192.0.2.1 dst 192.0.2.2 \
+#            spi 0x07 mode transport reqid 0x07
+#
+#-------------------------------------------------------------------
+
+lib_dir=$(dirname "$0")
+source "$lib_dir"/../../../net/lib.sh
+algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+srcip=192.0.2.1
+dstip=192.0.2.2
+ipsec0=/sys/kernel/debug/netdevsim/netdevsim0/ports/0/ipsec
+ipsec1=/sys/kernel/debug/netdevsim/netdevsim0/ports/1/ipsec
+active_slave=""
+
+active_slave_changed()
+{
+	local old_active_slave=$1
+	local new_active_slave=$(ip -n ${ns} -d -j link show bond0 | \
+		jq -r ".[].linkinfo.info_data.active_slave")
+	[ "$new_active_slave" != "$old_active_slave" -a "$new_active_slave" != "null" ]
+}
+
+test_offload()
+{
+	# use ping to exercise the Tx path
+	ip netns exec $ns ping -I bond0 -c 3 -W 1 -i 0 $dstip >/dev/null
+
+	active_slave=$(ip -n ${ns} -d -j link show bond0 | \
+		       jq -r ".[].linkinfo.info_data.active_slave")
+
+	if [ $active_slave = $nic0 ]; then
+		sysfs=$ipsec0
+	elif [ $active_slave = $nic1 ]; then
+		sysfs=$ipsec1
+	else
+		check_err 1 "bond_ipsec_offload invalid active_slave $active_slave"
+	fi
+
+	# The tx/rx order in sysfs may changed after failover
+	grep -q "SA count=2 tx=3" $sysfs && grep -q "tx ipaddr=$dstip" $sysfs
+	check_err $? "incorrect tx count with link ${active_slave}"
+
+	log_test bond_ipsec_offload "active_slave ${active_slave}"
+}
+
+setup_env()
+{
+	if ! mount | grep -q debugfs; then
+		mount -t debugfs none /sys/kernel/debug/ &> /dev/null
+		defer umount /sys/kernel/debug/
+
+	fi
+
+	# setup netdevsim since dummy/veth dev doesn't have offload support
+	if [ ! -w /sys/bus/netdevsim/new_device ] ; then
+		modprobe -q netdevsim
+		if [ $? -ne 0 ]; then
+			echo "SKIP: can't load netdevsim for ipsec offload"
+			exit $ksft_skip
+		fi
+		defer modprobe -r netdevsim
+	fi
+
+	setup_ns ns
+	defer cleanup_ns $ns
+}
+
+setup_bond()
+{
+	ip -n $ns link add bond0 type bond mode active-backup miimon 100
+	ip -n $ns addr add $srcip/24 dev bond0
+	ip -n $ns link set bond0 up
+
+	ifaces=$(ip netns exec $ns bash -c '
+		sysfsnet=/sys/bus/netdevsim/devices/netdevsim0/net/
+		echo "0 2" > /sys/bus/netdevsim/new_device
+		while [ ! -d $sysfsnet ] ; do :; done
+		udevadm settle
+		ls $sysfsnet
+	')
+	nic0=$(echo $ifaces | cut -f1 -d ' ')
+	nic1=$(echo $ifaces | cut -f2 -d ' ')
+	ip -n $ns link set $nic0 master bond0
+	ip -n $ns link set $nic1 master bond0
+
+	# we didn't create a peer, make sure we can Tx by adding a permanent
+	# neighbour this need to be added after enslave
+	ip -n $ns neigh add $dstip dev bond0 lladdr 00:11:22:33:44:55
+
+	# create offloaded SAs, both in and out
+	ip -n $ns x p add dir out src $srcip/24 dst $dstip/24 \
+		tmpl proto esp src $srcip dst $dstip spi 9 \
+		mode transport reqid 42
+
+	ip -n $ns x p add dir in src $dstip/24 dst $srcip/24 \
+		tmpl proto esp src $dstip dst $srcip spi 9 \
+		mode transport reqid 42
+
+	ip -n $ns x s add proto esp src $srcip dst $dstip spi 9 \
+		mode transport reqid 42 $algo sel src $srcip/24 dst $dstip/24 \
+		offload dev bond0 dir out
+
+	ip -n $ns x s add proto esp src $dstip dst $srcip spi 9 \
+		mode transport reqid 42 $algo sel src $dstip/24 dst $srcip/24 \
+		offload dev bond0 dir in
+
+	# does offload show up in ip output
+	lines=`ip -n $ns x s list | grep -c "crypto offload parameters: dev bond0 dir"`
+	if [ $lines -ne 2 ] ; then
+		check_err 1 "bond_ipsec_offload SA offload missing from list output"
+	fi
+}
+
+trap defer_scopes_cleanup EXIT
+setup_env
+setup_bond
+
+# start Offload testing
+test_offload
+
+# do failover and re-test
+ip -n $ns link set $active_slave down
+slowwait 5 active_slave_changed $active_slave
+test_offload
+
+# make sure offload get removed from driver
+ip -n $ns x s flush
+ip -n $ns x p flush
+line0=$(grep -c "SA count=0" $ipsec0)
+line1=$(grep -c "SA count=0" $ipsec1)
+[ $line0 -ne 1 -o $line1 -ne 1 ]
+check_fail $? "bond_ipsec_offload SA not removed from driver"
+
+exit $EXIT_STATUS
diff --git a/tools/testing/selftests/drivers/net/bonding/config b/tools/testing/selftests/drivers/net/bonding/config
index dad4e5fda4db..054fb772846f 100644
--- a/tools/testing/selftests/drivers/net/bonding/config
+++ b/tools/testing/selftests/drivers/net/bonding/config
@@ -9,3 +9,7 @@ CONFIG_NET_CLS_FLOWER=y
 CONFIG_NET_SCH_INGRESS=y
 CONFIG_NLMON=y
 CONFIG_VETH=y
+CONFIG_INET_ESP=y
+CONFIG_INET_ESP_OFFLOAD=y
+CONFIG_XFRM_USER=m
+CONFIG_NETDEVSIM=m