Hi Cosmin,

On Fri, Feb 28, 2025 at 10:31:58AM +0000, Cosmin Ratiu wrote:
On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote:
On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote:
One more thing - note I'm not an xfrm expert by far, but it seems to me that here you also have to call xdo_dev_state_free() with the old active slave dev, otherwise that will never get called with the original real_dev after the switch to a new active slave (or, more accurately, it might if the GC runs between the switching, but that is a race). Care must be taken wrt the sequence of events because the XFRM
Can we just call xs->xso.real_dev->xfrmdev_ops->xdo_dev_state_free(xs) regardless of whether xs->xso.real_dev == real_dev or not? I'm afraid calling xdo_dev_state_free() everywhere may make us get lost more easily.
You'd have to check all drivers that implement the callback to answer that and even then I'd stick to the canonical way of how it's done in xfrm and make the bond just passthrough. Any other games become dangerous and new code will have to be carefully reviewed every time, calling another device's free_sa when it wasn't added before doesn't sound good.
GC may be running in parallel which probably means that in bond_ipsec_free_sa() you'll have to take the mutex before calling xdo_dev_state_free() and check if the entry is still linked in the bond's ipsec list before calling the free_sa callback, if it isn't then del_sa_all got to it before the GC and there's nothing to do if it also called the dev's free_sa callback. The check for real_dev doesn't seem enough to protect against this race.
I agree that we need to take the mutex before calling xdo_dev_state_free() in bond_ipsec_free_sa(). Do you think that is enough? I'm a bit lost here.
Thanks,
Hangbin
Well, the race is between the xfrm GC and del_sa_all. In bond's free_sa, if you walk the list under the mutex before calling real_dev's free callback and don't find the element currently being freed, then it was already cleaned up by del_sa_all; otherwise del_sa_all is still waiting to walk that list and clean up the entries. I think it should be fine as long as free_sa is called exactly once, with the proper device.
OK, so the free will be called either in del_sa_all() or free_sa(). Something like this?
[...]
Unfortunately, after applying these changes and reasoning about them for a bit, I don't think this will work. There are still races left. For example:
- An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and, before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all is called in parallel. It doesn't call delete on xs (because it's dead), then calls free (incorrect without delete first), then removes the list entry. Later, .xdo_dev_state_delete() (i.e., bond_ipsec_del_sa) is called and calls delete (incorrect, out of order with free). Finally, bond_ipsec_free_sa is called, which fortunately doesn't do anything silly in the newly proposed form because xs is no longer in the list.
- A more sinister form of the above race can happen when
bond_ipsec_del_sa_all() calls delete on real_dev; then, in parallel and immediately after, __xfrm_state_delete marks xs as DEAD and calls bond_ipsec_del_sa(), which happily calls delete on real_dev again.
In order to fix these races (and others like it), I think bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x->lock for each xs being processed. This would prevent xfrm from concurrently initiating add/delete operations on the managed states.
Just to make sure I added the lock in the correct place, could you please help confirm?
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e85878b12376..c59ad3a5cf43 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -537,19 +537,25 @@ static void bond_ipsec_add_sa_all(struct bonding *bond)
 	}

 	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
+		spin_lock_bh(&ipsec->xs->lock);
 		/* Skip dead xfrm states, they'll be freed later. */
-		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
+		if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
+			spin_unlock_bh(&ipsec->xs->lock);
 			continue;
+		}

 		/* If new state is added before ipsec_lock acquired */
-		if (ipsec->xs->xso.real_dev == real_dev)
+		if (ipsec->xs->xso.real_dev == real_dev) {
+			spin_unlock_bh(&ipsec->xs->lock);
 			continue;
+		}

 		ipsec->xs->xso.real_dev = real_dev;
 		if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) {
 			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
 			ipsec->xs->xso.real_dev = NULL;
 		}
+		spin_unlock_bh(&ipsec->xs->lock);
 	}
 out:
 	mutex_unlock(&bond->ipsec_lock);
@@ -614,6 +620,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
 		if (!ipsec->xs->xso.real_dev)
 			continue;

+		spin_lock_bh(&ipsec->xs->lock);
 		if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
 			/* already dead no need to delete again */
 			if (ipsec->xs->xso.real_dev == real_dev &&
@@ -621,6 +628,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
 			    real_dev->xfrmdev_ops->xdo_dev_state_free)
 				real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
 			list_del(&ipsec->list);
 			kfree(ipsec);
+			spin_unlock_bh(&ipsec->xs->lock);
 			continue;
 		}

@@ -635,6 +643,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
 			if (real_dev->xfrmdev_ops->xdo_dev_state_free)
 				real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
 		}
+		spin_unlock_bh(&ipsec->xs->lock);
 	}
 	mutex_unlock(&bond->ipsec_lock);
 }
Thanks,
Hangbin