A few cleanups and a bugfix that are either follow-ups to swap table phase I or were found during code review.
Patch 1 is a bugfix that needs to be included in the stable branch; the rest have no behavior change.
---
Kairui Song (4):
  mm, swap: do not perform synchronous discard during allocation
  mm, swap: rename helper for setup bad slots
  mm, swap: cleanup swap entry allocation parameter
  mm/migrate, swap: drop usage of folio_index
 include/linux/swap.h |  4 ++--
 mm/migrate.c         |  4 ++--
 mm/shmem.c           |  2 +-
 mm/swap.h            | 21 -----------------
 mm/swapfile.c        | 64 ++++++++++++++++++++++++++++++++++++----------------
 mm/vmscan.c          |  4 ++--
 6 files changed, 52 insertions(+), 47 deletions(-)
---
base-commit: 53e573001f2b5168f9b65d2b79e9563a3b479c17
change-id: 20251007-swap-clean-after-swap-table-p1-b9a7635ee3fa
Best regards,
From: Kairui Song kasong@tencent.com
Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path"), swap allocation is protected by a local lock, which means we can't make any sleeping calls during allocation.
However, the discard routine is not well taken care of. When the swap allocator fails to find any usable cluster, it will look at the pending discard clusters and try to issue some blocking discards. It may not necessarily sleep, but the cond_resched() at the bio layer indicates this is wrong when combined with a local lock. And the GFP flag used for the discard bio is also wrong (not atomic).
It's arguable whether this synchronous discard is helpful at all. In most cases, the async discard is good enough. And since the recent rework, the swap allocator organizes the clusters very differently, so it is very rare to see discard clusters piling up.
So far, no issues have been observed or reported with typical SSD setups under months of high pressure. This issue was found during my code review. But by hacking the kernel a bit (adding an mdelay(100) in the async discard path), this issue becomes observable, with WARNINGs triggered by the wrong GFP flag and the cond_resched() in the bio layer.
So let's fix this issue in a safe way: remove the synchronous discard from the swap allocation path. And when an order 0 allocation fails with all cluster lists drained on all swap devices, try to do a discard following the swap device priority list. If any discard releases some clusters, try the allocation again. This way, we can still avoid OOM due to swap failure if the hardware is very slow and memory pressure is extremely high.
Cc: stable@vger.kernel.org
Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
Signed-off-by: Kairui Song kasong@tencent.com
---
 mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cb2392ed8e0e..0d1924f6f495 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		goto done;
 	}
-	/*
-	 * We don't have free cluster but have some clusters in discarding,
-	 * do discard now and reclaim them.
-	 */
-	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
-		goto new_cluster;
-
 	if (order)
 		goto done;
@@ -1394,6 +1387,33 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 	return false;
 }
+/*
+ * Discard pending clusters in a synchronized way when under high pressure.
+ * Return: true if any cluster is discarded.
+ */
+static bool swap_sync_discard(void)
+{
+	bool ret = false;
+	int nid = numa_node_id();
+	struct swap_info_struct *si, *next;
+
+	spin_lock(&swap_avail_lock);
+	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+		spin_unlock(&swap_avail_lock);
+		if (get_swap_device_info(si)) {
+			if (si->flags & SWP_PAGE_DISCARD)
+				ret = swap_do_scheduled_discard(si);
+			put_swap_device(si);
+		}
+		if (ret)
+			break;
+		spin_lock(&swap_avail_lock);
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return ret;
+}
+
 /**
  * folio_alloc_swap - allocate swap space for a folio
  * @folio: folio we want to move to swap
@@ -1432,11 +1452,17 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 		}
 	}
+again:
 	local_lock(&percpu_swap_cluster.lock);
 	if (!swap_alloc_fast(&entry, order))
 		swap_alloc_slow(&entry, order);
 	local_unlock(&percpu_swap_cluster.lock);
+	if (unlikely(!order && !entry.val)) {
+		if (swap_sync_discard())
+			goto again;
+	}
+
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
 	if (mem_cgroup_try_charge_swap(folio, entry))
 		goto out_free;
On Mon, Oct 6, 2025 at 1:03 PM Kairui Song ryncsn@gmail.com wrote:
Seems reasonable to me.
Acked-by: Nhat Pham nphamcs@gmail.com
Hi Kairui,
First of all, your title is a bit misleading: "do not perform synchronous discard during allocation"
You still do the synchronous discard, just limited to order 0 failing.
Also, your commit message did not describe the behavior change of this patch. The behavior change is that it now prefers to allocate from the fragment list before waiting for the discard, which I feel is not justified.
After reading your patch, I feel that you still do the synchronous discard, just now you do it with less lock held. I suggest you just fix the lock held issue without changing the discard ordering behavior.
On Mon, Oct 6, 2025 at 1:03 PM Kairui Song ryncsn@gmail.com wrote:
> From: Kairui Song kasong@tencent.com
>
> Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path"), swap allocation is protected by a local lock, which means we can't make any sleeping calls during allocation.
>
> However, the discard routine is not well taken care of. When the swap allocator fails to find any usable cluster, it will look at the pending discard clusters and try to issue some blocking discards. It may not necessarily sleep, but the cond_resched() at the bio layer indicates this is wrong when combined with a local lock. And the GFP flag used for the discard bio is also wrong (not atomic).
If lock is the issue, let's fix the lock issue.
> It's arguable whether this synchronous discard is helpful at all. In most cases, the async discard is good enough. And since the recent rework, the swap allocator organizes the clusters very differently, so it is very rare to see discard clusters piling up.
Very rare does not mean this never happens. If you have a cluster on the discarding queue, I think it is better to wait for the discard to complete before using the fragmented list, to reduce the fragmentation. So it seems the real issue is holding a lock while doing the block discard?
> So far, no issues have been observed or reported with typical SSD setups under months of high pressure. This issue was found during my code review. But by hacking the kernel a bit (adding an mdelay(100) in the async discard path), this issue becomes observable, with WARNINGs triggered by the wrong GFP flag and the cond_resched() in the bio layer.
I think that makes an assumption on how slow the SSD discard is. Some SSD can be really slow. We want our kernel to work for those slow discard SSD cases as well.
> So let's fix this issue in a safe way: remove the synchronous discard from the swap allocation path. And when an order 0 allocation fails with all cluster lists drained on all swap devices, try to do a discard following the swap
I don't feel that changing the discard behavior is justified here; the real fix is discarding with less lock held. Am I missing something? If I understand correctly, we should be able to keep the current discard ordering behavior (discard before the fragment list), but with less lock held, as your current patch does.
I suggest the allocation here detects there is a discard pending and running out of free blocks. Return there and indicate the need to discard. The caller performs the discard without holding the lock, similar to what you do with the order == 0 case.
> device priority list. If any discard releases some clusters, try the allocation again. This way, we can still avoid OOM due to swap failure if the hardware is very slow and memory pressure is extremely high.
> Cc: stable@vger.kernel.org
> Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> Signed-off-by: Kairui Song kasong@tencent.com
> ---
>  mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cb2392ed8e0e..0d1924f6f495 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  		goto done;
>  	}
>
> -	/*
> -	 * We don't have free cluster but have some clusters in discarding,
> -	 * do discard now and reclaim them.
> -	 */
> -	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> -		goto new_cluster;
Assume you follow my suggestion. Change this to some function to detect if there is a pending discard on this device. Return to the caller indicating that you need a discard for this device that has a pending discard. Add an output argument to indicate the discard device "discard" if needed.
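Roughly something like this, perhaps (just a sketch; the need_discard output argument and the exact bail-out point are made up here, not part of the posted patch):

	/* Hypothetical sketch for cluster_alloc_swap_entry(): instead of
	 * issuing the blocking discard under the local lock, report the
	 * pending discard to the caller via a new output argument. */
	if (!list_empty(&si->discard_clusters)) {
		*need_discard = true;
		goto done;
	}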
> -
>  	if (order)
>  		goto done;
>
> @@ -1394,6 +1387,33 @@ static bool swap_alloc_slow(swp_entry_t *entry,
>  	return false;
>  }
> +/*
> + * Discard pending clusters in a synchronized way when under high pressure.
> + * Return: true if any cluster is discarded.
> + */
> +static bool swap_sync_discard(void)
> +{
This function discards all swap devices. I am wondering if we should just discard the current working device instead. Another device's discard is supposedly already ongoing via the work queue; we don't have to wait for that.

To unblock the current swap allocation, we only need to wait for the discard on the current swap device, the one that indicated it needs to wait for a discard. Assume you take my above suggestion.
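For example, a single-device variant of the function quoted below might look like this (a sketch only, reusing the same calls the patch already uses):

	/* Hypothetical single-device variant of swap_sync_discard(): only
	 * wait for the discard on the device we failed to allocate from. */
	static bool swap_sync_discard_one(struct swap_info_struct *si)
	{
		bool ret = false;

		if (get_swap_device_info(si)) {
			if (si->flags & SWP_PAGE_DISCARD)
				ret = swap_do_scheduled_discard(si);
			put_swap_device(si);
		}

		return ret;
	}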
> +	bool ret = false;
> +	int nid = numa_node_id();
> +	struct swap_info_struct *si, *next;
> +
> +	spin_lock(&swap_avail_lock);
> +	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
> +		spin_unlock(&swap_avail_lock);
> +		if (get_swap_device_info(si)) {
> +			if (si->flags & SWP_PAGE_DISCARD)
> +				ret = swap_do_scheduled_discard(si);
> +			put_swap_device(si);
> +		}
> +		if (ret)
> +			break;
> +		spin_lock(&swap_avail_lock);
> +	}
> +	spin_unlock(&swap_avail_lock);
> +
> +	return ret;
> +}
> +
>  /**
>   * folio_alloc_swap - allocate swap space for a folio
>   * @folio: folio we want to move to swap
> @@ -1432,11 +1452,17 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>  		}
>  	}
> +again:
>  	local_lock(&percpu_swap_cluster.lock);
>  	if (!swap_alloc_fast(&entry, order))
>  		swap_alloc_slow(&entry, order);
Here we can have a "discard" output function argument to indicate which swap device needs to be discarded.
>  	local_unlock(&percpu_swap_cluster.lock);
>
> +	if (unlikely(!order && !entry.val)) {
If you take the above suggestion, this will just check whether the "discard" device is not NULL, perform the discard on that device, and be done.
> +		if (swap_sync_discard())
> +			goto again;
> +	}
> +
>  	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
>  	if (mem_cgroup_try_charge_swap(folio, entry))
>  		goto out_free;
Chris
On Thu, Oct 9, 2025 at 5:10 AM Chris Li chrisl@kernel.org wrote:
> I suggest the allocation here detects there is a discard pending and running out of free blocks. Return there and indicate the need to discard. The caller performs the discard without holding the lock, similar to what you do with the order == 0 case.
Thanks for the suggestion. Right, that sounds even better. My initial thought was that maybe we can just remove this discard completely since it rarely helps, and if the SSD is really that slow, OOM under heavy pressure might even be an acceptable behaviour. But to make it safer, I made it do the discard only when order 0 is failing, so the code is simpler.

Let me send a V2 to handle the discard carefully to reduce the potential impact.
> > -	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > -		goto new_cluster;
>
> Assume you follow my suggestion. Change this to some function to detect if there is a pending discard on this device. Return to the caller indicating that you need a discard for this device that has a pending discard.
Checking `!list_empty(&si->discard_clusters)` should be good enough.
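Something like this trivial helper, perhaps (hypothetical, just wrapping that check):

	/* Hypothetical helper: does this device have clusters queued for
	 * discard that the caller could wait for? */
	static inline bool swap_has_pending_discard(struct swap_info_struct *si)
	{
		return !list_empty(&si->discard_clusters);
	}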
On Thu, Oct 9, 2025 at 8:33 AM Kairui Song ryncsn@gmail.com wrote:
> On Thu, Oct 9, 2025 at 5:10 AM Chris Li chrisl@kernel.org wrote:
> > I suggest the allocation here detects there is a discard pending and running out of free blocks. Return there and indicate the need to discard. The caller performs the discard without holding the lock, similar to what you do with the order == 0 case.
>
> Thanks for the suggestion. Right, that sounds even better. My initial thought was that maybe we can just remove this discard completely since it rarely helps, and if the SSD is really that slow, OOM under heavy
Your argument is that these cases happen very rarely. I agree with you on that. The follow-up question is: if that rare case does happen, are we doing the best we can in that situation? The V1 patch is not doing the best we can; it is pretty much "I don't care about the discard much, just ignore it unless order 0 failing forces our hand". As far as I can tell, properly handling the pending-discard condition is not much more complicated than your V1 patch; it might even be simpler because you don't need that order 0 failing logic any more.
> pressure might even be an acceptable behaviour. But to make it safer, I made it do the discard only when order 0 is failing, so the code is simpler.
>
> Let me send a V2 to handle the discard carefully to reduce the potential impact.
Great. Looking forward to it.
BTW, in the caller retry loop, the caller can retry the very swap device that the discard was just performed on; it does not need to retry from the very first swap device. In that regard, it is also better behavior than V1 or even the existing discard behavior, which waits for all devices to discard.
Chris
On Thu, Oct 9, 2025 at 5:10 AM Chris Li chrisl@kernel.org wrote:
> > -	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > -		goto new_cluster;
>
> Assume you follow my suggestion. Change this to some function to detect if there is a pending discard on this device. Return to the caller indicating that you need a discard for this device that has a pending discard. Add an output argument to indicate the discard device "discard" if needed.
The problem I just realized is that if we just bail out here, we are forbidding order 0 from stealing whenever there is a discarding cluster. We just return here to let the caller handle the discard outside the lock.
It may just discard the cluster just fine, then retry from free clusters. Then everything is fine, that's the easy part.
But it might also fail, and interestingly, in the failure case we need to try again as well. It might fail due to a race with another discard; in that case, order 0 steal is still feasible. Or it may fail in get_swap_device_info (we have to release the device to return here); in that case, we should go back to the plist and try other devices.
This is doable but seems kind of fragile, we'll have something like this in the folio_alloc_swap function:
	local_lock(&percpu_swap_cluster.lock);
	if (!swap_alloc_fast(&entry, order))
		swap_alloc_slow(&entry, order, &discard_si);
	local_unlock(&percpu_swap_cluster.lock);
+	if (discard_si) {
+		if (get_swap_device_info(discard_si)) {
+			swap_do_scheduled_discard(discard_si);
+			put_swap_device(discard_si);
+			/*
+			 * Ignoring the return value, since we need to try
+			 * again even if the discard failed. If failed due to
+			 * race with another discard, we should still try
+			 * order 0 steal.
+			 */
+		} else {
+			discard_si = NULL;
+			/*
+			 * If raced with swapoff, we should try again too but
+			 * not using the discard device anymore.
+			 */
+		}
+		goto again;
+	}
And on the `again` retry we'll have to always start from free_clusters again, unless we have another parameter just to indicate that we want to skip everything and jump to stealing, or pass and reuse the discard_si pointer as a return argument to cluster_alloc_swap_entry as well, as the indicator to jump directly to stealing.
It looks kind of strange. So far swap_do_scheduled_discard can only fail due to a race with another successful discard, so retrying is safe and won't run into an endless loop. But it seems easy to break, e.g. if we ever handle bio alloc failure of the discard request in the future. And trying again if get_swap_device_info failed makes no sense if there is only one device, but it has to be done here to cover multi-device usage, or we have to add more special checks.
swap_alloc_slow will be a bit longer too if we want to prevent touching plist again:

+	/*
+	 * Resuming after trying to discard cluster on a swap device,
+	 * try the discarded device first.
+	 */
+	si = *discard_si;
+	if (unlikely(si)) {
+		*discard_si = NULL;
+		if (get_swap_device_info(si)) {
+			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, &need_discard);
+			put_swap_device(si);
+			if (offset) {
+				*entry = swp_entry(si->type, offset);
+				return true;
+			}
+			if (need_discard) {
+				*discard_si = si;
+				return false;
+			}
+		}
+	}
The logic of the workflow jumping between several functions might also be kind of hard to follow. Some cleanup can be done later though.
Considering the discard issue is really rare, I'm not sure if this is the right way to go. What do you think?
BTW: the logic of V1 can be optimized a little bit to let discards also happen for order > 0 cases too. That seems closer to what the current upstream kernel is doing, except: the allocator prefers to try another device instead of waiting for discard, which seems OK? And order 0 steal can happen without waiting for discard. Fragmentation under extreme pressure might not be that serious an issue if we are having really slow SSDs, and might even no longer be an issue if we have a generic solution for frags?
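That optimization would roughly mean just dropping the order check from this patch (a sketch, untested):

	/* Hypothetical tweak to V1: let a failed allocation of any order,
	 * not just order 0, trigger the sync discard and retry. */
	if (unlikely(!entry.val)) {
		if (swap_sync_discard())
			goto again;
	}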
On Sun, Oct 12, 2025 at 9:49 AM Kairui Song ryncsn@gmail.com wrote:
> The problem I just realized is that if we just bail out here, we are forbidding order 0 from stealing whenever there is a discarding cluster. We just return here to let the caller handle the discard outside the lock.
Oh, yes, there might be a bit of a change in behavior. However, I can't see it being such a bad thing if we wait for the pending discard to complete before stealing and fragmenting the existing cluster lists. We will have fewer fragments compared to the original result. Again, my point is not that we always keep 100% of the old behavior; then there would be no room for improvement.

My point is that, are we doing the best we can in that situation, regardless of how unlikely it is?
> It may just discard the cluster just fine, then retry from free clusters. Then everything is fine, that's the easy part.
Ack.
> But it might also fail, and interestingly, in the failure case we need
Can you spell out the failure case you have in mind? Do you mean the discard did happen but another thread stole "the recently discarded then became free cluster"?
Anyway, in such a case, the swap allocator should continue and find out we don't have things to discard now, it will continue to the "steal from other order > 0 list".
> to try again as well. It might fail due to a race with another discard; in that case, order 0 steal is still feasible. Or it may fail in get_swap_device_info (we have to release the device to return here); in that case, we should go back to the plist and try other devices.
When stealing from the other order > 0 lists fails, we should try another device in the plist.
> This is doable but seems kind of fragile, we'll have something like this in the folio_alloc_swap function:
>
> 	local_lock(&percpu_swap_cluster.lock);
> 	if (!swap_alloc_fast(&entry, order))
> 		swap_alloc_slow(&entry, order, &discard_si);
> 	local_unlock(&percpu_swap_cluster.lock);
>
> +	if (discard_si) {
I feel the discard logic should be inside swap_alloc_slow(). There is a plist_for_each_entry_safe() there; inside that loop, do the discard and retry. If I previously suggested changing it here, sorry, I have changed my mind after reasoning about the code a bit more.

The fast path layer should not know about the discard, and it also should not retry the fast path after waiting for the discard to complete.

The discard should be on the slow path for sure.
> +		if (get_swap_device_info(discard_si)) {
Inside the slow path there is already a get_swap_device_info(si); you should be able to reuse that?
> +			swap_do_scheduled_discard(discard_si);
> +			put_swap_device(discard_si);
> +			/*
> +			 * Ignoring the return value, since we need to try
> +			 * again even if the discard failed. If failed due to
> +			 * race with another discard, we should still try
> +			 * order 0 steal.
> +			 */
> +		} else {
Shouldn't need the "else"; swap_alloc_slow() can always set discard_si = NULL internally if there is no device to discard, or just set discard_si = NULL regardless.
> +			discard_si = NULL;
> +			/*
> +			 * If raced with swapoff, we should try again too but
> +			 * not using the discard device anymore.
> +			 */
> +		}
> +		goto again;
> +	}
>
> And on the `again` retry we'll have to always start from free_clusters again,
That is fine, because discard causes clusters to move into free_clusters now.
> unless we have another parameter just to indicate that we want to skip everything and jump to stealing, or pass and reuse the discard_si pointer as a return argument to cluster_alloc_swap_entry as well, as the indicator to jump directly to stealing.
It is a rare case, we don't have to jump directly to stealing. If the discard happens and that discarded cluster gets stolen by other threads, I think it is fine going through the fragment list before going to the order 0 stealing from another order fragment list.
> It looks kind of strange. So far swap_do_scheduled_discard can only fail due to a race with another successful discard, so retrying is safe and won't run into an endless loop. But it seems easy to break, e.g. if we ever handle bio alloc failure of the discard request in the future. And trying again if get_swap_device_info failed makes no sense if there is only one device, but it has to be done here to cover multi-device usage, or we have to add more special checks.
Well, you can have the sync wait check for discard only kick in if >0 clusters were successfully discarded.
> swap_alloc_slow will be a bit longer too if we want to prevent touching plist again:
>
> +	/*
> +	 * Resuming after trying to discard cluster on a swap device,
> +	 * try the discarded device first.
> +	 */
> +	si = *discard_si;
> +	if (unlikely(si)) {
> +		*discard_si = NULL;
> +		if (get_swap_device_info(si)) {
> +			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, &need_discard);
> +			put_swap_device(si);
> +			if (offset) {
> +				*entry = swp_entry(si->type, offset);
> +				return true;
> +			}
> +			if (need_discard) {
> +				*discard_si = si;
> +				return false;
I haven't tried it myself, but I feel we should move the sync wait for discard here, with the lock released and then re-acquired. That might simplify the logic. The discard should belong to the slow path behavior, definitely not part of the fast path.
> +			}
> +		}
> +	}
>
> The logic of the workflow jumping between several functions might also be kind of hard to follow. Some cleanup can be done later though.
> Considering the discard issue is really rare, I'm not sure if this is the right way to go. What do you think?
Let's try moving the discard and retry inside the slow path but release the lock and see how it feels. If you want, I can also give it a try, I just don't want to step on your toes.
> BTW: the logic of V1 can be optimized a little bit to let discards also happen for order > 0 cases too. That seems closer to what the current upstream kernel is doing, except: the allocator prefers to try another device instead of waiting for discard, which seems OK?
I think we should wait for the discard. Having a discard pending means the device may have (many?) free clusters soon. We can wait; it is a rare case anyway. From the swap.tiers point of view, it would be better to exhaust the current high priority device before consuming the low priority device. Otherwise you will have a very minor swap device priority inversion for a few swap entries; those swap entries could otherwise be allocated from the discarded free cluster on the high priority swap device.
> And order 0 steal can happen without waiting for discard.
I am OK with changing the behavior to let order 0 wait for the discard as well. It happens so rarely, and we end up having less fragmented clusters compared to the alternative of stealing from higher order clusters, which is a good thing.
> Fragmentation under extreme pressure might not be that serious an issue if we are having really slow SSDs, and might even no longer be an issue if we have a generic solution for frags?
Chris
On Tue, Oct 14, 2025 at 2:27 PM Chris Li chrisl@kernel.org wrote:
> I feel the discard logic should be inside swap_alloc_slow(). There is a plist_for_each_entry_safe() there; inside that loop, do the discard and retry. If I previously suggested changing it here, sorry, I have changed my mind after reasoning about the code a bit more.
Actually, now I have given it a bit more thought, and one thing I realized is that you might need to hold the percpu_swap_cluster lock all the time during allocation. That might force you to release the lock and discard in the current position.

If that is the case, then just making the small change in your patch to allow waiting for the discard before trying the fragmentation list might be good enough.
Chris
On Wed, Oct 15, 2025 at 12:00 PM Chris Li chrisl@kernel.org wrote:
> Actually, now I have given it a bit more thought, and one thing I realized is that you might need to hold the percpu_swap_cluster lock all the time during allocation. That might force you to release the lock and discard in the current position.
>
> If that is the case, then just making the small change in your patch to allow waiting for the discard before trying the fragmentation list might be good enough.
Thanks, I was composing a reply on this and just saw your new comment. I agree with this.
On Wed, Oct 15, 2025 at 2:24 PM Kairui Song ryncsn@gmail.com wrote:
Hmm, it turns out modifying V1 to handle non-order-0 allocation failures also has some minor issues. Every mTHP swap allocation failure would have a slightly higher overhead due to the discard check. V1 is fine since it only checks discard for order 0, and order 0 allocation failure is uncommon and usually means OOM already.
I'm not saying V1 is the final solution, but I think maybe we can just keep V1 as it is? That's easier for a stable backport too, and this is doing far better than what it was like. The sync discard was added in 2013, and the percpu cluster added later the same year never treated it carefully. And the discard during allocation has been kind of broken for a while after the recent swap allocator rework.
To optimize it further in a clean way, we have to reverse the allocator's handling order of the plist and fast / slow path. Current order is local_lock -> fast -> slow (plist).
We can walk the plist first, then do the fast / slow path: plist (or maybe something faster than plist that still handles the priority) -> local_lock -> fast -> slow. (Bonus: this is more friendly to RT kernels too, I think.) That way we don't need to re-walk the plist after releasing the local_lock, which saves a lot of trouble. I remember discussing this with Youngjun some time ago on the mailing list; I know things have changed a lot, but some ideas still seem similar. I think his series moves the percpu cluster into each device (si); we can also move the local_lock there, which is just what I'm talking about here.
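Roughly like this (just a sketch of the idea; the exact field layout is made up):

	/* Hypothetical sketch: move the percpu fast path cluster and its
	 * local_lock into each swap device, so the plist walk can happen
	 * before the local lock is taken. */
	struct percpu_swap_cluster {
		unsigned long offset[SWAP_NR_ORDERS];
		local_lock_t lock;
	};

	struct swap_info_struct {
		/* ... */
		struct percpu_swap_cluster __percpu *percpu_cluster;
		/* ... */
	};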
On Wed, Oct 15, 2025 at 9:46 AM Kairui Song ryncsn@gmail.com wrote:
> Hmm, it turns out modifying V1 to handle non-order-0 allocation failures also has some minor issues. Every mTHP swap allocation failure would have a slightly higher overhead due to the discard check. V1 is fine since it only checks discard for order 0, and order 0 allocation failure is uncommon and usually means OOM already.
>
> I'm not saying V1 is the final solution, but I think maybe we can just keep V1 as it is? That's easier for a stable backport too, and this is
I am fine with that, assuming you adjust the presentation to push V1 as a hotfix. I can ack your newer patch with the adjusted presentation.
> doing far better than what it was like. The sync discard was added in 2013, and the percpu cluster added later the same year never treated it carefully. And the discard during allocation has been kind of broken for a while after the recent swap allocator rework.
>
> To optimize it further in a clean way, we have to reverse the allocator's handling order of the plist and fast / slow path. Current order is local_lock -> fast -> slow (plist).
I like that. I think that is the eventual way to go. I want to see how it integrates with the swap.tiers patches. If you let me pick, I would go straight with this one for 6.19.
> We can walk the plist first, then do the fast / slow path: plist (or maybe something faster than plist that still handles the priority) -> local_lock -> fast -> slow. (Bonus: this is more friendly to RT kernels too, I think.) That way we don't need to re-walk the plist after releasing the local_lock, which saves a lot of trouble. I remember discussing this with Youngjun some time ago on the mailing list; I know things have changed a lot, but some ideas still seem similar. I think his series moves the percpu cluster into each device (si); we can also move the local_lock there, which is just what I'm talking about here.
Ack. We will need to see both patches to figure out how to integrate them together. Right now we have two moving parts. All the more reason to get the eventual patch sooner.
Chris
On Tue, Oct 21, 2025 at 3:05 PM Chris Li chrisl@kernel.org wrote:
On Wed, Oct 15, 2025 at 9:46 AM Kairui Song ryncsn@gmail.com wrote:
On Wed, Oct 15, 2025 at 2:24 PM Kairui Song ryncsn@gmail.com wrote:
On Wed, Oct 15, 2025 at 12:00 PM Chris Li chrisl@kernel.org wrote:
On Tue, Oct 14, 2025 at 2:27 PM Chris Li chrisl@kernel.org wrote:
On Sun, Oct 12, 2025 at 9:49 AM Kairui Song ryncsn@gmail.com wrote:
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index cb2392ed8e0e..0d1924f6f495 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                 goto done;
> >         }
> >
> > -       /*
> > -        * We don't have free cluster but have some clusters in discarding,
> > -        * do discard now and reclaim them.
> > -        */
> > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > -               goto new_cluster;
>
> Assume you follow my suggestion.
> Change this to some function to detect if there is a pending discard
> on this device. Return to the caller indicating that you need a
> discard for this device that has a pending discard.
> Add an output argument to indicate the discard device "discard" if needed.
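As a rough sketch of the helper being suggested (the name and the discard_clusters check here are guesses for illustration, not code from the patch):

static bool cluster_has_pending_discard(struct swap_info_struct *si)
{
	/*
	 * Hypothetical helper: only report that a discard is pending,
	 * instead of issuing the blocking discard under the local lock.
	 * The unlocked list_empty() check is only a hint; the caller
	 * still has to take the proper locks before discarding.
	 */
	return (si->flags & SWP_PAGE_DISCARD) &&
	       !list_empty(&si->discard_clusters);
}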
The problem I just realized is that if we just bail out here, we are forbidding order 0 from stealing whenever there is any discarding cluster. We just return here to let the caller handle the discard outside the lock.
Oh, yes, there might be a bit of a behavior change. However, I can't see it as such a bad thing if we wait for the pending discard to complete before stealing from and fragmenting the existing folio list. We will have fewer fragments compared to the original result. Again, my point is not that we must always keep 100% of the old behavior; then there would be no room for improvement.
My point is: are we doing the best we can in that situation, regardless of how unlikely it is?
It may discard the cluster just fine, then retry from the free clusters. Then everything is fine; that's the easy part.
Ack.
But it might also fail, and interestingly, in the failure case we need
Can you spell out the failure case you have in mind? Do you mean the discard did happen, but another thread stole the recently discarded, now-free cluster?
Anyway, in such a case, the swap allocator should continue and find out we have nothing to discard now; it will continue to the "steal from other order > 0 lists" step.
to try again as well. It might fail due to a race with another discard; in that case, order 0 stealing is still feasible. Or it may fail in get_swap_device_info (we have to release the device to return here); in that case, we should go back to the plist and try other devices.
When stealing from the other order > 0 lists fails, we should try another device in the plist.
This is doable but seems kind of fragile; we'll have something like this in the folio_alloc_swap function:
local_lock(&percpu_swap_cluster.lock);
if (!swap_alloc_fast(&entry, order))
	swap_alloc_slow(&entry, order, &discard_si);
local_unlock(&percpu_swap_cluster.lock);
+if (discard_si) {
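Spelled out a bit more, the full shape might be something like this (a sketch only; the retry label, the NULL-ing of discard_si, and the exact discard handling are illustrative guesses, not finished code):

again:
	local_lock(&percpu_swap_cluster.lock);
	if (!swap_alloc_fast(&entry, order))
		swap_alloc_slow(&entry, order, &discard_si);
	local_unlock(&percpu_swap_cluster.lock);

	/* Do the blocking discard outside the local lock, then retry. */
	if (!entry.val && discard_si) {
		if (get_swap_device_info(discard_si)) {
			swap_do_scheduled_discard(discard_si);
			put_swap_device(discard_si);
		}
		discard_si = NULL;
		goto again;
	}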
I feel the discard logic should be inside swap_alloc_slow(). There is a plist_for_each_entry_safe() there; we can do the discard and retry inside that loop. If I previously suggested changing it here, sorry, I have changed my mind after reasoning about the code a bit more.
Actually, now that I have given it a bit more thought, one thing I realized is that you might need to hold the percpu_swap_cluster lock the whole time during allocation. That might force you to release the lock and do the discard at the current position.
If that is the case, then just making the small change in your patch to wait for the pending discard before trying the fragmentation list might be good enough.
Chris
Thanks, I was composing a reply on this and just saw your new comment. I agree with this.
Hmm, it turns out modifying V1 to handle non-order 0 allocation failure also has some minor issues. Every mTHP SWAP allocation failure will have slightly higher overhead due to the discard check. V1 is fine since it only checks discard for order 0, and order 0 allocation failure is uncommon and usually means OOM already.
I'm not saying V1 is the final solution, but I think maybe we can just keep V1 as it is? That's easier for a stable backport too, and this is
I am fine with that, assuming you adjust the presentation to push V1 as a hotfix. I can ack your newer patch with the adjusted presentation.
Thanks, I'll update it then.
doing far better than what it was like before. The sync discard was added in 2013, and the percpu cluster added later that same year never handled it carefully. And discard during allocation has been kind of broken for a while since the recent swap allocator rework.
To optimize it further in a clean way, we have to reverse the allocator's handling order of the plist and fast / slow path. Current order is local_lock -> fast -> slow (plist).
I like that. I think that is the eventual way to go. I want to see how it integrates with the swap.tiers patches. If you let me pick, I would go straight with this one for 6.19.
We can walk the plist first, then do the fast / slow path: plist (or maybe something faster than plist that still handles the priority) -> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels too, I think). That way we don't need to rewalk the plist after releasing the local_lock, which saves a lot of trouble. I remember I discussed this with Youngjun on the mailing list some time ago; I know things have changed a lot, but some ideas still seem similar. I think his series is moving the percpu cluster into each device (si); we can also move the local_lock there, which is just what I'm talking about here.
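Purely as an illustration of that direction, something like the following; all names here are made up, not from any posted series:

/*
 * Hypothetical: move the percpu fast-path cache and its local lock
 * into each swap device, so the plist walk can happen before the
 * local lock is taken.
 */
struct swap_percpu_cache {
	local_lock_t lock;
	unsigned long offset[SWAP_NR_ORDERS];	/* cached cluster offsets */
};

struct swap_info_struct {
	/* ... existing fields ... */
	struct swap_percpu_cache __percpu *percpu_cache;
};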
Ack. We will need to see both patches to figure out how to integrate them together. Right now we have two moving parts. All the more reason to get the eventual patch sooner.
BTW I found there are some minor cleanups needed, mostly trivial, I'll include them in the next update I think.
Thanks, I was composing a reply on this and just saw your new comment. I agree with this.
Hmm, it turns out modifying V1 to handle non-order 0 allocation failure also has some minor issues. Every mTHP SWAP allocation failure will have slightly higher overhead due to the discard check. V1 is fine since it only checks discard for order 0, and order 0 allocation failure is uncommon and usually means OOM already.
Looking at the original proposed patch.
+	spin_lock(&swap_avail_lock);
+	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+		spin_unlock(&swap_avail_lock);
+		if (get_swap_device_info(si)) {
+			if (si->flags & SWP_PAGE_DISCARD)
+				ret = swap_do_scheduled_discard(si);
+			put_swap_device(si);
+		}
+		if (ret)
+			break;
if ret is true and we break, wouldn’t that cause spin_unlock to run without the lock being held?
+		spin_lock(&swap_avail_lock);
+	}
+	spin_unlock(&swap_avail_lock);	<- unlocked without lock grab.
+
+	return ret;
+}
I'm not saying V1 is the final solution, but I think maybe we can just keep V1 as it is? That's easier for a stable backport too, and this is doing far better than what it was like before. The sync discard was added in 2013, and the percpu cluster added later that same year never handled it carefully. And discard during allocation has been kind of broken for a while since the recent swap allocator rework.
To optimize it further in a clean way, we have to reverse the allocator's handling order of the plist and fast / slow path. Current order is local_lock -> fast -> slow (plist). We can walk the plist first, then do the fast / slow path: plist (or maybe something faster than plist that still handles the priority) -> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
I think the idea is good, but when approaching it that way, I am curious about rotation handling.
In the current code, rotation is always done when traversing the plist in the slow path. If we traverse the plist first, how should rotation be handled?
1. Do a naive rotation at plist traversal time.
   (But then the fast path might allocate from an si we didn't select.)
2. Rotate when allocating in the slow path.
   (But after releasing swap_avail_lock, we might access an si that wasn't rotated.)
Both cases could break rotation behavior — what do you think?
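For reference, the current slow path rotates devices roughly like this (simplified from mm/swapfile.c, error handling omitted):

	spin_lock(&swap_avail_lock);
	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
		/*
		 * Move the device to the tail of its priority tier, so
		 * devices with equal priority are used round-robin.
		 */
		plist_requeue(&si->avail_lists[nid], &swap_avail_heads[nid]);
		spin_unlock(&swap_avail_lock);
		/* ... try allocating from si ... */
		spin_lock(&swap_avail_lock);
	}
	spin_unlock(&swap_avail_lock);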
too, I think). That way we don't need to rewalk the plist after releasing the local_lock, which saves a lot of trouble. I remember I discussed this with Youngjun on the mailing list some time ago; I know
Recapping your earlier idea: cache only the swap device per cgroup in percpu, and keep the cluster inside the swap device. Applied to swap tiers, cache only the percpu si per tier, and keep the cluster in the swap device. This seems to fit well with your previous suggestion.
However, since we shifted from per-cgroup swap priority to swap tiers, and will re-submit an RFC for swap tiers, we'll need to revisit the discussion.
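A purely hypothetical sketch of that caching scheme under swap tiers (every name below is made up for illustration):

/* Per-CPU hint: the device this CPU last allocated from in a tier. */
struct swap_tier_cache {
	struct swap_info_struct *si;
};

struct swap_tier {
	struct plist_head devices;		/* devices in this tier */
	struct swap_tier_cache __percpu *cache;	/* percpu device hint */
	/* cluster caches stay inside each swap device (si) */
};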
Youngjun Park
On Tue, Oct 21, 2025 at 3:34 PM YoungJun Park youngjun.park@lge.com wrote:
Thanks, I was composing a reply on this and just saw your new comment. I agree with this.
Hmm, it turns out modifying V1 to handle non-order 0 allocation failure also has some minor issues. Every mTHP SWAP allocation failure will have slightly higher overhead due to the discard check. V1 is fine since it only checks discard for order 0, and order 0 allocation failure is uncommon and usually means OOM already.
Looking at the original proposed patch.
spin_lock(&swap_avail_lock);
plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
	spin_unlock(&swap_avail_lock);
	if (get_swap_device_info(si)) {
		if (si->flags & SWP_PAGE_DISCARD)
			ret = swap_do_scheduled_discard(si);
		put_swap_device(si);
	}
	if (ret)
		break;

If ret is true and we break, wouldn’t that cause spin_unlock to run without the lock being held?
Thanks for catching this! Right, I need to return directly instead of break. I've fixed that.
	spin_lock(&swap_avail_lock);
}
spin_unlock(&swap_avail_lock);	<- unlocked without lock grab.

return ret;
}
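To make the fix concrete, a sketch assuming it is literally "return instead of break" (not necessarily the committed code):

static bool swap_sync_discard(void)
{
	bool ret = false;
	int nid = numa_node_id();
	struct swap_info_struct *si, *next;

	spin_lock(&swap_avail_lock);
	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
		spin_unlock(&swap_avail_lock);
		if (get_swap_device_info(si)) {
			if (si->flags & SWP_PAGE_DISCARD)
				ret = swap_do_scheduled_discard(si);
			put_swap_device(si);
		}
		/* Return directly: the lock is not held here. */
		if (ret)
			return true;
		spin_lock(&swap_avail_lock);
	}
	spin_unlock(&swap_avail_lock);

	return false;
}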
I'm not saying V1 is the final solution, but I think maybe we can just keep V1 as it is? That's easier for a stable backport too, and this is doing far better than what it was like before. The sync discard was added in 2013, and the percpu cluster added later that same year never handled it carefully. And discard during allocation has been kind of broken for a while since the recent swap allocator rework.
To optimize it further in a clean way, we have to reverse the allocator's handling order of the plist and fast / slow path. Current order is local_lock -> fast -> slow (plist). We can walk the plist first, then do the fast / slow path: plist (or maybe something faster than plist that still handles the priority) -> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
I think the idea is good, but when approaching it that way, I am curious about rotation handling.
In the current code, rotation is always done when traversing the plist in the slow path. If we traverse the plist first, how should rotation be handled?
That's a very good question, things always get tricky when it comes to the details...
1. Do a naive rotation at plist traversal time.
   (But then the fast path might allocate from an si we didn't select.)
2. Rotate when allocating in the slow path.
   (But after releasing swap_avail_lock, we might access an si that wasn't rotated.)
Both cases could break rotation behavior — what do you think?
I think cluster-level rotation is better; it prevents things from getting too fragmented and spreads the workload between devices in a helpful way, but that's just my guess.
We can change the rotation behavior if the test shows some other strategy is better.
Maybe we'll need something with a better design, like an alloc counter for rotation. And if we look at the plist before the fast path, we may need to do some optimization for the plist lock too...
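One possible shape for such a counter, purely hypothetical (the field, the batch size, and the placement are all made up):

/*
 * Hypothetical rotation counter: requeue the device to the tail of
 * its priority tier only every N successful allocations, instead of
 * on every slow-path plist walk.
 */
#define SWAP_ROTATE_BATCH	64

	if (atomic_inc_return(&si->rotate_count) % SWAP_ROTATE_BATCH == 0) {
		spin_lock(&swap_avail_lock);
		plist_requeue(&si->avail_lists[nid], &swap_avail_heads[nid]);
		spin_unlock(&swap_avail_lock);
	}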
On Tue, 07 Oct 2025 04:02:32 +0800 Kairui Song ryncsn@gmail.com wrote:
A few cleanups and a bugfix that are either suitable after the swap table phase I or found during code review.
Patch 1 is a bugfix and needs to be included in the stable branch, the rest have no behavior change.
fyi, the presentation of the series suggests that [1/4] is not a hotfix - that it won't hit mainline (and then -stable) until after 6.19-rc1.
Which sounds OK given this:
So far, no issues have been observed or reported with typical SSD setups under months of high pressure. This issue was found during my code review. But by hacking the kernel a bit: adding a mdelay(100) in the async discard path, this issue will be observable with WARNING triggered by the wrong GFP and cond_resched in the bio layer.