On Tue, Oct 21, 2025 at 3:34 PM YoungJun Park youngjun.park@lge.com wrote:
Thanks, I was composing a reply on this and just saw your new comment. I agree with this.
Hmm, it turns out modifying V1 to handle non-order 0 allocation failure also has some minor issues. Every mTHP SWAP allocation failure will have a slight higher overhead due to the discard check. V1 is fine since it only checks discard for order 0, and order 0 alloc failure is uncommon and usually means OOM already.
Looking at the original proposed patch.
spin_lock(&swap_avail_lock);plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {spin_unlock(&swap_avail_lock);if (get_swap_device_info(si)) {if (si->flags & SWP_PAGE_DISCARD)ret = swap_do_scheduled_discard(si);put_swap_device(si);}if (ret)break;if ret is true and we break, wouldn’t that cause spin_unlock to run without the lock being held?
Thanks for catching this! Right, I need to return directly instead of break. I've fixed that.
spin_lock(&swap_avail_lock);}spin_unlock(&swap_avail_lock); <- unlocked without lock grab.return ret;+}
I'm not saying V1 is the final solution, but I think maybe we can just keep V1 as it is? That's easier for a stable backport too, and this is doing far better than what it was like. The sync discard was added in 2013 and the later added percpu cluster at the same year never treated it carefully. And the discard during allocation after recent swap allocator rework has been kind of broken for a while.
To optimize it further in a clean way, we have to reverse the allocator's handling order of the plist and fast / slow path. Current order is local_lock -> fast -> slow (plist). We can walk the plist first, then do the fast / slow path: plist (or maybe something faster than plist but handles the priority) -> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
I think the idea is good, but when approaching it that way, I am curious about rotation handling.
In the current code, rotation is always done when traversing the plist in the slow path. If we traverse the plist first, how should rotation be handled?
That's a very good question, things always get tricky when it comes to the details...
- Do a naive rotation at plist traversal time.
(But then fast path might allocate from an si we didn’t select.) 2. Rotate when allocating in the slow path. (But between releasing swap_avail_lock, we might access an si that wasn’t rotated.)
Both cases could break rotation behavior — what do you think?
I think cluster level rotating is better, it prevents things from going too fragmented and spreads the workload between devices in a helpful way, but just my guess.
We can change the rotation behavior if the test shows some other strategy is better.
Maybe we'll need something with a better design, like a alloc counter for rotation. And if we look at the plist before the fast path we may need to do some optimization for the plist lock too...