There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
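The shape of the fix, in sketch form (the diff below is authoritative):

	si = get_swap_device(entry);
	if (si) {
		/* swapoff() is stalled; si and si->swap_map stay valid here */
		...
		put_swap_device(si);
	}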
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat...
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
Thanks, Ryan
 mm/swapfile.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350..f580e6abc674 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
-		goto put_out;
+		goto bad_offset;
+	if (data_race(!si->swap_map[swp_offset(entry)]))
+		goto bad_free;
 
 	return si;
 bad_nofile:
@@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 out:
 	return NULL;
 put_out:
-	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
 	percpu_ref_put(&si->users);
 	return NULL;
+bad_offset:
+	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+	goto put_out;
+bad_free:
+	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
+	goto put_out;
 }
 
 static unsigned char __swap_entry_free(struct swap_info_struct *p,
@@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
 	if (non_swap_entry(entry))
 		return 1;
 
-	p = _swap_info_get(entry);
+	p = get_swap_device(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
 		if (count == SWAP_HAS_CACHE &&
 		    !swap_page_trans_huge_swapped(p, entry))
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
+		put_swap_device(p);
 	}
 
 	return p != NULL;
 }
-- 
2.25.1
On 05.03.24 16:13, Ryan Roberts wrote:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
Almost
"s/Hilenbrand/Hildenbrand/" :)
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat...
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
Thanks, Ryan
 mm/swapfile.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350..f580e6abc674 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
-		goto put_out;
+		goto bad_offset;
+	if (data_race(!si->swap_map[swp_offset(entry)]))
+		goto bad_free;
 
 	return si;
 bad_nofile:
@@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 out:
 	return NULL;
 put_out:
-	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
 	percpu_ref_put(&si->users);
 	return NULL;
+bad_offset:
+	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+	goto put_out;
+bad_free:
+	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
+	goto put_out;
 }
 
 static unsigned char __swap_entry_free(struct swap_info_struct *p,
@@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
 	if (non_swap_entry(entry))
 		return 1;
 
-	p = _swap_info_get(entry);
+	p = get_swap_device(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
 		if (count == SWAP_HAS_CACHE &&
 		    !swap_page_trans_huge_swapped(p, entry))
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
+		put_swap_device(p);
 	}
 
 	return p != NULL;
 }
-- 
2.25.1
LGTM
Are you planning on sending a doc extension for get_swap_device()?
On 05/03/2024 15:50, David Hildenbrand wrote:
On 05.03.24 16:13, Ryan Roberts wrote:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
Almost
"s/Hilenbrand/Hildenbrand/" :)
Ahh sorry... I even explicitly checked it against your name on emails... fat fingers...
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat...
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
Thanks, Ryan
 mm/swapfile.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350..f580e6abc674 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
-		goto put_out;
+		goto bad_offset;
+	if (data_race(!si->swap_map[swp_offset(entry)]))
+		goto bad_free;
 
 	return si;
 bad_nofile:
@@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 out:
 	return NULL;
 put_out:
-	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
 	percpu_ref_put(&si->users);
 	return NULL;
+bad_offset:
+	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+	goto put_out;
+bad_free:
+	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
+	goto put_out;
 }
 
 static unsigned char __swap_entry_free(struct swap_info_struct *p,
@@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
 	if (non_swap_entry(entry))
 		return 1;
 
-	p = _swap_info_get(entry);
+	p = get_swap_device(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
 		if (count == SWAP_HAS_CACHE &&
 		    !swap_page_trans_huge_swapped(p, entry))
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
+		put_swap_device(p);
 	}
 
 	return p != NULL;
 }
-- 
2.25.1
LGTM
Are you planning on sending a doc extension for get_swap_device()?
I saw your comment:
We should likely update the documentation of get_swap_device(), that after decrementing the refcount, the SI might become stale and should not be touched without a prior get_swap_device().
But when I went to make the changes, I saw the documentation already said:
...we need to enclose all swap related functions with get_swap_device() and put_swap_device()... Notice that swapoff ... can still happen before the percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in put_swap_device()... The caller must be prepared for that.
I thought that already covered it? I'm sure as usual, I've misunderstood your point. Happy to respin if you have something in mind?
On 05.03.24 17:33, Ryan Roberts wrote:
On 05/03/2024 15:50, David Hildenbrand wrote:
On 05.03.24 16:13, Ryan Roberts wrote:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
Almost
"s/Hilenbrand/Hildenbrand/" :)
Ahh sorry... I even explicitly checked it against your name on emails... fat fingers...
No need to be sorry. Even your average German person would get it wrong, because there are other (more common) variants :)
[...]
LGTM
Are you planning on sending a doc extension for get_swap_device()?
I saw your comment:
We should likely update the documentation of get_swap_device(), that after decrementing the refcount, the SI might become stale and should not be touched without a prior get_swap_device().
But when I went to make the changes, I saw the documentation already said:
...we need to enclose all swap related functions with get_swap_device() and put_swap_device()... Notice that swapoff ... can still happen before the percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in put_swap_device()... The caller must be prepared for that.
I thought that already covered it? I'm sure as usual, I've misunderstood your point. Happy to respin if you have something in mind?
No need to respin, we could clarify on top, if we decide it makes sense.
I was thinking about something like this, making it clearer that the PTL discussion above does not express the corner case we discovered:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350b..646a436581eee 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
  * with get_swap_device() and put_swap_device(), unless the swap
  * functions call get/put_swap_device() by themselves.
  *
+ * Note that when only holding the PTL, swapoff might succeed immediately
+ * after freeing a swap entry. Therefore, immediately after
+ * __swap_entry_free(), the swap info might become stale and should not
+ * be touched without a prior get_swap_device().
+ *
  * Check whether swap entry is valid in the swap device. If so,
  * return pointer to swap_info_struct, and keep the swap entry valid
  * via preventing the swap device from being swapoff, until
David Hildenbrand david@redhat.com writes:
On 05.03.24 17:33, Ryan Roberts wrote:
On 05/03/2024 15:50, David Hildenbrand wrote:
On 05.03.24 16:13, Ryan Roberts wrote:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
Almost
"s/Hilenbrand/Hildenbrand/" :)
Ahh sorry... I even explicitly checked it against your name on emails... fat fingers...
No need to be sorry. Even your average German person would get it wrong, because there are other (more common) variants :)
[...]
LGTM
Are you planning on sending a doc extension for get_swap_device()?
I saw your comment:
We should likely update the documentation of get_swap_device(), that after decrementing the refcount, the SI might become stale and should not be touched without a prior get_swap_device().
But when I went to make the changes, I saw the documentation already said:
...we need to enclose all swap related functions with get_swap_device() and put_swap_device()... Notice that swapoff ... can still happen before the percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in put_swap_device()... The caller must be prepared for that.
I thought that already covered it? I'm sure as usual, I've misunderstood your point. Happy to respin if you have something in mind?
No need to respin, we could clarify on top, if we decide it makes sense.
I was thinking about something like this, making it clearer that the PTL discussion above does not express the corner case we discovered:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350b..646a436581eee 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
  * with get_swap_device() and put_swap_device(), unless the swap
  * functions call get/put_swap_device() by themselves.
  *
+ * Note that when only holding the PTL, swapoff might succeed immediately
+ * after freeing a swap entry. Therefore, immediately after
+ * __swap_entry_free(), the swap info might become stale and should not
+ * be touched without a prior get_swap_device().
+ *
  * Check whether swap entry is valid in the swap device. If so,
  * return pointer to swap_info_struct, and keep the swap entry valid
  * via preventing the swap device from being swapoff, until
LGTM. Thanks!
--
Best Regards,
Huang, Ying
On 05/03/2024 22:05, David Hildenbrand wrote:
On 05.03.24 17:33, Ryan Roberts wrote:
On 05/03/2024 15:50, David Hildenbrand wrote:
On 05.03.24 16:13, Ryan Roberts wrote:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
Almost
"s/Hilenbrand/Hildenbrand/" :)
Ahh sorry... I even explicitly checked it against your name on emails... fat fingers...
No need to be sorry. Even your average German person would get it wrong, because there are other (more common) variants :)
[...]
LGTM
Are you planning on sending a doc extension for get_swap_device()?
I saw your comment:
We should likely update the documentation of get_swap_device(), that after decrementing the refcount, the SI might become stale and should not be touched without a prior get_swap_device().
But when I went to make the changes, I saw the documentation already said:
...we need to enclose all swap related functions with get_swap_device() and put_swap_device()... Notice that swapoff ... can still happen before the percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in put_swap_device()... The caller must be prepared for that.
I thought that already covered it? I'm sure as usual, I've misunderstood your point. Happy to respin if you have something in mind?
No need to respin, we could clarify on top, if we decide it makes sense.
I was thinking about something like this, making it clearer that the PTL discussion above does not express the corner case we discovered:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350b..646a436581eee 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
  * with get_swap_device() and put_swap_device(), unless the swap
  * functions call get/put_swap_device() by themselves.
  *
+ * Note that when only holding the PTL, swapoff might succeed immediately
+ * after freeing a swap entry. Therefore, immediately after
+ * __swap_entry_free(), the swap info might become stale and should not
+ * be touched without a prior get_swap_device().
Ah yes, this is useful. I'm going to have to respin anyway, so will include it in the next version. Thanks!
+ *
  * Check whether swap entry is valid in the swap device. If so,
  * return pointer to swap_info_struct, and keep the swap entry valid
  * via preventing the swap device from being swapoff, until
Hi Ryan,
On Tue, Mar 05, 2024 at 03:13:49PM +0000, Ryan Roberts wrote:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Unfortunately, I found one, testing current mm-everything:
[ 189.420777] get_swap_device: Unused swap offset entry 000641ae
[ 189.426648] ------------[ cut here ]------------
[ 189.431290] WARNING: CPU: 3 PID: 369 at mm/swapfile.c:1301 get_swap_device+0x2da/0x3f0
[ 189.439242] CPU: 3 PID: 369 Comm: cachish Not tainted 6.8.0-rc5-00527-g19d98776f227-dirty #30
[ 189.447791] Hardware name: Micro-Star International Co., Ltd. MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.80 12/25/2019
[ 189.457998] RIP: 0010:get_swap_device+0x2da/0x3f0
[ 189.462721] Code: a8 03 75 2a 65 48 ff 08 e9 36 ff ff ff 4c 89 e9 48 c7 c2 40 fd 91 83 48 c7 c6 c0 f9 91 83 48 c7 c7 60 ee 91 83 e8 26 2f af ff <0f> 0b eb af 4c 8d 6b 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48
[ 189.481497] RSP: 0000:ffffc90000cff8a8 EFLAGS: 00010282
[ 189.486760] RAX: 0000000000000032 RBX: ffff8881262eee00 RCX: 0000000000000000
[ 189.493909] RDX: 0000000000000001 RSI: ffffffff83a1e620 RDI: 0000000000000001
[ 189.501054] RBP: 1ffff9200019ff15 R08: 0000000000000001 R09: fffff5200019fee1
[ 189.508202] R10: ffffc90000cff70f R11: 0000000000000001 R12: ffffc900018d51ae
[ 189.515346] R13: 00000000000641ae R14: 0000000000000000 R15: 00000000000641af
[ 189.522494] FS: 00007f7120263680(0000) GS:ffff88841c380000(0000) knlGS:0000000000000000
[ 189.530591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 189.536373] CR2: 00007f6e659a2ea3 CR3: 0000000046860004 CR4: 00000000003706f0
[ 189.543516] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 189.550661] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 189.557811] Call Trace:
[ 189.560276]  <TASK>
[ 189.562393]  ? __warn+0xc4/0x250
[ 189.565647]  ? get_swap_device+0x2da/0x3f0
[ 189.569761]  ? report_bug+0x348/0x440
[ 189.573444]  ? handle_bug+0x6d/0x90
[ 189.576951]  ? exc_invalid_op+0x13/0x40
[ 189.580810]  ? asm_exc_invalid_op+0x16/0x20
[ 189.585019]  ? get_swap_device+0x2da/0x3f0
[ 189.589142]  ? get_swap_device+0x2da/0x3f0
[ 189.593255]  ? __pfx_get_swap_device+0x10/0x10
[ 189.597717]  __read_swap_cache_async+0x9f/0x630
[ 189.602281]  ? __pfx___read_swap_cache_async+0x10/0x10
[ 189.607439]  ? __mod_memcg_lruvec_state+0x238/0x4f0
[ 189.612344]  ? __pfx_swp_swap_info+0x10/0x10
[ 189.616652]  swap_cluster_readahead+0x2cd/0x510
[ 189.621206]  ? __pfx_swap_cluster_readahead+0x10/0x10
[ 189.626279]  ? swap_cache_get_folio+0xcd/0x360
[ 189.630760]  ? __count_memcg_events+0x10a/0x370
[ 189.635318]  shmem_swapin_folio+0x2f2/0xc60
[ 189.639525]  ? __pfx__raw_spin_lock+0x10/0x10
[ 189.643908]  ? __pte_offset_map+0x19/0x1d0
[ 189.648024]  shmem_get_folio_gfp+0x307/0xe30
[ 189.652323]  ? __schedule+0x9f0/0x1fe0
[ 189.656096]  ? __pfx_shmem_get_folio_gfp+0x10/0x10
[ 189.660923]  ? filemap_map_pages+0x999/0xe60
[ 189.665211]  shmem_fault+0x1d9/0x810
[ 189.668834]  ? __pfx_shmem_fault+0x10/0x10
[ 189.672954]  ? __pfx_filemap_map_pages+0x10/0x10
[ 189.677590]  __do_fault+0xed/0x390
[ 189.681012]  __handle_mm_fault+0x1ba1/0x2e80
[ 189.685297]  ? __pfx___handle_mm_fault+0x10/0x10
[ 189.689933]  ? __pfx_down_read_trylock+0x10/0x10
[ 189.694570]  ? __pfx_hrtimer_nanosleep+0x10/0x10
[ 189.699215]  handle_mm_fault+0xe0/0x560
[ 189.703074]  ? __pfx_restore_fpregs_from_fpstate+0x10/0x10
[ 189.708620]  do_user_addr_fault+0x2ba/0x9d0
[ 189.712828]  exc_page_fault+0x54/0x90
[ 189.716508]  asm_exc_page_fault+0x22/0x30
[ 189.720535] RIP: 0033:0x5640dc2d72b5
[ 189.724131] Code: 98 48 ba 00 00 00 00 03 00 00 00 48 89 d6 48 2b 75 a0 ba 00 00 00 00 48 f7 f6 48 89 d1 48 89 ca 48 8b 45 a0 48 01 d0 48 01 d8 <0f> b6 00 bf e8 03 00 00 e8 1e fe ff ff 48 83 45 a8 01 48 8d 45 d0
[ 189.742922] RSP: 002b:00007ffc227e3f60 EFLAGS: 00010206
[ 189.748165] RAX: 00007f6e659a2ea3 RBX: 00007f6e2007e000 RCX: 0000000045924ea3
[ 189.755311] RDX: 0000000045924ea3 RSI: 0000000300000000 RDI: 00007f71202586a0
[ 189.762483] RBP: 00007ffc227e3fe0 R08: 00007f7120258074 R09: 00007f71202580a0
[ 189.769633] R10: 0000000000019458 R11: 00000000008aa400 R12: 0000000000000000
[ 189.776781] R13: 00007ffc227e4128 R14: 00007f712029d000 R15: 00005640dc2d9dd8
[ 189.783928]  </TASK>
[ 189.786126] ---[ end trace 0000000000000000 ]---
[ 285.827888] get_swap_device: Unused swap offset entry 0018403f
[ 320.699306] get_swap_device: Unused swap offset entry 000b001b
[ 354.031339] get_swap_device: Unused swap offset entry 000681a9
[ 364.958435] get_swap_device: Unused swap offset entry 001f4055
[ 364.976235] get_swap_device: Unused swap offset entry 001f4057
[ 365.530174] get_swap_device: Unused swap offset entry 000d415c
[ 394.223792] get_swap_device: Unused swap offset entry 001540d0
[ 394.317299] get_swap_device: Unused swap offset entry 000341d9
[ 394.341727] get_swap_device: Unused swap offset entry 0006c07e
[ 396.062365] get_swap_device: Unused swap offset entry 000541a4
[ 396.068262] get_swap_device: Unused swap offset entry 000541a7
[ 402.629551] get_swap_device: Unused swap offset entry 00294021
[ 436.740622] get_swap_device: Unused swap offset entry 00334155
[ 436.758527] get_swap_device: Unused swap offset entry 001b417c
swap_cluster_readahead() calls __read_swap_cache_async() on a range of made-up swap entries around the faulting slot. The device and the range (si->max) are valid, but the specific entry might not be in use. __read_swap_cache_async() instead relies on swapcache_prepare() returning -ENOENT to catch this and skip gracefully.
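Paraphrasing the relevant loop in __read_swap_cache_async() (sketch, not the exact code):

	for (;;) {
		/*
		 * swapcache_prepare() tries to set SWAP_HAS_CACHE; it fails
		 * with -ENOENT when the slot is not in use at all, which is
		 * exactly what readahead's made-up entries can hit.
		 */
		err = swapcache_prepare(entry);
		if (!err)
			break;			/* we own the cache slot */
		if (err != -EEXIST)
			goto fail_put_swap;	/* e.g. -ENOENT: skip quietly */
		/* -EEXIST: someone else is adding it; back off and retry */
		schedule_timeout_uninterruptible(1);
	}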
Confirmed that reverting the patch makes the warnings go away.
On 06/03/2024 02:19, Johannes Weiner wrote:
Hi Ryan,
On Tue, Mar 05, 2024 at 03:13:49PM +0000, Ryan Roberts wrote:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Unfortunately, I found one, testing current mm-everything:
[...]
swap_cluster_readahead() calls __read_swap_cache_async() on a range of made-up swap entries around the faulting slot. The device and the range (si->max) are valid, but the specific entry might not be in use. __read_swap_cache_async() instead relies on swapcache_prepare() returning -ENOENT to catch this and skip gracefully.
Sorry this got in your way, but thanks for the report! Clearly I didn't review the call sites thoroughly enough. I'll move the extra check to free_swap_and_cache(), retest and hopefully repost later today.
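Roughly what I have in mind (untested sketch; the respin is authoritative):

	p = get_swap_device(entry);
	if (p) {
		/*
		 * Here the entry originates from a pte or shmem mapping we
		 * are tearing down, so it should still be in use; warn and
		 * bail out if not.
		 */
		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
			put_swap_device(p);
			return 0;
		}
		count = __swap_entry_free(p, entry);
		...
		put_swap_device(p);
	}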
Confirmed that reverting the patch makes the warnings go away.
Ryan Roberts ryan.roberts@arm.com writes:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
I think that this can be simplified. Even for a 4K folio, this could happen.
CPU0                                   CPU1
----                                   ----
zap_pte_range
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         /* free si->swap_map[] */
    swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat...
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
Thanks, Ryan
 mm/swapfile.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350..f580e6abc674 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
-		goto put_out;
+		goto bad_offset;
+	if (data_race(!si->swap_map[swp_offset(entry)]))
+		goto bad_free;
 
 	return si;
 bad_nofile:
@@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 out:
 	return NULL;
 put_out:
-	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
 	percpu_ref_put(&si->users);
 	return NULL;
+bad_offset:
+	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+	goto put_out;
+bad_free:
+	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
+	goto put_out;
 }
I don't think that it's a good idea to warn for bad free entries. get_swap_device() could be called without enough locking to prevent parallel swap operations on the entry, so false positives are possible there. I think that it's good to add more checks in general; for example, in free_swap_and_cache() we can check more, because we are sure the swap entry will not be freed by parallel swap operations. Just don't put the check in the general get_swap_device(). We can add another helper to check that.
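For example, something like the below (name and exact shape illustrative only):

	/*
	 * Like get_swap_device(), but additionally require that the entry
	 * is currently in use.  Only for callers holding enough locks to
	 * prevent the entry from being freed in parallel.
	 */
	static struct swap_info_struct *get_swap_device_used(swp_entry_t entry)
	{
		struct swap_info_struct *si = get_swap_device(entry);

		if (si && data_race(!si->swap_map[swp_offset(entry)])) {
			pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
			put_swap_device(si);
			return NULL;
		}
		return si;
	}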
I found that there are other checks in get_swap_device() already. I think that we may need to revert

commit 23b230ba8ac3 ("mm/swap: print bad swap offset entry in get_swap_device")

which introduced it, and add checks in the appropriate places.
--
Best Regards,
Huang, Ying
 
 static unsigned char __swap_entry_free(struct swap_info_struct *p,
@@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
 	if (non_swap_entry(entry))
 		return 1;
 
-	p = _swap_info_get(entry);
+	p = get_swap_device(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
 		if (count == SWAP_HAS_CACHE &&
 		    !swap_page_trans_huge_swapped(p, entry))
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
+		put_swap_device(p);
 	}
 
 	return p != NULL;
 }
-- 
2.25.1
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
I think that this can be simplified. Even for a 4K folio, this could happen.
CPU0                                   CPU1
----                                   ----
zap_pte_range
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         /* free si->swap_map[] */
    swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
Sorry for jumping into the discussion here. IMHO, free_swap_and_cache is called with the pte lock held. So synchronize_rcu (called by swapoff) will wait for zap_pte_range to release the pte lock, and this theoretical problem can't happen. Or am I missing something?
CPU0                                   CPU1
----                                   ----
zap_pte_range
  pte_offset_map_lock -- spin_lock is held.
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         percpu_ref_kill(&p->users);
    swap_page_trans_huge_swapped
  pte_unmap_unlock -- spin_lock is released.
                                         synchronize_rcu();
                                         --> Will wait for pte_unmap_unlock
                                             to be called?
                                         /* free si->swap_map[] */
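For reference, the swapoff() side being discussed looks roughly like this (paraphrasing mm/swapfile.c; comments mine):

	/* in swapoff, after try_to_unuse() has returned: */
	percpu_ref_kill(&p->users);	/* new get_swap_device() calls now fail */
	synchronize_rcu();
	wait_for_completion(&p->comp);	/* drain remaining put_swap_device() users */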
Thanks.
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
I think that this can be simplified. Even for a 4K folio, this could happen.
CPU0                                   CPU1
----                                   ----
zap_pte_range
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         /* free si->swap_map[] */
    swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
Sorry for jumping into the discussion here. IMHO, free_swap_and_cache is called with the pte lock held.
I don't believe it has the PTL when called by shmem.
So synchronize_rcu (called by swapoff) will wait for zap_pte_range to release the pte lock, and this theoretical problem can't happen. Or am I missing something?
For Huang Ying's example, I agree this can't happen because try_to_unuse() will be waiting for the PTL (see the reply I just sent).
CPU0                                   CPU1
----                                   ----
zap_pte_range
  pte_offset_map_lock -- spin_lock is held.
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         percpu_ref_kill(&p->users);
    swap_page_trans_huge_swapped
  pte_unmap_unlock -- spin_lock is released.
                                         synchronize_rcu();
                                         --> Will wait for pte_unmap_unlock
                                             to be called?
Perhaps you can educate me here; I thought that synchronize_rcu() will only wait for RCU critical sections to complete. The PTL is a spin lock, so why would synchronize_rcu() wait for the PTL to become unlocked?
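In other words, I'd have expected the guarantee to cover only read-side critical sections (illustrative):

	rcu_read_lock();
	/* a concurrent synchronize_rcu() must wait for this section... */
	rcu_read_unlock();

	spin_lock(ptl);
	/* ...but must it also wait for this one? */
	spin_unlock(ptl);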
/* free si->swap_map[] */
Thanks.
On 2024/3/6 17:31, Ryan Roberts wrote:
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
I think that this can be simplified. Even for a 4K folio, this could happen.
CPU0                                   CPU1
----                                   ----
zap_pte_range
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         /* free si->swap_map[] */
    swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
Sorry for jumping into the discussion here. IMHO, free_swap_and_cache is called with the pte lock held.
I don't believe it has the PTL when called by shmem.
In the case of shmem, folio_lock is used to guard against the race.
So synchronize_rcu (called by swapoff) will wait for zap_pte_range to release the pte lock, and this theoretical problem can't happen. Or am I missing something?
For Huang Ying's example, I agree this can't happen because try_to_unuse() will be waiting for the PTL (see the reply I just sent).
Do you mean the below message? " I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is called (per David, above), which is called after swap_page_trans_huge_swapped() has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero until after CPU0 has completed accessing si->swap_map, so if swapoff starts where you have put it, it would get stalled waiting for the PTL which CPU0 has. "
I agree try_to_unuse() will wait for si->inuse_pages to become zero. But why will it wait for the PTL? It seems the PTL is not used to protect si->inuse_pages. Or am I missing something?
CPU0                                   CPU1
----                                   ----
zap_pte_range
  pte_offset_map_lock -- spin_lock is held.
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         percpu_ref_kill(&p->users);
    swap_page_trans_huge_swapped
  pte_unmap_unlock -- spin_lock is released.
                                         synchronize_rcu();
                                         --> Will wait for pte_unmap_unlock
                                             to be called?
Perhaps you can educate me here; I thought that synchronize_rcu() will only wait for RCU critical sections to complete. The PTL is a spin lock, so why would synchronize_rcu() wait for the PTL to become unlocked?
I assume the PTL will always disable preemption, which holds off a grace period until the PTL is released. But this might be fragile and I'm not really sure; I might be wrong.
Thanks.
/* free si->swap_map[] */
Thanks.
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 17:31, Ryan Roberts wrote:
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was valid. This wasn't present in get_swap_device() so I've added it. I couldn't find any existing get_swap_device() call sites where this extra check would cause any false alarms.
Details of how to provoke one possible issue (thanks to David Hilenbrand for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still referenced by swap entries.
Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff from succeeding after process 2 reclaimed the swap cache but before process 1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
I think that this can be simplified. Even for a 4K folio, this could happen.
CPU0                                   CPU1
----                                   ----
zap_pte_range
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         /* free si->swap_map[] */
    swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
Sorry for jumping into the discussion here. IMHO, free_swap_and_cache is called with the pte lock held.
I don't believe it has the PTL when called by shmem.
In the case of shmem, folio_lock is used to guard against the race.
I don't find that the folio is locked for shmem. find_lock_entries() will only lock the folio if (!xa_is_value()), that is, if it is not a swap entry. Can you point out where the folio is locked for shmem?
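For reference, find_lock_entries() does roughly this (paraphrased from mm/filemap.c):

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
		if (!xa_is_value(folio)) {
			/* a real folio: take the folio lock */
			if (!folio_trylock(folio))
				goto put;
			...
		}
		/* value entries (swap entries) are returned with no folio lock */
		...
	}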
--
Best Regards,
Huang, Ying
So synchronize_rcu (called by swapoff) will wait for zap_pte_range to release the pte lock, and this theoretical problem can't happen. Or am I missing something?
For Huang Ying's example, I agree this can't happen because try_to_unuse() will be waiting for the PTL (see the reply I just sent).
Do you mean the below message? " I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is called (per David, above), which is called after swap_page_trans_huge_swapped() has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero until after CPU0 has completed accessing si->swap_map, so if swapoff starts where you have put it, it would get stalled waiting for the PTL which CPU0 has. "
I agree try_to_unuse() will wait for si->inuse_pages to become zero. But why will it wait for the PTL? It seems the PTL is not used to protect si->inuse_pages. Or am I missing something?
CPU0                                   CPU1
----                                   ----
zap_pte_range
  pte_offset_map_lock -- spin_lock is held.
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                       swapoff
                                         try_to_unuse
                                           filemap_get_folio
                                           folio_free_swap
                                           /* remove swap cache */
                                         percpu_ref_kill(&p->users);
    swap_page_trans_huge_swapped
  pte_unmap_unlock -- spin_lock is released.
                                         synchronize_rcu();
                                         --> Will wait for pte_unmap_unlock
                                             to be called?
Perhaps you can educate me here; I thought that synchronize_rcu() will only wait for RCU critical sections to complete. The PTL is a spin lock, so why would synchronize_rcu() wait for the PTL to become unlocked?
I assume the PTL will always disable preemption, which holds off a grace period until the PTL is released. But this might be fragile and I'm not really sure; I might be wrong.
Thanks.
/* free si->swap_map[] */
Thanks.
On 2024/3/7 13:56, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 17:31, Ryan Roberts wrote:
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
[... quoted patch description and earlier discussion snipped ...]
I don't find that the folio is locked for shmem. find_lock_entries() will only lock the folio if (!xa_is_value()), that is, if it is not a swap entry. Can you point out where the folio is locked for shmem?
You're right, the folio is only locked if it is not a swap entry. That's my mistake. But it seems the above race is still nonexistent. shmem_unuse() will first be called to read all the shared memory data that resides in the swap device back into memory when doing swapoff. In that case, all the swapped pages are moved to the page cache, so there won't be any xa_is_value(folio) cases when calling shmem_undo_range(). free_swap_and_cache() won't even be called from shmem_undo_range() after shmem_unuse(). Or am I missing something?
Thanks.
-- Best Regards, Huang, Ying
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/7 13:56, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 17:31, Ryan Roberts wrote:
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
[... quoted patch description and earlier discussion snipped ...]
You're right, the folio is only locked if it is not a swap entry. That's my mistake. But it seems the above race is still nonexistent. shmem_unuse() will first be called to read all the shared memory data that resides in the swap device back into memory when doing swapoff. In that case, all the swapped pages are moved to the page cache, so there won't be any xa_is_value(folio) cases when calling shmem_undo_range(). free_swap_and_cache() won't even be called from shmem_undo_range() after shmem_unuse(). Or am I missing something?
I think the following situation is possible. Right?
CPU0                                CPU1
----                                ----

shmem_undo_range
  shmem_free_swap
    xa_cmpxchg_irq
    free_swap_and_cache
      __swap_entry_free
      /* swap count become 0 */
                                    swapoff
                                      try_to_unuse
                                        shmem_unuse  /* cannot find swap entry */
                                        find_next_to_unuse
                                        filemap_get_folio
                                        folio_free_swap
                                        /* remove swap cache */
                                        /* free si->swap_map[] */

      swap_page_trans_huge_swapped  <-- access freed si->swap_map !!!
shmem_undo_range can run earlier.
-- Best Regards, Huang, Ying
On 07/03/2024 07:34, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/7 13:56, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 17:31, Ryan Roberts wrote:
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
[... quoted patch description and shmem race diagram snipped ...]
shmem_undo_range can run earlier.
Yes, that's the shmem problem I've been trying to convey. Perhaps there are other (extremely subtle) mechanisms that make this impossible; I don't know.
Either way, given the length of this discussion, and the subtleties in the synchronization mechanisms that have so far been identified, I think the safest thing to do is just apply the patch. Then we have explicit synchronization that we can trivially reason about.
Thanks, Ryan
P.S. Thanks for the explanation about spinlocks being RCU read-side critical sections - I didn't know that.
-- Best Regards, Huang, Ying
Ryan Roberts ryan.roberts@arm.com writes:
On 07/03/2024 07:34, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/7 13:56, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 17:31, Ryan Roberts wrote:
On 06/03/2024 08:51, Miaohe Lin wrote:
[... quoted patch description and shmem race diagram snipped ...]
Yes, that's the shmem problem I've been trying to convey. Perhaps there are other (extremely subtle) mechanisms that make this impossible; I don't know.
Either way, given the length of this discussion, and the subtleties in the synchronization mechanisms that have so far been identified, I think the safest thing to do is just apply the patch. Then we have explicit synchronization that we can trivially reason about.
Yes. This is tricky, and we can improve it. So I suggest that we:

1. Revise the patch description to use the shmem race as the example, unless someone finds it to be impossible.

2. Revise the comments of get_swap_device() to explain that an RCU reader-side lock (including IRQs off, spinlock, etc.) can prevent swapoff, via the synchronize_rcu() in swapoff().

3. Revise the comments of the synchronize_rcu() in swapoff() to explain that it prevents swapoff from running in parallel with RCU reader-side critical sections, including swap cache operations, etc.
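To illustrate the pairing those comments would document, here is a minimal sketch (simplified and paraphrased from mm/swapfile.c; the exact comment wording is still to be written):

/* Reader side: pin the device so swapoff() cannot tear it down. */
struct swap_info_struct *si = get_swap_device(entry);
if (si) {
        /* ... safe to dereference si->swap_map[] here ... */
        put_swap_device(si);
}

/* swapoff() side (simplified):
 *
 *      percpu_ref_kill(&p->users);     // make new get_swap_device() fail
 *      synchronize_rcu();              // wait out RCU read-side sections,
 *                                      // which include preempt-disabled
 *                                      // regions such as spinlock (e.g.
 *                                      // PTL) holders
 *      ...                             // wait for existing users to drop
 *                                      // their references, then free
 *                                      // p->swap_map
 */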
-- Best Regards, Huang, Ying
On 07/03/2024 08:54, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
On 07/03/2024 07:34, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/7 13:56, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 17:31, Ryan Roberts wrote:
[... quoted patch description and earlier discussion snipped ...]
Yes. This is tricky, and we can improve it. So I suggest that we:

1. Revise the patch description to use the shmem race as the example, unless someone finds it to be impossible.

2. Revise the comments of get_swap_device() to explain that an RCU reader-side lock (including IRQs off, spinlock, etc.) can prevent swapoff, via the synchronize_rcu() in swapoff().

3. Revise the comments of the synchronize_rcu() in swapoff() to explain that it prevents swapoff from running in parallel with RCU reader-side critical sections, including swap cache operations, etc.
The only problem with this is that Andrew has already put my v2 into mm-*stable* :-|
So (1) from that list isn't possible. I could do a patch for (2) and (3), but to be honest, I think you would do a better job of writing it up than I would - any chance you could post the patch?
-- Best Regards, Huang, Ying
Ryan Roberts ryan.roberts@arm.com writes:
On 07/03/2024 08:54, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
On 07/03/2024 07:34, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/7 13:56, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
[... quoted patch description and earlier discussion snipped ...]
So (1) from that list isn't possible. I could do a patch for (2) and (3), but to be honest, I think you would do a better job of writing it up than I would - any chance you could post the patch?
Sure. I will do that.
-- Best Regards, Huang, Ying
On 2024/3/7 15:34, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/7 13:56, Huang, Ying wrote:
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 17:31, Ryan Roberts wrote:
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
[... quoted patch description and shmem race diagram snipped ...]
shmem_undo_range can run earlier.
Considering the above case, I tend to agree it's possible. I can't figure out a mechanism that would make it impossible yet.
Thanks.
-- Best Regards, Huang, Ying
Ryan Roberts ryan.roberts@arm.com writes:
On 06/03/2024 08:51, Miaohe Lin wrote:
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
[... quoted patch description and 4K race diagram snipped ...]
Sorry for jumping into the discussion here. IMHO, free_swap_and_cache() is called with the pte lock held.
I don't believe it has the PTL when called by shmem.
Yes, we don't hold PTL there.
After checking the code again, I think that there may be a race condition as above without the PTL. But I may have missed something, again.
[... earlier discussion snipped ...]
Perhaps you can educate me here; I thought that synchronize_rcu() will only wait for RCU critical sections to complete. The PTL is a spin lock, so why would synchronize_rcu() wait for the PTL to become unlocked?
Please take a look at the following link,
https://www.kernel.org/doc/html/next/RCU/whatisRCU.html#rcu-read-lock
" Note that anything that disables bottom halves, preemption, or interrupts also enters an RCU read-side critical section. Acquiring a spinlock also enters an RCU read-side critical sections, even for spinlocks that do not disable preemption, as is the case in kernels built with CONFIG_PREEMPT_RT=y. Sleeplocks do not enter RCU read-side critical sections. "
-- Best Regards, Huang, Ying
Thanks.
Miaohe Lin linmiaohe@huawei.com writes:
On 2024/3/6 10:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
[... quoted patch description and 4K race diagram snipped ...]
Sorry for jumping into the discussion here. IMHO, free_swap_and_cache() is called with the pte lock held. So synchronize_rcu() (called by swapoff) will wait for zap_pte_range() to release the pte lock, and this theoretical problem can't happen. Or am I missing something?
CPU0                                CPU1
----                                ----

zap_pte_range
  pte_offset_map_lock -- spin_lock is held.
    free_swap_and_cache
      __swap_entry_free
      /* swap count become 0 */
                                    swapoff
                                      try_to_unuse
                                        filemap_get_folio
                                        folio_free_swap
                                        /* remove swap cache */
                                      percpu_ref_kill(&p->users);
      swap_page_trans_huge_swapped
  pte_unmap_unlock -- spin_lock is released.
                                    synchronize_rcu();
                                    --> Will wait pte_unmap_unlock to be called?
                                    /* free si->swap_map[] */
I think that you are right. We are safe if PTL is held. Thanks a lot for pointing this out!
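In code terms, the claim is roughly this (a simplified sketch of the zap_pte_range() path, not the literal source):

pte = pte_offset_map_lock(mm, pmd, addr, &ptl); /* spin_lock: enters an
                                                   RCU read-side section */
...
free_swap_and_cache(entry);     /* accesses si->swap_map with the PTL held */
...
pte_unmap_unlock(pte, ptl);     /* leaves the read-side section; only after
                                   this can the synchronize_rcu() in swapoff()
                                   return and si->swap_map be freed */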
-- Best Regards, Huang, Ying
On 06/03/2024 02:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
[... quoted patch description snipped ...]
I think that this can be simplified. Even for a 4K folio, this could happen.
I'm not so sure...
CPU0                                CPU1
----                                ----

zap_pte_range
  free_swap_and_cache
    __swap_entry_free
    /* swap count become 0 */
                                    swapoff
                                      try_to_unuse
                                        filemap_get_folio
                                        folio_free_swap
                                        /* remove swap cache */
                                        /* free si->swap_map[] */

    swap_page_trans_huge_swapped  <-- access freed si->swap_map !!!
I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is called (per David, above), which is called after swap_page_trans_huge_swapped() has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero until after CPU0 has completed accessing si->swap_map, so if swapoff starts where you have put it, it would get stalled waiting for the PTL which CPU0 has.
I'm sure there are other ways that this could be racy for 4K folios, but I don't think this is one of them? e.g. if CPU0 does something with si after it has decremented si->inuse_pages, then there is a problem.
[... patch snipped ...]
I don't think that it's a good idea to warn for bad free entries. get_swap_device() could be called without enough locking to prevent parallel swap operations on the entry, so false positives are possible there. I think that it's good to add more checks in general; for example, in free_swap_and_cache() we can check more, because we are sure the swap entry will not be freed by parallel swap operations.
Yes, agreed. Johannes also reported that he is seeing false alarms due to this. I'm going to remove it and add it to free_swap_and_cache() as you suggest. This also fits well with the batched version I'm working on, where we want to check the global si things once but need to check !free for every entry in the loop (aiming to post that this week).
Just don't put the check in the general get_swap_device(). We can add another helper to check that.
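Perhaps something along these lines (a purely hypothetical sketch -- the helper name and placement are illustrative, not from the patch or the tree):

/*
 * Hypothetical helper: callers such as free_swap_and_cache() that hold
 * enough locking (e.g. the PTL) to keep the entry stable can use this to
 * sanity-check the entry after a successful get_swap_device().
 */
static bool swap_entry_allocated(struct swap_info_struct *si, swp_entry_t entry)
{
        return data_race(si->swap_map[swp_offset(entry)]) != 0;
}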
I found that there are other checks in get_swap_device() already. I think that we may need to revert,
commit 23b230ba8ac3 ("mm/swap: print bad swap offset entry in get_swap_device")
Yes agree this should be reverted.
which introduced it, and add the check in appropriate places.
I'm not quite sure what the "appropriate places" are. Looking at the call sites for get_swap_device(), it looks like they are all racy except free_swap_and_cache(), which is called with the PTL. So the check should only really go there?
But... free_swap_and_cache() is called without the PTL by shmem (in 2 places - see the shmem_free_swap() wrapper). It looks like in one of those places the folio lock is held, so I guess this has a similar effect to holding the PTL. But the other shmem_free_swap() call site doesn't appear to hold the folio lock. Am I missing something, or does this mean that for this path free_swap_and_cache() is racy, and therefore we shouldn't be doing the `offset >= si->max` or the `!swap_map[offset]` check in free_swap_and_cache() either?
-- Best Regards, Huang, Ying
Ryan Roberts ryan.roberts@arm.com writes:
On 06/03/2024 02:52, Huang, Ying wrote:
Ryan Roberts ryan.roberts@arm.com writes:
[... quoted patch description and 4K race diagram snipped ...]
I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is called (per David, above), which is called after swap_page_trans_huge_swapped() has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero
try_to_unuse() can decrease si->inuse_pages by itself via the following loop,
while (READ_ONCE(si->inuse_pages) &&
       !signal_pending(current) &&
       (i = find_next_to_unuse(si, i)) != 0) {

        entry = swp_entry(type, i);
        folio = filemap_get_folio(swap_address_space(entry), i);
        if (IS_ERR(folio))
                continue;

        /*
         * It is conceivable that a racing task removed this folio from
         * swap cache just before we acquired the page lock. The folio
         * might even be back in swap cache on another swap area. But
         * that is okay, folio_free_swap() only removes stale folios.
         */
        folio_lock(folio);
        folio_wait_writeback(folio);
        folio_free_swap(folio);
        folio_unlock(folio);
        folio_put(folio);
}
Then it can proceed to swapoff. But, as Miaohe pointed out, the PTL prevents swapoff from freeing the data structures later, via synchronize_rcu().
until after CPU0 has completed accessing si->swap_map, so if swapoff starts where you have put it, it would get stalled waiting for the PTL which CPU0 has.
I'm sure there are other ways that this could be racy for 4K folios, but I don't think this is one of them? e.g. if CPU0 does something with si after it has decremented si->inuse_pages, then there is a problem.
[... patch snipped ...]
[... earlier discussion snipped ...]
But... free_swap_and_cache() is called without the PTL by shmem (in 2 places - see the shmem_free_swap() wrapper). It looks like in one of those places the folio lock is held, so I guess this has a similar effect to holding the PTL. But the other shmem_free_swap() call site doesn't appear to hold the folio lock. Am I missing something, or does this mean that for this path free_swap_and_cache() is racy, and therefore we shouldn't be doing the `offset >= si->max` or the `!swap_map[offset]` check in free_swap_and_cache() either?
In shmem_free_swap(), xa_cmpxchg_irq() is used to get the swap entry and clear the original mapping, so no other code path can free the swap count held by the mapping. But after dropping the swap count, it appears that it's not safe.
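For reference, shmem_free_swap() looks roughly like this (paraphrased from mm/shmem.c around this time; the return convention may differ by version):

static long shmem_free_swap(struct address_space *mapping,
                            pgoff_t index, void *radswap)
{
        void *old;

        /* Atomically steal the swap entry from the mapping; if we lose
         * the race, someone else owns (and will free) the swap count. */
        old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
        if (old != radswap)
                return 0;
        /* The swap count is dropped without the PTL or the folio lock. */
        free_swap_and_cache(radix_to_swp_entry(radswap));

        return 1;
}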
-- Best Regards, Huang, Ying