The si->lock must be held when deleting the si from the available list. Otherwise, another thread can re-add the si to the available list, which can lead to memory corruption. The only place we have found where this happens is in the swapoff path. This case can be described as below:
core 0                          core 1
swapoff

del_from_avail_list(si)         waiting

try lock si->lock               acquire swap_avail_lock
                                and re-add si into
                                swap_avail_head

acquire si->lock, but miss that si has already been re-added, and continue
to clear SWP_WRITEOK, etc.
It is easy to trigger a flood of warning messages inside get_swap_pages() in some special cases, for example by calling madvise(MADV_PAGEOUT) on blocks of touched memory concurrently while running many swapon/swapoff operations (e.g. stress-ng-swap).
In the worst case, however, this scenario can cause a panic. In swapoff(), the memory used by the si is kept in swap_info[] after the swap device is turned off, so the corruption does not show up immediately; it only surfaces once that slot is allocated and reset for a new swap device in the swapon path. A panic caused this way (with CONFIG_PLIST_DEBUG enabled):
------------[ cut here ]------------
top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
Modules linked in: rfkill(E) crct10dif_ce(E)...
CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : plist_check_prev_next_node+0x50/0x70
lr : plist_check_prev_next_node+0x50/0x70
sp : ffff0018009d3c30
x29: ffff0018009d3c40 x28: ffff800011b32a98 x27: 0000000000000000
x26: ffff001803908000 x25: ffff8000128ea088 x24: ffff800011b32a48
x23: 0000000000000028 x22: ffff001800875c00 x21: ffff800010f9e520
x20: ffff001800875c00 x19: ffff001800fdc6e0 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000 x15: 0736076307640766
x14: 0730073007380731 x13: 0736076307640766 x12: 0730073007380731
x11: 000000000004058d x10: 0000000085a85b76 x9 : ffff8000101436e4
x8 : ffff800011c8ce08 x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0017df9ed338 x4 : 0000000000000001 x3 : ffff8017ce62a000
x2 : ffff0017df9ed340 x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 plist_check_prev_next_node+0x50/0x70
 plist_check_head+0x80/0xf0
 plist_add+0x28/0x140
 add_to_avail_list+0x9c/0xf0
 _enable_swap_info+0x78/0xb4
 __do_sys_swapon+0x918/0xa10
 __arm64_sys_swapon+0x20/0x30
 el0_svc_common+0x8c/0x220
 do_el0_svc+0x2c/0x90
 el0_svc+0x1c/0x30
 el0_sync_handler+0xa8/0xb0
 el0_sync+0x148/0x180
irq event stamp: 2082270
Now si->lock is taken before calling del_from_avail_list(), so other threads see the si removed from the available lists and SWP_WRITEOK cleared together, and will not re-add it.
This problem exists in versions after stable 5.10.y.
Cc: stable@vger.kernel.org
Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
---
 mm/swapfile.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 62ba2bf577d7..2c718f45745f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
 {
 	int nid;
 
+	assert_spin_locked(&p->lock);
 	for_each_node(nid)
 		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
 }
@@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_unlock(&swap_lock);
 		goto out_dput;
 	}
-	del_from_avail_list(p);
 	spin_lock(&p->lock);
+	del_from_avail_list(p);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
 		int nid;
Hello
I have fixed up some things based on patch v1. To help readers and reviewers reproduce this bug, I am sharing a reproducer here:
swap_bomb.sh
#!/usr/bin/env bash
stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v
madvise_shared.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MSIZE (1024 * 1024 * 2)

int main()
{
	char *shm_addr;
	unsigned long i;

	while (1) {
		// Map shared memory segment
		shm_addr = mmap(NULL, MSIZE, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (shm_addr == MAP_FAILED) {
			perror("Failed to map shared memory segment");
			exit(EXIT_FAILURE);
		}

		for (i = 0; i < MSIZE; i++) {
			shm_addr[i] = 1;
		}

		// Advise kernel on usage pattern of shared memory
		if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
			perror("Failed to advise kernel on shared memory usage");
			exit(EXIT_FAILURE);
		}

		for (i = 0; i < MSIZE; i++) {
			shm_addr[i] = 1;
		}

		// Advise kernel on usage pattern of shared memory
		if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
			perror("Failed to advise kernel on shared memory usage");
			exit(EXIT_FAILURE);
		}

		// Use shared memory
		printf("Hello, shared memory: 0x%lx\n", (unsigned long)shm_addr);

		// Unmap shared memory segment
		if (munmap(shm_addr, MSIZE) == -1) {
			perror("Failed to unmap shared memory segment");
			exit(EXIT_FAILURE);
		}
	}

	return 0;
}
The bug reproduces more quickly (in about 2~5 minutes) if more instances of swap_bomb.sh and madvise_shared run concurrently, for example as sketched below.
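For illustration only, a small driver script along these lines can run them together (the file names, instance count and compiler flags here are hypothetical, not part of the original reproducer):

#!/usr/bin/env bash
# Illustrative driver: build the madvise reproducer, then run several
# copies of it alongside the stress-ng "swap bomb" above.
gcc -O2 -o madvise_shared madvise_shared.c

./swap_bomb.sh &                 # vm-class stress-ng stressors, incl. swapon/swapoff churn
for i in $(seq 4); do            # several concurrent MADV_PAGEOUT loops
        ./madvise_shared &
done
wait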
Thanks.
change log:
v1 -> v2
* fix up the commit message and add assert_spin_locked(&p->lock) inside __del_from_avail_list() (suggested by Matthew Wilcox and Bagas Sanjaya)
On 4/4/23 11:47 PM, Rongwei Wang wrote:
...
On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote:
Hello
I have fixed up some things based on patch v1. To help readers and reviewers reproduce this bug, I am sharing a reproducer here:
I reproduced this problem under a VM this way:
$ sudo ./stress-ng --swap 1

// on another terminal
$ for i in `seq 8`; do ./swap & done

Looks simpler than yours :-)
(Didn't realize you had posted your reproducer here since I'm not CCed and
only found it after inventing mine.)

Then the warning messages normally appear within a few seconds.
Here is the code for the above swap prog:
#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>

#define SIZE 0x100000

int main(void)
{
	int i, ret;
	void *p;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	ret = 0;
	while (1) {
		for (i = 0; i < SIZE; i += 0x1000)
			((char *)p)[i] = 1;

		ret = madvise(p, SIZE, MADV_PAGEOUT);
		if (ret != 0) {
			perror("madvise");
			break;
		}
	}

	return ret;
}
Unfortunately, this test prog does not work on kernels before v5.4 because MADV_PAGEOUT was introduced in v5.4. I tested on v5.4 and the problem is also there.
Haven't found a way to trigger swap with swap devices coming and going on kernels before v5.4; I tried putting the test prog in a memcg with a memory limit, but then the prog is easily killed because there is nowhere to swap out.
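For reference, the memcg attempt mentioned above would look roughly like this; it is only a sketch, and the cgroup v1 paths and the 64M limit are my assumptions, not what was actually used:

#!/usr/bin/env bash
# Sketch of the memcg idea (assumed cgroup v1 layout, arbitrary 64M limit).
mkdir -p /sys/fs/cgroup/memory/swaptest
echo $((64 * 1024 * 1024)) > /sys/fs/cgroup/memory/swaptest/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/swaptest/cgroup.procs
# Run the test prog under the limit; when the swap device disappears under
# it, the prog tends to be OOM-killed rather than swapping out.
./swap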
...
Oh, sorry, I missed this email just now because I was also replying to your previous email.
On 2023/4/6 20:12, Aaron Lu wrote:
On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote:
Hello
I have fixed up some things based on patch v1. To help readers and reviewers reproduce this bug, I am sharing a reproducer here:
I reproduced this problem under a VM this way:
$ sudo ./stress-ng --swap 1

// on another terminal
$ for i in `seq 8`; do ./swap & done

Looks simpler than yours :-)
Cool, that is indeed simpler.
(Didn't realize you had posted your reproducer here since I'm not CCed and only found it after inventing mine.) Then the warning messages normally appear within a few seconds.
Here is the code for the above swap prog:
...
Unfortunately, this test prog does not work on kernels before v5.4 because MADV_PAGEOUT was introduced in v5.4. I tested on v5.4 and the problem is also there.
Maybe that is why this bug has not been found until now. We first saw it triggered by stress-ng-swap together with stress-ng-madvise (PAGEOUT), so it seems that is the reason.

It seems MADV_COLD was also introduced together with MADV_PAGEOUT. I have no better idea here and have to depend on you. :-)
Haven't found a way to trigger swap with swap devices coming and going on kernels before v5.4; I tried putting the test prog in a memcg with a memory limit, but then the prog is easily killed because there is nowhere to swap out.
Personally, I do not intend to keep searching for a way to reproduce it before v5.4. Of course, if you have an idea, I can try it.

Thanks :-)
On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
The si->lock must be held when deleting the si from the available list.
...
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
 {
 	int nid;
 
+	assert_spin_locked(&p->lock);
 	for_each_node(nid)
 		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
 }
@@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_unlock(&swap_lock);
 		goto out_dput;
 	}
-	del_from_avail_list(p);
 	spin_lock(&p->lock);
+	del_from_avail_list(p);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
 		int nid;
So we have
	swap_avail_lock
	swap_info_struct.lock
	swap_cluster_info.lock
Is the ranking of these three clearly documented somewhere?
Did you test this with lockdep fully enabled?
I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device according to numa node") is the appropriate Fixes: target - do you agree?
These functions use identifier `p' for the swap_info_struct*, whereas most other code uses the much more sensible `si'. That's just rude. But we shouldn't change that within this fix.
Hi Andrew
On 4/5/23 3:26 AM, Andrew Morton wrote:
On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
The si->lock must be held when deleting the si from the available list.
...
So we have
	swap_avail_lock
	swap_info_struct.lock
	swap_cluster_info.lock
Is the ranking of these three clearly documented somewhere?
It seems we have:
swap_lock
swap_info_struct.lock
swap_avail_lock
I just summarized the ranking of these three locks by reading the code; I have not found any document for it (maybe one exists).
Did you test this with lockdep fully enabled?
I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device according to numa node") is the appropriate Fixes: target - do you agree?
Yes, I'm sure my latest test kernel includes Aaron's a2468cc9bfdff, and my test .config has the following options enabled:

CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
These functions use identifier `p' for the swap_info_struct*, whereas most other code uses the much more sensible `si'. That's just rude. But we shouldn't change that within this fix.
Indeed, it's somewhat confusing to use both 'si' and 'p'. I can prepare another patch to replace 'p' with 'si'.
Thanks.
Hi Andrew,
Sorry for replying a little late; it was a holiday here yesterday.
On Tue, Apr 04, 2023 at 12:26:00PM -0700, Andrew Morton wrote:
On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
The si->lock must be held when deleting the si from the available list.
...
So we have
	swap_avail_lock
	swap_info_struct.lock
	swap_cluster_info.lock
Is the ranking of these three clearly documented somewhere?
I see some comments in swapfile.c that mention something related; e.g. above the definition of swap_avail_heads, the comment says swap_lock has to be taken before si->lock, and swap_avail_lock can be taken after si->lock is held. But I'm not aware of a single place documenting these rules together. Documenting them would be useful, I think; let me see if I can come up with something later.
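Something like the following, purely a sketch of what such a note could look like (not an actual kernel comment or patch), would capture the ordering discussed in this thread:

/*
 * Sketch of a possible lock-ordering note for mm/swapfile.c, based only on
 * the ordering described in this thread:
 *
 *   swap_lock                    (global)
 *    -> swap_info_struct->lock   (per swap device)
 *        -> swap_avail_lock      (per-node avail plists)
 *
 * i.e. swap_lock is taken before a device's si->lock, and swap_avail_lock
 * may be taken while si->lock is held, never the other way around.
 */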
Did you test this with lockdep fully enabled?
I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device according to numa node") is the appropriate Fixes: target - do you agree?
It doesn't appear to be the case. For one thing, the problematic code that removes the swap device from the avail list without acquiring si->lock was there before my commit and my commit didn't change that behaviour. For another, I wanted to see if the problem is still there without my commit(just to make sure).
I followed Rongwei's description and used the stress-ng swap test together with some test progs that do memory allocation and then madvise(MADV_PAGEOUT) in a loop to reproduce this problem. I can also see a warning like the one below using Linus' master branch as of today, so I believe this is the problem Rongwei described:
[ 1914.518786] ------------[ cut here ]------------
[ 1914.519049] swap_info 9 in list but !SWP_WRITEOK
[ 1914.519274] WARNING: CPU: 14 PID: 14307 at mm/swapfile.c:1085 get_swap_pages+0x3b3/0x440
[ 1914.519660] Modules linked in:
[ 1914.519811] CPU: 14 PID: 14307 Comm: swap Tainted: G W 6.3.0-rc5-00032-g99ddf2254feb #5
[ 1914.520238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014
[ 1914.520641] RIP: 0010:get_swap_pages+0x3b3/0x440
[ 1914.520860] Code: 48 8b 4c 24 30 48 c1 e0 3a 4c 09 e0 48 89 01 e8 43 79 96 00 e9 b2 fd ff ff 41 0f be 77 48 48 c7 c78
[ 1914.521709] RSP: 0018:ffffc9000ba0f838 EFLAGS: 00010282
[ 1914.521950] RAX: 0000000000000000 RBX: ffff888154411400 RCX: 0000000000000000
[ 1914.522273] RDX: 0000000000000004 RSI: ffffffff824035cb RDI: 0000000000000001
[ 1914.522601] RBP: ffff888100d95f68 R08: 0000000000000001 R09: 0000000000000003
[ 1914.522926] R10: ffffffff82a7a420 R11: ffffffff82a7a420 R12: 0000000000000350
[ 1914.523249] R13: ffff888100d95da8 R14: ffff888100d95f50 R15: ffff888100d95c00
[ 1914.523576] FS:  00007f23abea2600(0000) GS:ffff88823b600000(0000) knlGS:0000000000000000
[ 1914.523942] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1914.524206] CR2: 00007f23abbff000 CR3: 0000000104b86004 CR4: 0000000000770ee0
[ 1914.524534] PKRU: 55555554
[ 1914.524661] Call Trace:
[ 1914.524782]  <TASK>
[ 1914.524889]  folio_alloc_swap+0xde/0x230
[ 1914.525076]  add_to_swap+0x36/0xb0
[ 1914.525242]  shrink_folio_list+0x9ab/0xef0
[ 1914.525445]  reclaim_folio_list+0x70/0x130
[ 1914.525644]  reclaim_pages+0x9c/0x1c0
[ 1914.525819]  madvise_cold_or_pageout_pte_range+0x79f/0xc80
[ 1914.526073]  walk_pgd_range+0x4d8/0x940
[ 1914.526255]  ? mt_find+0x15b/0x490
[ 1914.526426]  __walk_page_range+0x211/0x230
[ 1914.526619]  walk_page_range+0x17a/0x1e0
[ 1914.526807]  madvise_pageout+0xef/0x250
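For context, that WARN comes from the device scan in get_swap_pages(); abridged and paraphrased from memory (so treat the details as approximate rather than a verbatim quote), the check is roughly:

	/* Abridged, paraphrased sketch of the per-device check in
	 * get_swap_pages(); not a verbatim copy of mm/swapfile.c. */
	spin_lock(&si->lock);
	if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
		spin_lock(&swap_avail_lock);
		/*
		 * A device still on an avail list but with SWP_WRITEOK
		 * already cleared is exactly the inconsistent state the
		 * race produces, hence the flood of these warnings.
		 */
		WARN(!(si->flags & SWP_WRITEOK),
		     "swap_info %d in list but !SWP_WRITEOK\n", si->type);
		__del_from_avail_list(si);
		spin_unlock(&si->lock);
		goto nextsi;
	}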
And when I reverted my commit on the same branch(needs some manual edits), the problem is still there.
Another thing is, I noticed Rongwei mentioned "This problem exists in versions after stable 5.10.y." in the changelog while my commit entered mainline in v4.14.
So either this problem is always there, i.e. earlier than my commit; or this problem is indeed only there after v5.10, then it should be something else that triggered it. My qemu refuses to boot v4.14 kernel so I can not verify the former yet.
It doesn't appear to be the case. For one thing, the problematic code that removes the swap device from the avail list without acquiring si->lock was there before my commit and my commit didn't change that behaviour. For another, I wanted to see if the problem is still there without my commit(just to make sure).
I followed Rongwei's description and used the stress-ng swap test together with some test progs that do memory allocation and then madvise(MADV_PAGEOUT) in a loop to reproduce this problem. I can also see a warning like the one below using Linus' master branch as of today, so I believe this is the problem Rongwei described:
Hi Aaron, I am sure this is that bug, and the panic will happen when CONFIG_PLIST_DEBUG is enabled (I'm not sure whether you have it enabled).
...
And when I reverted my commit on the same branch(needs some manual edits), the problem is still there.
Another thing is, I noticed Rongwei mentioned "This problem exists in versions after stable 5.10.y." in the changelog while my commit entered mainline in v4.14.
So either this problem is always there, i.e. earlier than my commit; or this problem is indeed only there after v5.10, then it should be something else that triggered it. My qemu refuses to boot v4.14 kernel so I can not verify the former yet.
Me too. The oldest kernel that my qemu can run is 4.19.
BTW, I tried to replace 'p' with 'si' today and found that many places need to be modified, especially inside swapoff() and swapon(). So many modifications may affect future tracking of code changes and will take some time to test, so I want to confirm whether I need to do this. If needed, I can continue with it.
Thanks.
On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
The si->lock must be held when deleting the si from the available list. Otherwise, another thread can re-add the si to the available list, which can lead to memory corruption. The only place we have found where this happens is in the swapoff path. This case can be described as below:
core 0                          core 1
swapoff

del_from_avail_list(si)         waiting

try lock si->lock               acquire swap_avail_lock
                                and re-add si into
                                swap_avail_head
confused here.
If del_from_avail_list(si) finished in the swapoff path, then this si should not exist in any of the per-node avail lists and core 1 should not be able to re-add it.
I stared at the code for a while and couldn't figure out how this happened, will continue to look at this tomorrow.
Thanks, Aaron
On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote:
On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
The si->lock must be held when deleting the si from the available list. Otherwise, another thread can re-add the si to the available list, which can lead to memory corruption. The only place we have found where this happens is in the swapoff path. This case can be described as below:
core 0                          core 1
swapoff

del_from_avail_list(si)         waiting

try lock si->lock               acquire swap_avail_lock
                                and re-add si into
                                swap_avail_head
confused here.
If del_from_avail_list(si) finished in the swapoff path, then this si should not exist in any of the per-node avail lists and core 1 should not be able to re-add it.
I think a possible sequence could be like this:
cpuX                            cpuY
swapoff                         put_swap_folio()

del_from_avail_list(si)         taken si->lock
                                  spin_lock(&si->lock);

                                swap_range_free()
                                  was_full && SWP_WRITEOK -> re-add!
                                drop si->lock

taken si->lock
proceed removing si

End result: si left on avail_list after being swapped off.
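The re-add in the diagram comes from the highest_bit bookkeeping in swap_range_free(); roughly, paraphrased and abridged from memory (so the details are approximate, not a verbatim quote of mm/swapfile.c):

	/* Sketch of the relevant part of swap_range_free(), which is
	 * called with si->lock held (e.g. via put_swap_folio()). */
	if (end > si->highest_bit) {
		bool was_full = !si->highest_bit;

		WRITE_ONCE(si->highest_bit, end);
		/*
		 * A previously-full device becomes usable again and is put
		 * back on the avail lists. Only SWP_WRITEOK is checked, so
		 * a device that swapoff has just removed from the lists
		 * (without holding si->lock) gets re-added here.
		 */
		if (was_full && (si->flags & SWP_WRITEOK))
			add_to_avail_list(si);
	}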
The problem is, add_to_avail_list() has no idea this si is being swapped off; taking si->lock before del_from_avail_list() avoids the problem. So I think this patch does the right thing, but the changelog describing how this happened needs updating, and after that:
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Thanks, Aaron
On 2023/4/6 22:57, Aaron Lu wrote:
On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote:
On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
The si->lock must be held when deleting the si from the available list. Otherwise, another thread can re-add the si to the available list, which can lead to memory corruption. The only place we have found where this happens is in the swapoff path. This case can be described as below:
core 0                          core 1
swapoff

del_from_avail_list(si)         waiting

try lock si->lock               acquire swap_avail_lock
                                and re-add si into
                                swap_avail_head
confused here.
If del_from_avail_list(si) finished in the swapoff path, then this si should not exist in any of the per-node avail lists and core 1 should not be able to re-add it.
I think a possible sequence could be like this:
cpuX                            cpuY
swapoff                         put_swap_folio()

del_from_avail_list(si)         taken si->lock
                                  spin_lock(&si->lock);

                                swap_range_free()
                                  was_full && SWP_WRITEOK -> re-add!
                                drop si->lock

taken si->lock
proceed removing si

End result: si left on avail_list after being swapped off.
The problem is, add_to_avail_list() has no idea this si is being swapped off; taking si->lock before del_from_avail_list() avoids the problem. So I think this patch does the right thing, but the changelog describing how this happened needs updating, and after that:
Hi Aaron
That's my fault. Actually, I didn't refer specifically to the swap_range_free() path in the commit message, mainly because cpuY can stand for any thread that is waiting on swap_avail_lock and then calls add_to_avail_list(), not only swap_range_free(); it could be swapon(), for example.
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Thanks for your time.
-wrw
Thanks, Aaron