Gao Xiang hsiangkao@redhat.com writes:
SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead.
FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y.
Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device") Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out") Cc: "Huang, Ying" ying.huang@intel.com Cc: stable stable@vger.kernel.org Signed-off-by: Gao Xiang hsiangkao@redhat.com
Good catch! The fix itself looks good me! Although the description is a little confusing.
After some digging, it seems that SWP_FS is set on the swap devices which make swap entry read/write go through the file system specific callback (now used by swap over NFS only).
Best Regards, Huang, Ying
I reproduced the issue with the following details:
Environment: QEMU + upstream kernel + buildroot + NVMe (2 GB)
Kernel config: CONFIG_BLK_DEV_NVME=y CONFIG_THP_SWAP=y
Some reproducable steps: mkfs.xfs -f /dev/nvme0n1 mkdir /tmp/mnt mount /dev/nvme0n1 /tmp/mnt bs="32k" sz="1024m" # doesn't matter too much, I also tried 16m xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
mkswap /tmp/mnt/sw swapon /tmp/mnt/sw
stress --vm 2 --vm-bytes 600M # doesn't matter too much as well
Symptoms:
- FS corruption (e.g. checksum failure)
- memory corruption at: 0xd2808010
- segfault
...
mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c index 6c26916e95fd..2937daf3ca02 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) goto nextsi; } if (size == SWAPFILE_CLUSTER) {
if (!(si->flags & SWP_FS))
} else n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,if (si->flags & SWP_BLKDEV) n_ret = swap_alloc_cluster(si, swp_entries);