SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead.
FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y.
Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
I reproduced the issue with the following details:
Environment: QEMU + upstream kernel + buildroot + NVMe (2 GB)
Kernel config: CONFIG_BLK_DEV_NVME=y CONFIG_THP_SWAP=y
Some reproducible steps:
	mkfs.xfs -f /dev/nvme0n1
	mkdir /tmp/mnt
	mount /dev/nvme0n1 /tmp/mnt
	bs="32k"
	sz="1024m"    # doesn't matter too much, I also tried 16m
	xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
	xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
	xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
	xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
	xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
	mkswap /tmp/mnt/sw
	swapon /tmp/mnt/sw
	stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
Symptoms:
 - FS corruption (e.g. checksum failure)
 - memory corruption at: 0xd2808010
 - segfault
 ...
 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c26916e95fd..2937daf3ca02 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			goto nextsi;
 		}
 		if (size == SWAPFILE_CLUSTER) {
-			if (!(si->flags & SWP_FS))
+			if (si->flags & SWP_BLKDEV)
 				n_ret = swap_alloc_cluster(si, swp_entries);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
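To make the difference between the two checks concrete, here is a minimal user-space sketch (not kernel code). The flag names come from the patch; the bit values and the helper names old_allows_cluster() and new_allows_cluster() are illustrative assumptions, so please check include/linux/swap.h for the real definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real ones live in include/linux/swap.h. */
enum {
	SWP_BLKDEV    = 1 << 6,	/* swap area is a raw block device */
	SWP_ACTIVATED = 1 << 7,	/* ->swap_activate() was called for a swapfile */
	SWP_FS        = 1 << 8,	/* swap I/O is routed through the filesystem */
};

static_assert((SWP_BLKDEV & SWP_FS) == 0, "flags must be distinct bits");

/* Check before the patch: everything without SWP_FS may get a huge cluster. */
bool old_allows_cluster(unsigned int flags)
{
	return !(flags & SWP_FS);
}

/* Check after the patch: only raw block devices may get a huge cluster. */
bool new_allows_cluster(unsigned int flags)
{
	return (flags & SWP_BLKDEV) != 0;
}
```

Assuming a regular swapfile (for example on XFS) carries SWP_ACTIVATED but neither SWP_FS nor SWP_BLKDEV, old_allows_cluster() returns true for it while new_allows_cluster() returns false, which is the case the one-line change is meant to exclude.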
On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <hsiangkao@redhat.com> wrote:
> SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
This is very hard to understand :(
> So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead.
> FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y.
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Why do you think it has taken three years to discover this?
Hi Andrew,
On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <hsiangkao@redhat.com> wrote:
> > SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
> This is very hard to understand :(
Thanks for your reply...
The related logic is in __swap_writepage() and setup_swap_extents(), and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...
I will also talk with "Huang, Ying" in person if no response here.
> > So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead.
> > FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y.
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Why do you think it has taken three years to discover this?
I'm not sure whether the Red Hat BZ is publicly accessible; the issue can be reproduced since RHEL 8: https://bugzilla.redhat.com/show_bug.cgi?id=1855474
It seems hard to believe, but I think it's just because few users run the SSD device + THP + file-backed swap device combination... maybe I'm wrong here, but that is what my test shows.
Thanks, Gao Xiang
On Wed, Aug 19, 2020 at 1:15 PM Gao Xiang <hsiangkao@redhat.com> wrote:
> Hi Andrew,
> On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <hsiangkao@redhat.com> wrote:
> > > SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
> > This is very hard to understand :(
> Thanks for your reply...
> The related logic is in __swap_writepage() and setup_swap_extents(), and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...
I think just NFS falls into this case, so you may rephrase it to:
SWP_FS is only used for swap files over NFS. So, !SWP_FS means non NFS swap, it could be either file backed or device backed.
Does this look more understandable?
Hi Yang,
On Wed, Aug 19, 2020 at 02:41:08PM -0700, Yang Shi wrote:
> On Wed, Aug 19, 2020 at 1:15 PM Gao Xiang <hsiangkao@redhat.com> wrote:
> > Hi Andrew,
> > On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <hsiangkao@redhat.com> wrote:
> > > > SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
> > > This is very hard to understand :(
> > Thanks for your reply...
> > The related logic is in __swap_writepage() and setup_swap_extents(), and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...
> I think just NFS falls into this case, so you may rephrase it to:
> SWP_FS is only used for swap files over NFS. So, !SWP_FS means non NFS swap, it could be either file backed or device backed.
Thanks for your suggestion...
That looks reasonable, and after I looked at bc4ae27d817a ("mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS")
I think it could be rephrased into
" The SWP_FS flag is used to make swap_{read,write}page() go through the filesystem, and it's only used for swap files over NFS. So, !SWP_FS means non NFS for now, it could be either file backed or device backed. Something similar goes with legacy SWP_FILE. "
Does it look sane? And I will wait for further suggestion about this for a while.
And IMO, the SWP_FS flag might be useful for other purposes later (e.g. for some CoW swapfile use, though I haven't thought carefully about whether that's practical or not...)
Thanks, Gao Xiang
On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <hsiangkao@redhat.com> wrote:
> > SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
> This is very hard to understand :(
I'll work with Gao to rephrase that message. Sorry!
> > So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead.
> > FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y.
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Why do you think it has taken three years to discover this?
My bet here is that it's rare to go for a swapfile on non-rotational devices, and even rarer to create the swapfile when the filesystem is already fragmented.
RHEL-8, v4.18-based, is starting to see more adopters among Red Hat's customer base, thus the report now. We are also working on a secondary issue related to CONFIG_THP_SWAP, where the shrinker registered for deferred THP split hits a NULL pointer dereference in case the swap device is backed by a rotational drive.
-- Rafael
Hi Rafael,
On Wed, Aug 19, 2020 at 04:44:05PM -0400, Rafael Aquini wrote:
> On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <hsiangkao@redhat.com> wrote:
> > > SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
> > This is very hard to understand :(
> I'll work with Gao to rephrase that message. Sorry!
Sorry about that :( I just finished the test and went through the related swap code and finally saw this, so I think it wouldn't work entirely for the current swap code... and sorry about my limited English.
Kindly feel free to repost the patch with rephrased commit message. Anyway, I've done this task :)
Thanks, Gao Xiang
Gao Xiang <hsiangkao@redhat.com> writes:
> SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
> So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead.
> FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y.
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Good catch! The fix itself looks good to me! Although the description is a little confusing.
After some digging, it seems that SWP_FS is set on the swap devices which make swap entry read/write go through the file system specific callback (now used by swap over NFS only).
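As a rough illustration of that reading, the following user-space model shows how the three flags could end up set for different swap targets. It is only a sketch based on my understanding of the swapon path around setup_swap_extents(); the function model_setup_flags(), its parameters and the bit values are hypothetical and not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; see include/linux/swap.h for the real ones. */
enum {
	SWP_BLKDEV    = 1 << 6,
	SWP_ACTIVATED = 1 << 7,
	SWP_FS        = 1 << 8,
};

static_assert((SWP_ACTIVATED & SWP_FS) == 0, "flags must be distinct bits");

/*
 * Hypothetical model of flag setup at swapon time.
 * activate_ret stands in for the result of ->swap_activate():
 *   0  : route all swap I/O through the filesystem (the NFS case),
 *   >0 : extents were added, plain block I/O is used,
 *   <0 : activation failed.
 */
unsigned int model_setup_flags(bool is_blkdev, bool has_swap_activate,
			       int activate_ret)
{
	unsigned int flags = 0;

	if (is_blkdev)
		return flags | SWP_BLKDEV;

	if (has_swap_activate) {
		if (activate_ret >= 0)
			flags |= SWP_ACTIVATED;
		if (activate_ret == 0)
			flags |= SWP_FS;
	}

	/* Without ->swap_activate(), the generic path sets none of these. */
	return flags;
}
```

Under these assumptions, only the "activate returned 0" case carries SWP_FS, so a check on !SWP_FS cannot tell a raw block device apart from an ordinary swapfile.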
Best Regards, Huang, Ying
Hi Ying,
On Thu, Aug 20, 2020 at 12:36:08PM +0800, Huang, Ying wrote:
> Gao Xiang <hsiangkao@redhat.com> writes:
> > SWP_FS doesn't mean the device is file-backed swap device, which just means each writeback request should go through fs by DIO. Or it'll just use extents added by .swap_activate(), but it also works as file-backed swap device.
> > So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead.
> > FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y.
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> > Cc: "Huang, Ying" <ying.huang@intel.com>
> > Cc: stable <stable@vger.kernel.org>
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> Good catch! The fix itself looks good to me! Although the description is a little confusing.
> After some digging, it seems that SWP_FS is set on the swap devices which make swap entry read/write go through the file system specific callback (now used by swap over NFS only).
Okay, let me send out v2 with the updated commit message in https://lore.kernel.org/r/20200820012409.GB5846@xiangao.remote.csb/
Thanks, Gao Xiang