Hole punching currently evicts pages from the page cache and then goes on to remove blocks from the inode. This happens under both XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL, which provides appropriate serialization with racing reads or page faults. However, there is currently nothing that prevents readahead triggered by fadvise() or madvise() from racing with the hole punch and instantiating a page cache page after hole punching has evicted the page cache in xfs_flush_unmap_range() but before it has removed blocks from the inode. This page cache page will be mapping a soon-to-be-freed block, and that can lead to returning stale data to userspace or even filesystem corruption.
Fix the problem by protecting the handling of readahead requests with XFS_IOLOCK_SHARED, similarly to how we protect reads.
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahc...
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 28101bbc0b78..d952d5962e93 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -28,6 +28,7 @@
 #include <linux/falloc.h>
 #include <linux/backing-dev.h>
 #include <linux/mman.h>
+#include <linux/fadvise.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -933,6 +934,30 @@ xfs_file_fallocate(
 	return error;
 }
 
+STATIC int
+xfs_file_fadvise(
+	struct file	*file,
+	loff_t		start,
+	loff_t		end,
+	int		advice)
+{
+	struct xfs_inode *ip = XFS_I(file_inode(file));
+	int ret;
+	int lockflags = 0;
+
+	/*
+	 * Operations creating pages in page cache need protection from hole
+	 * punching and similar ops
+	 */
+	if (advice == POSIX_FADV_WILLNEED) {
+		lockflags = XFS_IOLOCK_SHARED;
+		xfs_ilock(ip, lockflags);
+	}
+	ret = generic_fadvise(file, start, end, advice);
+	if (lockflags)
+		xfs_iunlock(ip, lockflags);
+	return ret;
+}
+
 STATIC loff_t
 xfs_file_remap_range(
@@ -1232,6 +1257,7 @@ const struct file_operations xfs_file_operations = {
 	.fsync		= xfs_file_fsync,
 	.get_unmapped_area = thp_get_unmapped_area,
 	.fallocate	= xfs_file_fallocate,
+	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
 };
On Thu, Aug 29, 2019 at 03:10:34PM +0200, Jan Kara wrote:
Is there a test on xfstests to demonstrate this race?
Will test it out though...
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com
--D
On Thu 29-08-19 08:52:04, Darrick J. Wong wrote:
On Thu, Aug 29, 2019 at 03:10:34PM +0200, Jan Kara wrote:
Is there a test on xfstests to demonstrate this race?
No, but I can try to create one.
Will test it out though...
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com
Thanks. BTW, will you pick up these patches please?
Honza
On Fri, Aug 30, 2019 at 05:24:49PM +0200, Jan Kara wrote:
On Thu 29-08-19 08:52:04, Darrick J. Wong wrote:
On Thu, Aug 29, 2019 at 03:10:34PM +0200, Jan Kara wrote:
Is there a test on xfstests to demonstrate this race?
No, but I can try to create one.
<nod> I imagine this race was hard to spot in the first place...
Will test it out though...
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com
Thanks. BTW, will you pick up these patches please?
Yeah, they looked fine.
--D
-- Jan Kara jack@suse.com SUSE Labs, CR
On Fri 30-08-19 17:24:49, Jan Kara wrote:
On Thu 29-08-19 08:52:04, Darrick J. Wong wrote:
On Thu, Aug 29, 2019 at 03:10:34PM +0200, Jan Kara wrote:
Is there a test on xfstests to demonstrate this race?
No, but I can try to create one.
I was experimenting with this but I could not reproduce the issue in my test VM without inserting an artificial delay at the appropriate place... So I don't think there's much point in an fstest for this.
Honza
On Wed, Sep 18, 2019 at 02:31:24PM +0200, Jan Kara wrote:
On Fri 30-08-19 17:24:49, Jan Kara wrote:
On Thu 29-08-19 08:52:04, Darrick J. Wong wrote:
On Thu, Aug 29, 2019 at 03:10:34PM +0200, Jan Kara wrote:
Is there a test on xfstests to demonstrate this race?
No, but I can try to create one.
I was experimenting with this but I could not reproduce the issue in my test VM without inserting an artificial delay at the appropriate place... So I don't think there's much point in an fstest for this.
<shrug> We've added debugging knobs to XFS that inject delays to demonstrate race conditions that are hard to reproduce, but OTOH it's more fun to have a generic/ test that you can use to convince the other fs maintainers to take your patches. :)
--D
On 18/09/2019 15:31, Jan Kara wrote: <>
Is there a test on xfstests to demonstrate this race?
No, but I can try to create one.
I was experimenting with this but I could not reproduce the issue in my test VM without inserting an artificial delay at the appropriate place... So I don't think there's much point in an fstest for this.
Honza
If I understand correctly, you will need threads that direct-write files, then fadvise(WILL_NEED) - in parallel with truncating (punch_hole) these files - in parallel with trashing caches. (Direct-write is so data is not present in the cache when you come to WILL_NEED it into the cache; otherwise the xfs b-trees are not exercised. Or are you more worried about the page cache races?)
Also, the d-writes might want to exercise multiple extent sizes + holes as well.
I have a very different system but it's kind of the test we did for this problem.
The reason it is never hit is that fadvise(WILL_NEED) is never really used that much, and there are no applications that actually blindly truncate during IO; it is only us in testing who do this meaningless thing.
Thanks again, Jan, for working on this
Boaz
On Mon 23-09-19 15:33:05, Boaz Harrosh wrote:
On 18/09/2019 15:31, Jan Kara wrote: <>
Is there a test on xfstests to demonstrate this race?
No, but I can try to create one.
I was experimenting with this but I could not reproduce the issue in my test VM without inserting an artificial delay at the appropriate place... So I don't think there's much point in an fstest for this.
Honza
If I understand correctly, you will need threads that direct-write files, then fadvise(WILL_NEED) - in parallel with truncating (punch_hole) these files - in parallel with trashing caches. (Direct-write is so data is not present in the cache when you come to WILL_NEED it into the cache; otherwise the xfs b-trees are not exercised. Or are you more worried about the page cache races?)
What I was testing was: Fill a file with data. One process does fadvise(WILLNEED) block by block from the end of the file. Another process punches holes into the file.
If they race the right way, the following read will show old data instead of zeros. And as I said, I'm able to hit this only if I add an artificial delay between truncating the page cache and actually removing the blocks.
Honza
On 24/09/2019 18:23, Jan Kara wrote:
On Mon 23-09-19 15:33:05, Boaz Harrosh wrote:
On 18/09/2019 15:31, Jan Kara wrote: <>
Is there a test on xfstests to demonstrate this race?
No, but I can try to create one.
I was experimenting with this but I could not reproduce the issue in my test VM without inserting an artificial delay at the appropriate place... So I don't think there's much point in an fstest for this.
Honza
If I understand correctly, you will need threads that direct-write files, then fadvise(WILL_NEED) - in parallel with truncating (punch_hole) these files - in parallel with trashing caches. (Direct-write is so data is not present in the cache when you come to WILL_NEED it into the cache; otherwise the xfs b-trees are not exercised. Or are you more worried about the page cache races?)
What I was testing was: Fill a file with data.
But are you sure data is not in page cache after this stage?
Also, this stage should create multiple extents, perhaps with gaps in between.
One process does fadvise(WILLNEED) block by block from the end of the file. Another process punches holes into the file.
(Perhaps random placement that spans multiple extents in one go)
If they race the right way, the following read will show old data instead of zeros. And as I said, I'm able to hit this only if I add an artificial delay between truncating the page cache and actually removing the blocks.
I was more afraid of iterating over a btree or xarray in parallel with it being destroyed / punched.
I think if you iterate backwards in the WILLNEED case the tree/xarray corruption is less likely.
But now that I think about it, maybe your case is very different from mine, because read_pages() in xfs does take the ilock. I'm not familiar with this code.
Honza
But I guess the window is very small then
Thanks Boaz