[BUG]
With my local branch to enable bs > ps support for btrfs, sometimes I hit the following ASSERT() inside submit_one_sector():
ASSERT(block_start != EXTENT_MAP_HOLE);
Please note that it's not yet possible to hit this ASSERT() in the wild, as it requires btrfs bs > ps support, which is not even in the development branch.
On the other hand, there is also a very low chance to hit the above ASSERT() in bs < ps cases, so this is an existing bug affecting not only the incoming bs > ps support but also the existing bs < ps support.
[CAUSE]
Firstly, that ASSERT() means we're trying to submit a dirty block without a real extent map or an ordered extent (OE) backing it.
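To make the failure mode concrete, here is a minimal standalone model of what that ASSERT() guards. This is a hypothetical sketch, not the kernel function; the sentinel value merely mirrors the idea of btrfs' EXTENT_MAP_HOLE:

/*
 * Standalone model (not kernel code): by the time a dirty block
 * reaches submit_one_sector(), writepage_delalloc() should have
 * created an OE for it, so its mapping must not still be a hole.
 */
#include <assert.h>

/* Illustrative sentinel, standing in for btrfs' EXTENT_MAP_HOLE. */
#define EXTENT_MAP_HOLE ((unsigned long long)-3)

static void submit_one_block(unsigned long long block_start)
{
        /* A dirty block that still maps to a hole has no OE: boom. */
        assert(block_start != EXTENT_MAP_HOLE);
        /* ... build and submit the bio for this block ... */
}

int main(void)
{
        /* The buggy path described below hands over a hole-mapped block. */
        submit_one_block(EXTENT_MAP_HOLE);
        return 0;
}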
Furthermore, with extra debugging, the folio triggering the ASSERT() is always larger than the fs block size in my bs > ps setup (8K block size, 4K page size).
After some more debugging, the ASSERT() turned out to be triggered by the following sequence:
extent_writepage()
| We got a 32K folio (4 fs blocks) at file offset 0, and the fs block
| size is 8K, page size is 4K.
| And there is another 8K folio at file offset 32K, which is also
| dirty.
| So the filemap layout looks like the following:
|
| "||" is the folio boundary in the filemap.
| "//" is the dirty range.
|
| 0        8K       16K      24K      32K      40K
| |////////|        |/////////////////||///////|
|
|- writepage_delalloc()
|  |- find_lock_delalloc_range() for [0, 8K)
|  |  Now range [0, 8K) is properly locked.
|  |
|  |- find_lock_delalloc_range() for [16K, 40K)
|  |  |- btrfs_find_delalloc_range() returned range [16K, 40K)
|  |  |- lock_delalloc_folios() succeeded.
|  |  |
|  |  |  The filemap range [32K, 40K) got dropped from filemap.
|  |  |
|  |  |- lock_delalloc_folios() failed with -EAGAIN.
|  |  |  As it failed to lock the folio at [32K, 40K).
|  |  |
|  |  |- loops = 1;
|  |  |- max_bytes = PAGE_SIZE;
|  |  |- goto again;
|  |  |  This will re-do the lookup for dirty delalloc ranges.
|  |  |
|  |  |- btrfs_find_delalloc_range() called with @max_bytes == 4K
|  |  |  This is smaller than block size, so
|  |  |  btrfs_find_delalloc_range() is unable to return any range.
|  |  \- return false;
|
|- Now only range [0, 8K) has an OE for it, but for dirty range
|  [16K, 32K) it's dirty without an OE.
|  This breaks the assumption that writepage_delalloc() will find
|  and lock all dirty ranges inside the folio.
|
|- extent_writepage_io()
   |- submit_one_sector() for [0, 8K)
   |  Succeeded
   |
   |- submit_one_sector() for [16K, 24K)
      Triggering the ASSERT(), as there is no OE, and the original
      extent map is a hole.
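The shrinking of @max_bytes is the key step. Below is a minimal standalone C model of that retry logic (a hedged simplification, not the kernel code; the 128M initial limit and the helper name find_delalloc() are illustrative):

/*
 * Standalone model of the buggy retry in find_lock_delalloc_range():
 * after -EAGAIN the search limit is shrunk to PAGE_SIZE, which for
 * bs > ps drops below the block size, so the re-done search can no
 * longer return any range.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE  4096u
#define BLOCK_SIZE 8192u            /* bs > ps: 8K blocks, 4K pages */

/* Model: a delalloc search can never return less than one block. */
static bool find_delalloc(unsigned int max_bytes)
{
        return max_bytes >= BLOCK_SIZE;
}

int main(void)
{
        unsigned int max_bytes = 128u * 1024 * 1024; /* large initial limit */
        int loops = 0;

again:
        if (!find_delalloc(max_bytes)) {
                printf("search returned nothing, dirty blocks left without OE\n");
                return 0;
        }
        if (!loops) {
                /* Simulate lock_delalloc_folios() hitting -EAGAIN. */
                max_bytes = PAGE_SIZE;  /* buggy: should be the block size */
                loops = 1;
                goto again;
        }
        printf("all dirty blocks were found and locked\n");
        return 0;
}

With the fix below, the shrink becomes the block size instead of PAGE_SIZE, so the second pass in this model still finds the remaining dirty range.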
Please note that this also exposes the same problem for the existing bs < ps support, e.g. with 64K page size and 4K block size.
If we fail to lock a folio and fall into the "loops = 1;" branch, we will re-do the search using 64K (PAGE_SIZE) as @max_bytes, which may again fail to lock the next folio and exit early without handling all dirty blocks inside the folio.
[FIX]
Instead of using the fixed PAGE_SIZE as @max_bytes, use @sectorsize, so that we are guaranteed to find and lock any remaining blocks inside the folio.
And since we're here, add an extra ASSERT() before calling btrfs_find_delalloc_range() to make sure @max_bytes is no smaller than a block, to avoid false negatives.
Cc: stable@vger.kernel.org #5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ca7174fa0240..2fd82055a779 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -393,6 +393,13 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 	/* step one, find a bunch of delalloc bytes starting at start */
 	delalloc_start = *start;
 	delalloc_end = 0;
+
+	/*
+	 * If @max_bytes is smaller than a block, btrfs_find_delalloc_range() can
+	 * return early without handling any dirty ranges.
+	 */
+	ASSERT(max_bytes >= fs_info->sectorsize);
+
 	found = btrfs_find_delalloc_range(tree, &delalloc_start, &delalloc_end,
 					  max_bytes, &cached_state);
 	if (!found || delalloc_end <= *start || delalloc_start > orig_end) {
@@ -423,13 +430,14 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 					  delalloc_end);
 	ASSERT(!ret || ret == -EAGAIN);
 	if (ret == -EAGAIN) {
-		/* some of the folios are gone, lets avoid looping by
-		 * shortening the size of the delalloc range we're searching
+		/*
+		 * Some of the folios are gone, lets avoid looping by
+		 * shortening the size of the delalloc range we're searching.
 		 */
 		btrfs_free_extent_state(cached_state);
 		cached_state = NULL;
 		if (!loops) {
-			max_bytes = PAGE_SIZE;
+			max_bytes = fs_info->sectorsize;
 			loops = 1;
 			goto again;
 		} else {
Somehow the cover letter was not sent correctly, so here it comes:
[CHANGELOG]
v2:
- Add a new patch to fix the incorrect @max_bytes of find_lock_delalloc_range()
  This in fact also fixes a very rare corner case where bs < ps support is also affected.
This allows us to re-enable extra large folios (folio > bs) for bs > ps cases.
RFC->v1:
- Disable extra large folios for bs > ps mounts
  Such extra large folios are larger than a block.
Still debugging, but disabling it makes 8K block size runs survive the full fstests run, with some minor failures due to the limitations.
This may be something affecting regular large folios (folio > bs, but bs <= ps).
This series enables the initial bs > ps support, with several limitations:
- No direct IO support
  All direct IOs fall back to buffered ones.

- No RAID56 support
  Any fs with the RAID56 feature will be rejected at mount time.

- No encoded read/write/send
  Encoded send will fall back to regular send (reading from the page cache).
  Encoded read/write utilized by send/receive will fall back to regular ones.
The above limits are introduced by the fact that we require large folios to cover at least one fs block, so that no block can cross a large folio boundary.
This simplifies our checksum and RAID56 handling.
The problem is, user space programs can only ensure their memory is properly aligned in virtual addresses, but have no control over the backing folios. Thus they can get contiguous virtual memory that is backed by non-contiguous pages.
In that case, it will break the "no block can cross large folio boundaries" assumption, and would require a very complex mechanism to handle checksums, especially for RAID56.
The same applies to encoded send, which uses vmalloc'ed memory.
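As a minimal user-space illustration of the alignment problem (a hypothetical example; the file name and sizes are made up), the buffer below is aligned and contiguous in virtual addresses only, so the kernel may back it with physically non-contiguous pages:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);

        if (fd < 0)
                return 1;
        /* 8K buffer, 8K-aligned, but only in *virtual* addresses. */
        if (posix_memalign(&buf, 8192, 8192))
                return 1;
        memset(buf, 0xaa, 8192);
        /*
         * Direct IO hands these pages straight to the fs. With 8K
         * blocks on 4K pages, one block may span two unrelated
         * physical pages (and thus two folios), which is why direct
         * IO falls back to buffered IO in this series.
         */
        if (write(fd, buf, 8192) != 8192)
                return 1;
        free(buf);
        close(fd);
        return 0;
}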
In the long run, we will need to support all those complex mechanisms.
[FUTURE ROADMAP]
Currently bs > ps support is only for extra compatibility, e.g. allowing x86_64 to mount a btrfs that was originally created on ppc64 (64K page size, 64K block size).
But this should also open a new door to addressing the btrfs RAID56 write hole in the future, by enforcing a larger block size and a fixed power-of-2 number of data stripes, so that every write fills a full stripe, just like RAIDZ.
E.g. with an 8K block size, all data writes are in 8K units, and will always be full stripe writes for a 3-disk RAID5 with a 4K stripe length.
This RAIDZ-like solution would allow a much simpler RAID56 (no more RMW), at the cost of a larger block size (more write amplification, higher memory usage, etc.); the arithmetic is sketched below.
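A tiny standalone sketch of that full-stripe arithmetic (illustrative values only, not btrfs code):

#include <stdio.h>

int main(void)
{
        const unsigned int stripe_len = 4096;  /* per-disk stripe length */
        const unsigned int nr_data = 2;        /* 3-disk RAID5: 2 data + 1 parity */
        const unsigned int block_size = 8192;  /* enforced fs block size */
        const unsigned int full_stripe = stripe_len * nr_data;

        /* Every block-sized write covers whole stripes: no RMW needed. */
        printf("full stripe = %u bytes\n", full_stripe);
        printf("8K writes %s full stripe writes\n",
               (block_size % full_stripe == 0) ? "are always" : "are not always");
        return 0;
}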