ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code; namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best, this medium error may escape to userspace as "attribute not found" when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
	error = xfs_attr_leaf_hasname(args, &bp);
	if (error == -ENOATTR) {
		xfs_trans_brelse(args->trans, bp);
		return error;
	}
because an ENODATA/ENOATTR error from disk leaves us with a null bp, and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level IO functions to trap all disk errors and ensure that we don't let unique errors like this leak up into higher xfs functions - many like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr code, and should be safe to backport to stable kernels. A larger-scope patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: stable@vger.kernel.org # v5.9+
---
(I get it that sprinkling this around to 3 places might have an ick factor but I think it's necessary to make a surgical strike on this bug before we address the general problem.)
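Just to make the userspace-visible confusion concrete (illustration only, not part of the patch; the path and attribute name below are made up), an application doing a plain getxattr(2) has no way to tell a leaked medium error apart from a genuinely missing attribute:

/* Illustration only: what a leaked ENODATA looks like to an application. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
	char buf[256];

	if (argc < 2)
		return 1;

	if (getxattr(argv[1], "user.example", buf, sizeof(buf)) < 0 &&
	    errno == ENODATA)
		/* Missing attribute, or a medium error that leaked through? */
		printf("user.example: %s\n", strerror(ENODATA));

	return 0;
}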
Thanks, -Eric
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index fddb55605e0c..9b54cfb0e13d 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -478,6 +478,12 @@ xfs_attr3_leaf_read(
 	err = xfs_da_read_buf(tp, dp, bno, 0, bpp, XFS_ATTR_FORK,
 			&xfs_attr3_leaf_buf_ops);
+	/*
+	 * ENODATA from disk implies a disk medium failure; ENODATA for
+	 * xattrs means attribute not found, so disambiguate that here.
+	 */
+	if (err == -ENODATA)
+		err = -EIO;
 	if (err || !(*bpp))
 		return err;

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
 				   0, &bp, &xfs_attr3_rmt_buf_ops);
 		if (xfs_metadata_is_sick(error))
 			xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+		/*
+		 * ENODATA from disk implies a disk medium failure;
+		 * ENODATA for xattrs means attribute not found, so
+		 * disambiguate that here.
+		 */
+		if (error == -ENODATA)
+			error = -EIO;
 		if (error)
 			return error;

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
 			&bp, ops);
 	if (xfs_metadata_is_sick(error))
 		xfs_dirattr_mark_sick(dp, whichfork);
+	/*
+	 * ENODATA from disk implies a disk medium failure; ENODATA for
+	 * xattrs means attribute not found, so disambiguate that here.
+	 */
+	if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+		error = -EIO;
 	if (error)
 		goto out_free;
On Thu, Aug 21, 2025 at 04:36:10PM -0500, Eric Sandeen wrote:
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index fddb55605e0c..9b54cfb0e13d 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -478,6 +478,12 @@ xfs_attr3_leaf_read(
 	err = xfs_da_read_buf(tp, dp, bno, 0, bpp, XFS_ATTR_FORK,
 			&xfs_attr3_leaf_buf_ops);
+	/*
+	 * ENODATA from disk implies a disk medium failure; ENODATA for
+	 * xattrs means attribute not found, so disambiguate that here.
+	 */
+	if (err == -ENODATA)
+		err = -EIO;
I think this first chunk isn't needed since you also changed xfs_da_read_buf to filter the error code, right?
(Shifting towards the giant reconsideration of ENODATA -> EIO filtering)
Do we now also need to adjust the online fsck code to turn ENODATA into a corruption report? From __xchk_process_error:
	case -EFSBADCRC:
	case -EFSCORRUPTED:
		/* Note the badness but don't abort. */
		sc->sm->sm_flags |= errflag;
		*error = 0;
		fallthrough;
	default:
		trace_xchk_op_error(sc, agno, bno, *error, ret_ip);
		break;
	}
We only flag corruptions for these two error codes, but ENODATA from the block layer means "critical medium error". I take that to mean the media has permanently lost whatever was persisted there, right?
--D
On Fri, Aug 22, 2025 at 08:21:37AM -0700, Darrick J. Wong wrote:
We only flag corruptions for these two error codes, but ENODATA from the block layer means "critical medium error". I take that to mean the media has permanently lost whatever was persisted there, right?
It can also be a write error. But yes, it's what EIO indicates in general. Which is why I really think we should be doing something like the patch below. But as I don't have the time to fully shepherd this I'm not trying to block this hack, even if I think the issue will continue to bite us in the future.
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f9ef3b2a332a..0252faf038aa 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1290,6 +1290,22 @@ xfs_bwrite(
 	return error;
 }
 
+static int
+xfs_buf_bio_status(
+	struct bio		*bio)
+{
+	switch (bio->bi_status) {
+	case BLK_STS_OK:
+		return 0;
+	case BLK_STS_NOSPC:
+		return -ENOSPC;
+	case BLK_STS_OFFLINE:
+		return -ENODEV;
+	default:
+		return -EIO;
+	}
+}
+
 static void
 xfs_buf_bio_end_io(
 	struct bio		*bio)
@@ -1297,7 +1313,7 @@ xfs_buf_bio_end_io(
 	struct xfs_buf		*bp = bio->bi_private;
 
 	if (bio->bi_status)
-		xfs_buf_ioerror(bp, blk_status_to_errno(bio->bi_status));
+		xfs_buf_ioerror(bp, xfs_buf_bio_status(bio));
 	else if ((bp->b_flags & XBF_WRITE) && (bp->b_flags & XBF_ASYNC) &&
 		 XFS_TEST_ERROR(false, bp->b_mount, XFS_ERRTAG_BUF_IOERROR))
 		xfs_buf_ioerror(bp, -EIO);
On 8/25/25 2:55 AM, Christoph Hellwig wrote:
On Fri, Aug 22, 2025 at 08:21:37AM -0700, Darrick J. Wong wrote:
We only flag corruptions for these two error codes, but ENODATA from the block layer means "critical medium error". I take that to mean the media has permanently lost whatever was persisted there, right?
It can also be a write error. But yes, it's what EIO indicates in general. Which is why I really think we should be doing something like the patch below. But as I don't have the time to fully shepherd this I'm not trying to block this hack, even if I think the issue will continue to bite us in the future.
Right, as I said we need to do something like what you suggest, which is 100% fine by me.
I'd just like a safe, very targeted fix merged first, to handle the currently observed and reported error. A bigger scope change pushed back to stable kernels has somewhat more risk, and I'd like to avoid that.
I'll work on something like this, and send another patch that's something like "xfs: generalize block device error handling" or whatever, that effectively reverts the patch proposed by me, and adds this patch proposed by you.
If it soaks a little and has no regressions, I'd be happy to send the broader fix back to stable as well.
Thanks, -Eric
On Mon, Aug 25, 2025 at 12:55:06AM -0700, Christoph Hellwig wrote:
On Fri, Aug 22, 2025 at 08:21:37AM -0700, Darrick J. Wong wrote:
We only flag corruptions for these two error codes, but ENODATA from the block layer means "critical medium error". I take that to mean the media has permanently lost whatever was persisted there, right?
It can also be a write error. But yes, it's what EIO indicates in general. Which is why I really think we should be doing something like the patch below. But as I don't have the time to fully shepherd this I'm not trying to block this hack, even if I think the issue will continue to bite us in the future.
Yes, it's a bit of a problem, one we'll need to nibble on all over the place to fix all the weird issues before issuing the customary patchbomb for gluttonous consumption on fsdevel. <cough>
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f9ef3b2a332a..0252faf038aa 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1290,6 +1290,22 @@ xfs_bwrite(
 	return error;
 }
 
+static int
+xfs_buf_bio_status(
+	struct bio		*bio)
+{
+	switch (bio->bi_status) {
+	case BLK_STS_OK:
+		return 0;
+	case BLK_STS_NOSPC:
+		return -ENOSPC;
+	case BLK_STS_OFFLINE:
+		return -ENODEV;
+	default:
+		return -EIO;
Well as I pointed out earlier, one interesting "quality" of the current behavior is that online fsck captures the ENODATA and turns that into a metadata corruption report. I'd like to keep that behavior.
+	}
+}
+
 static void
 xfs_buf_bio_end_io(
 	struct bio		*bio)
@@ -1297,7 +1313,7 @@ xfs_buf_bio_end_io(
 	struct xfs_buf		*bp = bio->bi_private;
 
 	if (bio->bi_status)
-		xfs_buf_ioerror(bp, blk_status_to_errno(bio->bi_status));
+		xfs_buf_ioerror(bp, xfs_buf_bio_status(bio));
I think you'd also want to wrap all the submit_bio_wait here too, right?
Hrm, only discard bios, log writes, and zonegc use that function. Maybe not? I think a failed log write takes down the system no matter what error code, nobody cares about failing discard, and I think zonegc write failures just lead to the gc ... aborting?
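If we did want to catch those too, a trivial wrapper would do it (totally hypothetical sketch, name made up, and per the above probably unnecessary):

static int
xfs_submit_bio_wait_status(
	struct bio	*bio)
{
	int	error = submit_bio_wait(bio);

	/* refilter the raw completion status the same way xfs_buf_bio_status() does */
	if (error)
		return xfs_buf_bio_status(bio);
	return 0;
}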
--D
On Mon, Aug 25, 2025 at 08:34:14AM -0700, Darrick J. Wong wrote:
+	case BLK_STS_NOSPC:
+		return -ENOSPC;
+	case BLK_STS_OFFLINE:
+		return -ENODEV;
+	default:
+		return -EIO;
Well as I pointed out earlier, one interesting "quality" of the current behavior is that online fsck captures the ENODATA and turns that into a metadata corruption report. I'd like to keep that behavior.
-EIO is just as much of a metadata corruption, so if you only catch ENODATA you're missing most of them.
 	if (bio->bi_status)
-		xfs_buf_ioerror(bp, blk_status_to_errno(bio->bi_status));
+		xfs_buf_ioerror(bp, xfs_buf_bio_status(bio));
I think you'd also want to wrap all the submit_bio_wait here too, right?
Hrm, only discard bios, log writes, and zonegc use that function. Maybe not? I think a failed log write takes down the system no matter what error code, nobody cares about failing discard, and I think zonegc write failures just lead to the gc ... aborting?
Yes. In Linux -EIO means an unrecoverable I/O error that the lower layers gave up retrying. Not much we can do about that.
On Wed, Aug 27, 2025 at 12:34:44AM -0700, Christoph Hellwig wrote:
On Mon, Aug 25, 2025 at 08:34:14AM -0700, Darrick J. Wong wrote:
+	case BLK_STS_NOSPC:
+		return -ENOSPC;
+	case BLK_STS_OFFLINE:
+		return -ENODEV;
+	default:
+		return -EIO;
Well as I pointed out earlier, one interesting "quality" of the current behavior is that online fsck captures the ENODATA and turns that into a metadata corruption report. I'd like to keep that behavior.
-EIO is just as much of a metadata corruption, so if you only catch ENODATA you're missing most of them.
Hrmm, well an EIO (or an ENODATA) coming from the block layer causes the scrub code to return to userspace with EIO, and xfs_scrub will complain about the IO error and exit.
It doesn't explicitly mark the data structure as corrupt, but scrub failing should be enough to conclude that the fs is corrupt.
I could patch the kernel to set the CORRUPT flag on the data structure and keep going, since the likelihood of random bit errors causing media errors is pretty high now that we have disks that store more than 1e15 bits.
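Roughly something like this (completely untested sketch of __xchk_process_error, and per your point it ought to treat EIO the same way as ENODATA):

	case -ENODATA:
	case -EIO:
	case -EFSBADCRC:
	case -EFSCORRUPTED:
		/* Note the badness but don't abort. */
		sc->sm->sm_flags |= errflag;
		*error = 0;
		fallthrough;
	default:
		trace_xchk_op_error(sc, agno, bno, *error, ret_ip);
		break;
	}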
 	if (bio->bi_status)
-		xfs_buf_ioerror(bp, blk_status_to_errno(bio->bi_status));
+		xfs_buf_ioerror(bp, xfs_buf_bio_status(bio));
I think you'd also want to wrap all the submit_bio_wait here too, right?
Hrm, only discard bios, log writes, and zonegc use that function. Maybe not? I think a failed log write takes down the system no matter what error code, nobody cares about failing discard, and I think zonegc write failures just lead to the gc ... aborting?
Yes. In Linux -EIO means an unrecoverable I/O error that the lower layers gave up retrying. Not much we can do about that.
<nod>
--D