Hi,
This is to report that after jumping from generic kernel 5.15.167 to 5.15.170 I apparently observe ext4 damage.
After a few days of regular daily use of 5.15.170, one morning my ext4 partition refused to mount, complaining about a corrupted system area (-117). There were no unusual events preceding this. The device in question is a laptop with a healthy battery, permanently connected to AC. The laptop is privately owned by me and in daily use at home, so I am 100% aware of everything happening with it. The filesystem in question lives on md raid1 with very asymmetric members (ssd+hdd), so in the event of an emergency cpu halt or some other abnormal stop while the filesystem was actively writing data, one could hardly expect the raid members to stay in perfect sync. After the incident, I've run a raid1 check multiple times, run memtest multiple times from different boot media, and of course consulted smartctl. Nothing. No issues whatsoever except for this spontaneous ext4 damage.
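(For the record, the raid1 re-checks were done via the usual md sysfs interface, roughly like this, md126 being my array:

#echo check > /sys/block/md126/md/sync_action
#cat /sys/block/md126/md/mismatch_cnt
)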
Looking at the git log for ext4 changes between 5.15.167 and 5.15.170 shows a few commits, all of which landed in 5.15.168. Interestingly, one of them is a comeback of the (in)famous 91562895f803 "properly sync file size update after O_SYNC ...", which caused some blowup a year ago due to a "subtle interaction". I've no idea whether 91562895f803 is related to the damage this time or not, but it most definitely looks like some problem was introduced between 5.15.167 and 5.15.170 anyway. And because there are apparently 0 commits to ext4 in 5.15 since 5.15.168 at the moment, I thought I'd report.
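(Roughly, something along these lines against the stable git tree should reproduce what I looked at; including fs/jbd2 in the paths was just my guess at what else might be relevant:

#git log --oneline v5.15.167..v5.15.170 -- fs/ext4 fs/jbd2
)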
Please CC me if you want me to see your reply and/or need more info (I'm not subscribed to the normal flow).
Take care,
Nick
On Thu, Dec 12, 2024 at 09:31:05PM +0300, Nikolai Zhubr wrote:
This is to report that after jumping from generic kernel 5.15.167 to 5.15.170 I apparently observe ext4 damage.
Hi Nick,
In general this is not something that upstream kernel developers will pay a lot of attention to try to root cause. If you can come up with a reliable reproducer, not just a single one-off, it's much more likely that people will pay attention. If you can demonstrate that the reliable reproducer shows the issue on the latest development HEAD of the upstream kernel, they will definitely pay attention.
People will also pay more attention if you give more detail in your message. Not just some vague "ext4 damage" (where 99% of time, these sorts of things happen due to hardware-induced corruption), but the exact message when mount failed.
Also, when reporting ext4 issues, it's helpful to include information about the file system configuration using "dumpe2fs -h /dev/XXX". Extracting kernel log messages that include the string "EXT4-fs", via commands like "sudo dmesg | grep EXT4-fs", "sudo journalctl | grep EXT4-fs", or "grep EXT4-fs /var/log/messages", is also helpful, as is getting a report from fsck via a command like "fsck.ext4 -fn /dev/XXX >& /tmp/fsck.out".
That way they can take a quick look at the information and do an initial triage of the most likely cause.
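Something like the following untested sketch would bundle all of that into a single report you could attach (substitute the real device for /dev/XXX; the -n flag keeps fsck read-only):

#( dumpe2fs -h /dev/XXX; dmesg | grep EXT4-fs; fsck.ext4 -fn /dev/XXX ) > /tmp/ext4-report.txt 2>&1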
And because there are apparently 0 commits to ext4 in 5.15 since 5.15.168 at the moment, I thought I'd report.
Did you check for any changes to the md/dm code, or the block layer? Also, if you checked for I/O errors in the system logs, or ran "smartctl" on the block devices, please say so. (And if there are indications of I/O errors or storage device issues, please do immediate backups and make plans to replace your hardware before you suffer more serious data loss.)
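If you want to rule that out, something like this against the stable tree would show what changed in those areas over the same window (the paths are just the obvious candidates, not an exhaustive list):

#git log --oneline v5.15.167..v5.15.170 -- drivers/md block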
Finally, if you want more support than what volunteers in the upstream linux kernel community can provide, this is what paid support from companies like SuSE or Red Hat can provide.
Cheers,
- Ted
Hi Ted,
On Thu, Dec 12, 2024 at 09:31:05PM +0300, Nikolai Zhubr wrote:
This is to report that after jumping from generic kernel 5.15.167 to 5.15.170 I apparently observe ext4 damage.
Hi Nick,
In general this is not something that upstream kernel developers will pay a lot of attention to try to root cause. If you can come up with
Thanks for a quick and detailed reply; that's really appreciated. I need to clarify: I'm not a hardcore kernel developer at all, I just touch it a little occasionally, for random reasons. Debugging the situation thoroughly so as to find and prove the cause is far beyond my capability, and also not exactly my personal or professional interest. I also don't need any sort of support (i.e. as a client) - I've already repaired and validated/restored from backups almost everything now, and I can just stick with 5.15.167 for basically as long as I like.
On the other hand, having buggy kernels (to the point of ext4 fs corruption) published as suitable for wide general use is not a good thing in my book, so I believe that in the case of reasonable suspicion I should at least raise a warning about it, and if I can somehow contribute to tracking the problem down I'll do what I'm able to.
Not going to argue, but it would seem that if 5.15 is totally out of interest already, why keep patching it? And as long as it keeps receiving patches, supposedly they are backported and applied to stabilize it, not damage it? Ok, nevermind :-)
People will also pay more attention if you give more detail in your message. Not just some vague "ext4 damage" (where 99% of time, these sorts of things happen due to hardware-induced corruption), but the exact message when mount failed.
Yes. That is why I spent two days solely on testing hardware, booting from separate media, stressing everything, and making plenty of copies. As I mentioned in my initial post, this revealed no hardware issues. And I've been enjoying md raid-1 since around 2003 (not on this device, though). I can post all my "smart" values as-is, but I can assure you they are perfectly fine for both raid-1 members. I routinely encounter faulty hdds elsewhere, so it's not something I haven't seen before either.
#smartctl -a /dev/nvme0n1 | grep Spare
Available Spare:            100%
Available Spare Threshold:  10%
#smartctl -a /dev/sda | grep Sector
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   050   Pre-fail  Always   -   0
197 Current_Pending_Sector  0x0032   100   100   000   Old_age   Always   -   0
I have a copy of the entire ext4 partition taken immediately after the mount first failed; it is ~800Gb and may contain some sensitive data, so I cannot just hand it to someone else or publish it for examination. But I can now easily replay the mount failure and the fsck processing as many times as needed. For now, it seems file/dir bodies were not damaged, just some system areas. I've not encountered any file which gave a wrong checksum or otherwise appeared definitely damaged; overall about 95% is verified and definitely fine, and the remaining 5% is hard to reliably verify, but those are less important files.
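(Replaying is just a matter of pointing mount and fsck at the copy, e.g.:

#mount -o ro /dev/sdb1 /mnt/test
#fsck.ext4 -fn /dev/sdb1

where /dev/sdb1 and /mnt/test are of course specific to my setup.)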
Also, when reporting ext4 issues, it's helpful to include information about the file system configuration using "dumpe2fs -h
This is a dump run on a standalone copy taken before repair (after successful raid re-check):
#dumpe2fs -h /dev/sdb1
Filesystem volume name:   DATA
Last mounted on:          /opt
Filesystem UUID:          ea823c6c-500f-4bf0-a4a7-a872ed740af3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              51634176
Block count:              206513920
Reserved block count:     10325696
Overhead clusters:        3292742
Free blocks:              48135978
Free inodes:              50216050
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Tue Jul  9 01:51:16 2024
Last mount time:          Mon Dec  9 10:08:27 2024
Last write time:          Tue Dec 10 04:08:17 2024
Mount count:              273
Maximum mount count:      -1
Last checked:             Tue Jul  9 01:51:16 2024
Check interval:           0 (<none>)
Lifetime writes:          913 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      60bfa28b-cdd2-4ba6-8261-87961db4ecea
Journal backup:           inode blocks
FS Error count:           293
First error time:         Tue Dec 10 06:17:23 2024
First error function:     ext4_lookup
First error line #:       1437
First error inode #:      20709377
Last error time:          Tue Dec 10 21:12:30 2024
Last error function:      ext4_lookup
Last error line #:        1437
Last error inode #:       20709377
Journal features:         journal_incompat_revoke journal_64bit
Total journal size:       128M
Total journal blocks:     32768
Max transaction length:   32768
Fast commit length:       0
Journal sequence:         0x00064c6e
Journal start:            0
/dev/XXX". Extracting kernel log messages that include the string "EXT4-fs", via commands like "sudo dmesg | grep EXT4-fs", or "sudo journalctl | grep EXT4-fs", or "grep EXT4-fs /var/log/messages" are also helpful, as is getting a report from fsck via a command like
#grep EXT4-fs messages-20241212 | grep md126
2024-12-06T11:53:09.471317+03:00 lenovo-zh kernel: [    7.649474][ T1124] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-06T11:53:09.471351+03:00 lenovo-zh kernel: [    7.899321][ T1124] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: noacl. Quota mode: none.
2024-12-07T12:03:18.518047+03:00 lenovo-zh kernel: [    7.633150][ T1106] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-07T12:03:18.518054+03:00 lenovo-zh kernel: [    7.951716][ T1106] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: noacl. Quota mode: none.
2024-12-08T12:41:33.686145+03:00 lenovo-zh kernel: [    7.588405][ T1118] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-08T12:41:33.686148+03:00 lenovo-zh kernel: [    7.679963][ T1118] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: noacl. Quota mode: none.
(* normal boot failed and subsequently fsck was run on real data here *)
2024-12-10T18:21:40.356656+03:00 lenovo-zh kernel: [  483.522025][ T1740] EXT4-fs (md126): failed to initialize system zone (-117)
2024-12-10T18:21:40.356685+03:00 lenovo-zh kernel: [  483.522050][ T1740] EXT4-fs (md126): mount failed
2024-12-11T02:00:18.382301+03:00 lenovo-zh kernel: [  490.551080][ T1809] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
2024-12-11T12:00:53.249626+03:00 lenovo-zh kernel: [    7.550823][ T1056] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-11T12:00:53.249629+03:00 lenovo-zh kernel: [    7.662317][ T1056] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: noacl. Quota mode: none.
#grep md126 messages-20241212
2024-12-07T12:03:18.518038+03:00 lenovo-zh kernel: [    7.154448][  T992] md126: detected capacity change from 0 to 1652111360
2024-12-07T12:03:18.518047+03:00 lenovo-zh kernel: [    7.633150][ T1106] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-07T12:03:18.518054+03:00 lenovo-zh kernel: [    7.951716][ T1106] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: noacl. Quota mode: none.
2024-12-08T12:41:33.685280+03:00 lenovo-zh systemd[1]: Started Timer to wait for more drives before activating degraded array md126..
2024-12-08T12:41:33.685325+03:00 lenovo-zh systemd[1]: mdadm-last-resort@md126.timer: Deactivated successfully.
2024-12-08T12:41:33.685327+03:00 lenovo-zh systemd[1]: Stopped Timer to wait for more drives before activating degraded array md126..
2024-12-08T12:41:33.686136+03:00 lenovo-zh kernel: [    7.346744][ T1107] md/raid1:md126: active with 2 out of 2 mirrors
2024-12-08T12:41:33.686137+03:00 lenovo-zh kernel: [    7.357218][ T1107] md126: detected capacity change from 0 to 1652111360
2024-12-08T12:41:33.686145+03:00 lenovo-zh kernel: [    7.588405][ T1118] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-08T12:41:33.686148+03:00 lenovo-zh kernel: [    7.679963][ T1118] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: noacl. Quota mode: none.
(* on 2024-12-09 system refused to boot and no normal log was written *)
2024-12-10T18:13:44.862091+03:00 lenovo-zh systemd[1]: Started Timer to wait for more drives before activating degraded array md126..
2024-12-10T18:13:45.164589+03:00 lenovo-zh kernel: [    8.332616][ T1248] md/raid1:md126: active with 2 out of 2 mirrors
2024-12-10T18:13:45.196580+03:00 lenovo-zh kernel: [    8.363066][ T1248] md126: detected capacity change from 0 to 1652111360
2024-12-10T18:13:45.469396+03:00 lenovo-zh systemd[1]: mdadm-last-resort@md126.timer: Deactivated successfully.
2024-12-10T18:13:45.469584+03:00 lenovo-zh systemd[1]: Stopped Timer to wait for more drives before activating degraded array md126..
2024-12-10T18:18:51.652575+03:00 lenovo-zh kernel: [  314.821429][ T1657] md: data-check of RAID array md126
2024-12-10T18:21:40.356656+03:00 lenovo-zh kernel: [  483.522025][ T1740] EXT4-fs (md126): failed to initialize system zone (-117)
2024-12-10T18:21:40.356685+03:00 lenovo-zh kernel: [  483.522050][ T1740] EXT4-fs (md126): mount failed
2024-12-10T20:07:29.116652+03:00 lenovo-zh kernel: [ 6832.284366][ T1657] md: md126: data-check done.
(fsck was run on real data here)
2024-12-11T01:52:15.839052+03:00 lenovo-zh systemd[1]: Started Timer to wait for more drives before activating degraded array md126..
2024-12-11T01:52:15.840396+03:00 lenovo-zh kernel: [    7.832271][ T1170] md/raid1:md126: active with 2 out of 2 mirrors
2024-12-11T01:52:15.840397+03:00 lenovo-zh kernel: [    7.845385][ T1170] md126: detected capacity change from 0 to 1652111360
2024-12-11T01:52:16.255454+03:00 lenovo-zh systemd[1]: mdadm-last-resort@md126.timer: Deactivated successfully.
2024-12-11T01:52:16.255573+03:00 lenovo-zh systemd[1]: Stopped Timer to wait for more drives before activating degraded array md126..
2024-12-11T02:00:18.382301+03:00 lenovo-zh kernel: [  490.551080][ T1809] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
"fsck.ext4 -fn /dev/XXX >& /tmp/fsck.out"
This is a fsck run on a standalone copy taken before repair (after successful raid re-check):
#fsck.ext4 -fn /dev/sdb1
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad... trying backup blocks...
Pass 1: Checking inodes, blocks, and sizes
Inode 9185447 extent tree (at level 1) could be narrower. Optimize? no
Inode 9189969 extent tree (at level 1) could be narrower. Optimize? no
Inode 22054610 extent tree (at level 1) could be shorter. Optimize? no
Inode 22959998 extent tree (at level 1) could be shorter. Optimize? no
Inode 23351116 extent tree (at level 1) could be shorter. Optimize? no
Inode 23354700 extent tree (at level 1) could be shorter. Optimize? no
Inode 23363083 extent tree (at level 1) could be shorter. Optimize? no
Inode 25197205 extent tree (at level 1) could be narrower. Optimize? no
Inode 25197271 extent tree (at level 1) could be narrower. Optimize? no
Inode 47710225 extent tree (at level 1) could be narrower. Optimize? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (23414, counted=22437). Fix? no
Free blocks count wrong for group #1 (31644, counted=7). Fix? no
Free blocks count wrong for group #2 (32768, counted=0). Fix? no
Free blocks count wrong for group #3 (31644, counted=4). Fix? no
[repeated tons of times]
Free inodes count wrong for group #4895 (8192, counted=8044). Fix? no
Directories count wrong for group #4895 (0, counted=148). Fix? no
Free inodes count wrong for group #4896 (8192, counted=8114). Fix? no
Directories count wrong for group #4896 (0, counted=13). Fix? no
Free inodes count wrong for group #5824 (8192, counted=8008). Fix? no
Directories count wrong for group #5824 (0, counted=31). Fix? no
Free inodes count wrong (51634165, counted=50157635). Fix? no

DATA: ********** WARNING: Filesystem still has errors **********

DATA: 11/51634176 files (73845.5% non-contiguous), 3292748/206513920 blocks
And because there are apparently 0 commits to ext4 in 5.15 since 5.15.168 at the moment, I thought I'd report.
Did you check for any changes to the md/dm code, or the block layer?
No. Generally, it could be just about anything, so I see no point in even starting without good background knowledge. That is why I'm trying to draw the attention of those who are more aware instead. :-)
Also, if you checked for I/O errors in the system logs, or run "smartctl" on the block devices, please say so. (And if there are indications of I/O errors or storage device issues, please do immediate backups and make plans to replace your hardware before you
I have not found any indication of hardware errors at this point.
#grep -i err messages-20241212 | grep sda
(nothing)
#grep -i err messages-20241212 | grep nvme
(nothing)
Some "smart" values are posted above. Nothing suspicious whatsoever.
Thank you!
Regards,
Nick
suffer more serious data loss.)
Finally, if you want more support than what volunteers in the upstream linux kernel community can provide, this is what paid support from companies like SuSE, or Red Hat, can provide.
Cheers,
- Ted
On Fri, Dec 13, 2024 at 01:49:59PM +0300, Nikolai Zhubr wrote:
Not going to argue, but it'd seem if 5.15 is totally out of interest already, why keep patching it? And as long as it keeps receiving patches, supposedly they are backported and applied to stabilize, not damage it? Ok, nevermind :-)
The Long-Term Stable (LTS) kernels are maintained by the LTS team. A description of how it works can be found here[1].
[1] https://docs.kernel.org/process/2.Process.html#the-big-picture
Subsystems can tag patches sent to the development head by adding "Cc: stable@kernel.org" to the commit description. However, they are not obligated to do that, so there is an auxiliary system which uses AI to intuit which patches might be a bug fix. There are also automated systems that try to figure out which patches might be prerequisites that are needed. This system is very automated, and after the LTS team uses their automated scripts to generate the LTS kernel, it gets published as a release candidate for 48 hours before it gets pushed out.
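For example, if you want to see which of the ext4 commits in that window carried such a tag, something like this against the stable tree would do it (the grep pattern is deliberately loose so it catches the different stable addresses people use):

#git log --oneline --grep='stable@' v5.15.167..v5.15.170 -- fs/ext4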
Kernel developers are not obligated to support LTS kernels. The fact that they tag commits as "you might want to consider it for backporting" might be all they do; and in some cases, not even that. Most kernel maintainers don't even bother testing the LTS candidate releases. (I only started adding automated tests earlier this year to test the LTS release candidates.)
The primary use for LTS kernels is for companies that really don't want to update to newer kernels and have kernel teams who can provide support for the LTS kernels and their customers. So if Amazon, Google, and some Android manufacturers want to keep using 5.15, or 6.1, or 6.6, it's provided as a starting point to make life easier for them, especially in terms of getting security bugs backported.
If the kernel teams for the companies which use the LTS kernels find problems, they can let the LTS team know if there is some regression, or they can manually backport some patch that couldn't be handled by the automated scripts. But it's all on a best-efforts basis.
For hobbyists and indeed most users, what I generally recommend is that they switch to the latest LTS kernel once a year. So for example, the last LTS kernel released in 2023 was 6.6. It looks very much like the last kernel released in 2024 will be 6.12, so that will likely be the next LTS kernel. In general, there is more attention paid to the newer LTS kernels, and although *technically* there are LTS kernels going back to 5.4, pretty much no one pays attention to them other than the companies stubbornly hanging on because they don't have the engineering bandwidth to move to a newer kernel, despite the fact that many security bug fixes never make it all the way back to those ancient kernels.
Yes. That is why I spent two days solely on testing hardware, booting from separate media, stressing everything, and making plenty of copies. As I mentioned in my initial post, this revealed no hardware issues. And I've been enjoying md raid-1 since around 2003 (not on this device, though). I can post all my "smart" values as-is, but I can assure you they are perfectly fine for both raid-1 members. I routinely encounter faulty hdds elsewhere, so it's not something I haven't seen before either.
Note that some hardware errors can be caused by one-off events, such as a cosmic ray causing a bit-flip in a memory DIMM. If that happens, RAID won't save you, since the error was introduced before an updated block group descriptor (for example) gets written. ECC will help; unfortunately, most consumer grade systems don't use ECC. (And by the way, there are systems used in hyperscaler cloud companies which look for CPU-level failures, which can start with silent bit flips leading to crashes or rep-invariant failures, and correlate them with specific CPU cores. For example, see [2].)
[2] https://research.google/pubs/detection-and-prevention-of-silent-data-corrupt...
This is a fsck run on a standalone copy taken before repair (after successful raid re-check):
#fsck.ext4 -fn /dev/sdb1
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad... trying backup blocks...
What this means is that the block group descriptor for one of ext4's block groups has an invalid value for the location of its block allocation bitmap. For example, if one of the high bits in the block allocation bitmap location gets flipped, the block number will be wildly out of range, and so it's something that can be noticed very quickly at mount time. This is a lucky failure, because (a) it gets detected right away, and (b) it can be very easily fixed by consulting one of the backup copies of the block group descriptors. This is what happened in this case, and the rest of the fsck transcript is consistent with that.
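To put numbers on it using your own dumpe2fs output: your filesystem has 206513920 blocks, so if, say, bit 31 of a stored bitmap location gets flipped, the result is immediately and obviously out of range (1234 here is just a made-up example location):

#echo $(( 1234 | (1 << 31) ))
2147484882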
The location of block allocation bitmaps never gets changed, so this sort of thing only happens due to hardware-induced corruption.
Looking at the dumpe2fs output, it looks like the filesystem was created relatively recently (July 2024) but it doesn't have the metadata checksum feature enabled, which has been enabled by default for quite a long time. I'm going to guess that this means you're using a fairly old version of e2fsprogs (it was enabled by default in e2fsprogs 1.43, released in May 2016 [3]).
[3] https://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.43
You got lucky because the block allocation bitmap location was corrupted to an obviously invalid value. But if it had been a low-order bit that had gotten flipped, this could have led to data corruption before the data and metadata corruption became obvious enough that ext4 would flag it. Metadata checksums would catch that kind of error much more quickly --- and this is an example of how RAID arrays shouldn't be treated as a magic bullet.
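If you do decide to turn it on for an existing filesystem, the rough procedure (with a new enough e2fsprogs, on an unmounted filesystem; please double-check the tune2fs man page first) looks something like:

#umount /dev/XXX
#fsck.ext4 -f /dev/XXX
#tune2fs -O metadata_csum /dev/XXX
#fsck.ext4 -f /dev/XXX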
Did you check for any changes to the md/dm code, or the block layer?
No. Generally, it could be just anything, therefore I see no point even starting without good background knowledge. That is why I'm trying to draw attention of those who are more aware instead. :-)
The problem is that there are millions and millions of Linux users. If everyone were to do that, it just wouldn't scale. For companies who don't want to bother with upgrading to newer versions of software, that's why they pay the big bucks to companies like Red Hat or SuSE or Canonical. Or, if you are a platinum level customer of Amazon or Google, you can use Amazon Linux or Google's Container-Optimized OS, and the cloud company's tech support teams will help you out. :-)
Otherwise, I strongly encourage you to learn, and to take responsibility for the health of your own system. And ideally, you can also use that knowledge to help other users out, which is the only way the free-as-in-beer ecosystem can flourish: by having everybody helping each other. Who knows, maybe you could even get a job doing it for a living. :-) :-) :-)
Cheers,
Hi Ted,
On 12/13/24 19:12, Theodore Ts'o wrote:
stable@kernel.org" to the commit description. However, they are not obligated to do that, so there is an auxiliary system which uses AI to intuit which patches might be a bug fix. There are also automated systems that try to figure out which patches might be
Oh, so meanwhile it got even worse than I used to imagine :-) Thanks for pointing that out.
Note that some hardware errors can be caused by one-off errors, such as cosmic rays causing a bit-flip in memory DIMM. If that happens, RAID won't save you, since the error was introduced before an updated
Certainly cosmic rays are a possibility, but based on previous episodes I'd still rather bet on a more usual "subtle interaction" problem, either the exact same one as [1] or something similar. I even tried to run an existing test for this particular case as described in [2], but it is not too user-friendly and somehow exits abnormally without actually doing any interesting work. I'll get back to it later when I have some time.
[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/ [2] https://lwn.net/Articles/954364/
The location of block allocation bitmaps never gets changed, so this sort of thing only happens due to hardware-induced corruption.
Well, unless e.g. some modified sectors start being flushed to random wrong offsets, like in [1] above, or something similar.
Looking at the dumpe2fs output, it looks like it was created relatively recently (July 2024) but it doesn't have the metadata checksum feature enabled, which has been enabled for quite a long
Yes. That was intentional - for better compatibility with even more ancient stuff. Maybe time has come to reconsider the approach though.
You got lucky because the block allocation bitmap location was corrupted to an obviously invalid value. But if it had been a
Absolutely. I was really amazed when I realized that :-) It saved me days or even weeks of unnecessary verification work.
Otherwise, I strongly encourage you to learn, and to take responsibility for the health of your own system. And ideally, you can also use that knowledge to help other users out, which is the only way the free-as-in-beer ecosystem can flourish: by having everybody
True. Generally I try to follow that, as much as appears possible. It is sad that direct end-user-to-developer communication for solving issues is becoming increasingly problematic here. Anyway, thank you for the friendly words, useful hints and good references!
Regards,
Nick
helping each other. Who knows, maybe you could even get a job doing it for a living. :-) :-) :-)
Cheers,
Hi Nikolai!
On Sat 14-12-24 22:58:24, Nikolai Zhubr wrote:
On 12/13/24 19:12, Theodore Ts'o wrote:
Note that some hardware errors can be caused by one-off errors, such as cosmic rays causing a bit-flip in memory DIMM. If that happens, RAID won't save you, since the error was introduced before an updated
Certainly cosmic rays are a possibility, but based on previous episodes I'd still rather bet on a more usual "subtle interaction" problem, either the exact same one as [1] or something similar. I even tried to run an existing test for this particular case as described in [2], but it is not too user-friendly and somehow exits abnormally without actually doing any interesting work. I'll get back to it later when I have some time.
[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/ [2] https://lwn.net/Articles/954364/
The location of block allocation bitmaps never gets changed, so this sort of thing only happens due to hardware-induced corruption.
Well, unless e.g. some modified sectors start being flushed to random wrong offsets, like in [1] above, or something similar.
Note that the above bug led to writing file data to another position within the same file. As such it cannot really lead to metadata corruption. Corrupting data in a file is a relatively frequent event (given the wide variety of manipulations we do with file data). OTOH I've never seen metadata corrupted like this (in particular because ext4 has additional sanity checks that newly allocated blocks don't overlap with critical fs metadata). In theory, there could be a software bug leading to a sector being written to the wrong position, but frankly, in all the cases I've investigated so far such bugs ended up being HW related.
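(For completeness, and if I remember the details correctly: those overlap checks live behind ext4's block_validity mount option, the same machinery that reported "failed to initialize system zone" in your log, so on kernels where it is not enabled by default it can be requested explicitly, e.g.:

#mount -o block_validity /dev/md126 /opt
)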
Otherwise, I strongly encourage you to learn, and to take responsibility for the health of your own system. And ideally, you can also use that knowledge to help other users out, which is the only way the free-as-in-beer ecosystem can flourish: by having everybody
True. Generally I try to follow that, as much as appears possible. It is sad that direct end-user-to-developer communication for solving issues is becoming increasingly problematic here.
On one hand I understand you; on the other hand, back in the good old days (and I remember those as well ;) you wouldn't get much help when running a kernel more than three years old either. And I understand you're running a stable kernel that gets at least some updates, but that's meant more for companies that build their products on top of it and have teams available for debugging issues. For an end user I find some distribution kernels (Debian, Ubuntu, openSUSE, Fedora) more suitable, as they get much more scrutiny before being released than -stable, and also people there are more willing to look at issues with older kernels (that are still supported by the distro).
Honza
....
The location of block allocation bitmaps never gets changed, so this sort of thing only happens due to hardware-induced corruption.
Well, unless e.g. some modified sectors start being flushed to random wrong offsets, like in [1] above, or something similar.
Or cutting the power in the middle of SSD 'wear levelling'.
I've seen a completely trashed disk (sectors in completely the wrong places) after an unexpected power cut.
David
On Mon, Dec 16, 2024 at 03:16:00PM +0000, David Laight wrote:
....
The location of block allocation bitmaps never gets changed, so this sort of thing only happens due to hardware-induced corruption.
Well, unless e.g. some modified sectors start being flushed to random wrong offsets, like in [1] above, or something similar.
Well, in the bug that you referenced in [1], what was happening was that data could get written to the wrong offset in the file under certain race conditions. This would not be a case of a data block getting written over some metadata block like the block group descriptors.
Sectors getting written to the wrong LBAs does happen; there's a reason why enterprise databases include a checksum in every 4k database block. But the root cause of that generally tends to be a bit getting flipped in the LBA number when it is being sent from the CPU to the controller to the storage device. It's rare, but when it does happen, it is more often than not hardware-induced --- and again, one of those things where RAID won't necessarily save you.
Or cutting the power in the middle of SSD 'wear levelling'.
I've seen a completely trashed disk (sectors in completely the wrong places) after an unexpected power cut.
Sure, but that falls into the category of hardware-induced corruption. There have been non-power-fail-certified SSDs which have had their flash translation metadata so badly corrupted that you lose everything. (There's a reason why professional photographers use dual SD card slots, and some may use duct tape to make sure the battery access door won't fly open if their camera gets dropped.)
- Ted