Hi Ted,
On 12/13/24 19:12, Theodore Ts'o wrote:
stable@kernel.org" to the commit description. However, they are not obligated to do that, so there is an auxiliary system which uses AI to intuit which patches might be a bug fix. There are also automated systems that try to automatically figure out which patches might be
Oh, so in the meantime it got even worse than I had imagined :-) Thanks for pointing that out.
Note that some hardware errors can be caused by one-off errors, such as cosmic rays causing a bit-flip in a memory DIMM. If that happens, RAID won't save you, since the error was introduced before an updated
Certainly cosmic rays are a possibility, but based on previous episodes I'd still rather bet on a more usual "subtle interaction" problem, either the exact same one as in [1] or something similar. I even tried to run an existing test for this particular case, as described in [2], but it is not very user-friendly and somehow exits abnormally without actually doing any interesting work. I'll get back to it later when I have some time.
[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
[2] https://lwn.net/Articles/954364/
The location of block allocation bitmaps never gets changed, so this sort of thing only happens due to hardware-induced corruption.
Well, unless e.g. some modified sectors start being flushed to random wrong offsets, like in [1] above, or something similar.
Looking at the dumpe2fs output, it looks like it was created relatively recently (July 2024) but it doesn't have the metadata checksum feature enabled, which has been enabled for quite a long
Yes. That was intentional, for better compatibility with even more ancient stuff. Maybe the time has come to reconsider that approach, though.
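For reference, a minimal sketch of how one might check whether metadata_csum shows up in the "Filesystem features:" line of the dumpe2fs header output. The helper name has_feature and the sample header text are made up for illustration; they are not taken from the filesystem discussed in this thread.

```python
def has_feature(dumpe2fs_header: str, feature: str) -> bool:
    """Return True if `feature` is listed on the 'Filesystem features:' line."""
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "filesystem features":
            # Features are printed as a whitespace-separated list.
            return feature in value.split()
    return False


# Illustrative header excerpt (an assumption, not real output from this report).
SAMPLE = """\
Filesystem volume name:   <none>
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file
"""

if __name__ == "__main__":
    print(has_feature(SAMPLE, "metadata_csum"))
```

In practice the text would come from saving the header portion of the dumpe2fs output for the device in question and passing it to the helper.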
You got lucky because the block allocation bitmap location was corrupted to an obviously invalid value. But if it had been a
Absolutely. I was really amazed when I realized that :-) It saved me days or even weeks of unnecessary verification work.
Otherwise, I strongly encourage you to learn, and to take responsibility for the health of your own system. And ideally, you can also use that knowledge to help other users out, which is the only way the free-as-in-beer ecosystem can flourish; by having everybody
True. Generally I try to follow that, as far as possible. It is sad that direct end-user-to-developer communication for solving issues is becoming increasingly problematic here. Anyway, thank you for the friendly words, useful hints and good references!
Regards,
Nick
helping each other. Who knows, maybe you could even get a job doing it for a living. :-) :-) :-)
Cheers,