On Fri, Dec 13, 2024 at 01:49:59PM +0300, Nikolai Zhubr wrote:
Not going to argue, but it'd seem if 5.15 is totally out of interest already, why keep patching it? And as long as it keeps receiving patches, supposedly they are backported and applied to stabilize, not damage it? Ok, nevermind :-)
The Long-Term Stable (LTS) kernels are maintained by the LTS team. A description of how it works can be found here[1].
[1] https://docs.kernel.org/process/2.Process.html#the-big-picture
Subsystems can tag patches sent to the development head by adding "Cc: stable@kernel.org" to the commit description. However, they are not obligated to do that, so there is an auxillary system which uses AI to intuit which patches might be a bug fix. There is also automated systems that try to automatically figure out which patches might be prerequites that are needed. This system is very automated, and after the LTS team uses their automated scripts to generate the LTS kernel, it gets published as an release candidate for 48 hours before it gets pushed out.
Kernel developers are not obligated to support LTS kernels. The fact that they tag commits as "you might want to consider it for backporting" might be all they do; and in some cases, not even that. Most kernel maintainers don't even bother testing the LTS candidate releases. (I only started adding automated tests earlier this year to test the LTS release candidates.)
The primary use for LTS kernels are for companies that really don't want to update to newer kernels, and have kernel teams who can provide support for the LTS kernels and their customers. So if Amazon, Google, and some Android manufacturers want to keep using 5.15, or 6.1, or 6.6, it's provided as a starting point to make life easier for them, especially in terms of geting security bugs backported.
If the kernel teams for thecompanies which use the LTS kernels find problems, they can let the LTS team know if there is some regression, or they can manually backport some patch that couldn't be handled by the automated scripts. But it's all on a best-efforts basis.
For hobbists and indeed most kernels, what I generally recommend is that they switch to the latest LTS kernel once a year. So for example, the last LTS kernel released in 2023 was 6.6. It looks very much like the last kerel released in 2024 will be 6.12, so that will likely be the next LTS kernel. In general, there is more attention paid to the newer LTS kernels, and although *technically* there are LTS kernels going back to 5.4, pretty much no one pays attention to them other than the companies stubbornly hanging on because they don't have the engineering bandwidth to go to a newer kernel, despite the fact that many security bug fixes never make it all the way back to those ancient kernels.
Yes. That is why I spent 2 days for solely testing hardware, booting from separate media, stressing everything, and making plenty of copies. As I mentioned in my initial post, this had revealed no hardware issues. And I'm enjoying md raid-1 since around 2003 already (Not on this device though). I can post all my "smart" values as is, but I can assure they are perfectly fine for both raid-1 members. I encounter faulty hdds elsewhere routinely so its not something unseen too.
Note that some hardware errors can be caused by one-off errors, such as cosmic rays causing a bit-flip in memory DIMM. If that happens, RAID won't save you, since the error was introduced before an updated block group descriptor (for example) gets written. ECC will help; unfortunately, most consumer grade systems don't use ECC. (And by the way, the are systems used in hyperscaler cloud companies which look for CPU-level failures, which can start with silent bit flips leading to crashes or rep-invariant failures, and correlating them with specific CPU cores. For example, see[2].)
[2] https://research.google/pubs/detection-and-prevention-of-silent-data-corrupt...
This is a fsck run on a standalone copy taken before repair (after successful raid re-check):
#fsck.ext4 -fn /dev/sdb1 ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap fsck.ext4: Group descriptors look bad... trying backup blocks...
What this means is that the block group descriptor has for one of ext4's block groups has the location for its block allcation bitmap to be a invalid value. For example, if one of the high bits in the block allcation gets flipped, the block number will be wildly out of range, and so it's something that can be noticed very quickly at mount time. This is a lucky failure, because (a) it can get detected right away, and (b) it can be very easily fixed by consulting one of the backup copies of the block group descriptors. This is what happened in this case, and rest of fsck transcript is consitent with that.
The location of block allocation bitmaps never gets changed, so this sort of thing only happens due to hardware-induced corruption.
Looking at the dumpe2fs output, it looks like it was created relatively recently (July 2024) but it doesn't have the metadata checksum feature enabled, which has been enabled for quite a long time. I'm going to guess that this means that you're using a fairly old version version of e2fsprogs (it was enabled by default in e2fsprogs 1.43, released in May 2016[3]).
[3] https://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.43
You got lucky because it block allocation bitmap location was corrupted to an obviously invalid value. But if it had been a low-order bit that had gotten flipped this could have lead to data corruption before the data and metadata corruption became obvious enough that ext4 would flag it. Metadata checksums would catch that kind of error much more quickly --- and is an example of how RAID arrays shouldn't be treated as a magic bullet.
Did you check for any changes to the md/dm code, or the block layer?
No. Generally, it could be just anything, therefore I see no point even starting without good background knowledge. That is why I'm trying to draw attention of those who are more aware instead. :-)
The problem is that there are millions and millions of Linux users. If everyone were do that, it just wouldn't scale. For companies who don't want to bother with upgrading to newer versions of software, that's why they pay the big bucks to companies like Red Hat or SuSE or Canonical. Or if you are a platinum level customer for Amazon or Google, you can use Amazon Linux or Google's Container-Optimized OS, and the cloud company's tech support teams will help you out. :-)
Otherwise, I strongly encourage you to learn, and to take responsibility for the health of your own system. And ideally, you can also use that knowledge to help other users out, which is the only way the free-as-in-beer ecosystem can flurish; by having everybody helping each other. Who knows, maybe you could even get a job doing it for a living. :-) :-) :-)
Cheers,