On 06/05/2019 18:07, Song Liu wrote:
>> [...] I understand this could in theory affect all the RAID levels, but in practice I don't think it'll happen. RAID0 is the only "blind" RAID level, in the sense that it's the only one that doesn't care at all about failures. In fact, this was the origin of my other thread [0] about changing raid0's behavior in error cases: it currently does not care about members being removed and relies only on filesystem failures (after submitting many BIOs to the removed device).
>>
>> That said, in this change I've only taken care of raid0, since in my understanding the other levels won't submit BIOs to dead devices; we can experiment to confirm that.
> Could you please run a quick test with raid5? I am wondering whether some race condition could get us into a similar crash. If we cannot easily trigger the bug, we can proceed with this version.
>
> Thanks,
> Song

Hi Song,

I've tested both RAID5 (3 disks, removing one at a time) and RAID1 (2 disks, also removing one at a time); no issues observed on kernel 5.1. We can see one interesting message in the kernel log: "super_written gets error=10", which corresponds to md detecting the error (bi_status == BLK_STS_IOERR) and instantly failing the write, making the FS read-only.
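For reference, that message comes from the super_written() completion callback in drivers/md/md.c; the snippet below is a simplified sketch of its error branch from memory, not the exact 5.1 source. The "error=10" is the raw blk_status_t value printed by the pr_err(), and BLK_STS_IOERR is 10.

/* Simplified sketch of super_written() in drivers/md/md.c (from memory,
 * not the exact 5.1 code): on a failed superblock write, md logs the
 * raw blk_status_t (BLK_STS_IOERR == 10, hence "error=10") and fails
 * the member via md_error().
 */
static void super_written(struct bio *bio)
{
	struct md_rdev *rdev = bio->bi_private;
	struct mddev *mddev = rdev->mddev;

	if (bio->bi_status) {
		pr_err("md: super_written gets error=%d\n", bio->bi_status);
		md_error(mddev, rdev);	/* mark the member Faulty */
	}

	if (atomic_dec_and_test(&mddev->pending_writes))
		wake_up(&mddev->sb_wait);
	rdev_dec_pending(rdev, mddev);
	bio_put(bio);
}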
So I think the issue really happens only in RAID0, which writes "blindly" to its components - roughly as sketched below. Let me know your thoughts - thanks again for your input!
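To make "blindly" concrete, here is a heavily simplified sketch of raid0's bio submission path (drivers/md/raid0.c); it is from memory and omits the splitting, discard and flush handling, so treat it as illustrative rather than the actual raid0_make_request(). The point is that the bio is remapped to the chosen member and resubmitted with no check of that member's state, so a removed device only surfaces later as filesystem I/O errors.

/* Illustrative, simplified sketch of raid0's request path (not the
 * actual drivers/md/raid0.c code): the bio is remapped onto the member
 * that owns the sector and resubmitted, with no check of whether that
 * member is Faulty or has been removed.
 */
static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
{
	struct strip_zone *zone;
	struct md_rdev *tmp_dev;
	sector_t bio_sector = bio->bi_iter.bi_sector;
	sector_t sector = bio_sector;

	/* Striping math only: pick the zone and member for this sector. */
	zone = find_zone(mddev->private, &sector);
	tmp_dev = map_sector(mddev, zone, bio_sector, &sector);

	/* Redirect the bio to the member and submit it unconditionally;
	 * nothing here looks at the member's state, so writes to a
	 * removed device are only caught as I/O errors by the filesystem.
	 */
	bio_set_dev(bio, tmp_dev->bdev);
	bio->bi_iter.bi_sector = sector + zone->dev_start +
		tmp_dev->data_offset;
	generic_make_request(bio);
	return true;
}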
Cheers,
Guilherme