On Sun, Dec 09, 2018 at 11:44:19AM -0500, Theodore Y. Ts'o wrote:
> On Sun, Dec 09, 2018 at 12:30:39PM +0100, Greg KH wrote:
> > > P.P.P.S. If I were king, I'd be asking for a huge number of kunit
> > > tests for block-mq to be developed, and then running them under a
> > > Thread Sanitizer.
> >
> > Isn't that what xfstests and fio are? Aren't we running those all the
> > time and reporting the issues they find? How did this bug not show up
> > in those tests? Is it just because they didn't run long enough?
> >
> > Because of those test suites, I was thinking that the block and
> > filesystem paths were among the better-tested things we had at the
> > moment. Is this not true?
>
> I'm pretty confident about the file system paths, and the "happy
> paths" for the block layer.
>
> But with Kernel Bugzilla #201685, despite huge amounts of testing
> both before and after 4.19-rc1, nothing picked it up. It turned out
> to be very configuration-specific, *and* it only happened when you
> were under heavy memory pressure and/or I/O pressure.
>
> I'm starting to try to use blktests, but it's not as mature as
> xfstests. It has portability issues, as it assumes a much newer
> userspace, so I can't even run it in some environments at all.
> The test coverage also just isn't as broad. Compare:
>
> ext4/4k: 441 tests, 1 failures, 42 skipped, 4387 seconds
> Failures: generic/388
>
> Versus:
>
> Run: block/001 block/002 block/003 block/004 block/005 block/006
> block/009 block/010 block/012 block/013 block/014 block/015
> block/016 block/017 block/018 block/020 block/021 block/023
> block/024 loop/001 loop/002 loop/003 loop/004 loop/005 loop/006
> nvme/002 nvme/003 nvme/004 nvme/006 nvme/007 nvme/008 nvme/009
> nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
> nvme/017 nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024
> nvme/025 nvme/026 nvme/027 nvme/028 scsi/001 scsi/002 scsi/003
> scsi/004 scsi/005 scsi/006 srp/001 srp/002 srp/003 srp/004
> srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failures: block/017 block/024 nvme/002 nvme/003 nvme/008 nvme/009
> nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
> nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024 nvme/025
> nvme/026 nvme/027 nvme/028 scsi/006 srp/001 srp/002 srp/003 srp/004
> srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failed 37 of 69 tests
>
> (Most of the failures are test portability issues that I still need to
> work through, not real failures. But just look at the number of
> tests....)
So you are saying quantity rules over quality? :)
It's really hard to judge this, given that xfstests tests a whole
range of other things (POSIX compliance and stressing the VFS API),
while blktests is there to stress the block I/O API/interface.
So it would be best to run both, as we know xfstests also hits the
block layer...
thanks,
greg k-h
The latest file system corruption issue (nominally fixed by
ffe81d45322c ("blk-mq: fix corruption with direct issue") and later
fixed by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch
list")) brought a lot of rightfully concerned users asking about
release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to
4.19.3 on Nov 23. When the issue started getting visibility,
users were left choosing between running known-EOL 4.18.x
kernels and running a 4.19 series that could corrupt their
data. Admittedly, the risk of running the EOL kernel was pretty
low given how recent it was, but it's still not a great look
to tell people to run something marked EOL.
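For anyone trying to follow what "punt failed direct issue to
dispatch list" actually means, here is a rough sketch of the shape of
the fix, simplified from memory of the 4.20-era blk-mq code (srcu
locking and some details omitted), so treat the exact signatures as
approximate; this is not the actual upstream diff:

/*
 * Rough sketch only: on a resource error from direct issue, punt the
 * request to the hctx dispatch list instead of reinserting it through
 * the scheduler, since the driver may already have set up state for it
 * (e.g. SG tables) and a later merge would not be reflected in that
 * state, which is what corrupted data.
 */
static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
                                      struct request *rq, blk_qc_t *cookie)
{
        blk_status_t ret;

        ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
        if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
                blk_mq_request_bypass_insert(rq, true);
        else if (ret != BLK_STS_OK)
                blk_mq_end_request(rq, ret);
}

The point is that a request which has already been through ->queue_rq
and failed with a resource error goes straight onto the hardware
queue's dispatch list, where it can't be merged with other I/O, rather
than back through the scheduler insert path.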
I'm wondering if there's anything we can do to make things easier
on kernel consumers. Bugs will certainly happen, but they make it
really hard to push the "always run the latest stable" narrative
if there isn't a good fallback when things go seriously wrong. I
don't actually have a great proposal for a solution here other than
retroactively bringing back 4.18 (which I don't think Greg would
like) but I figured I should at least bring it up.
Thanks,
Laura