On Wed, Nov 21, 2018 at 05:02:13PM -0500, Theodore Y. Ts'o wrote:
On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote:
Thanks applied, this bug was elusive but ever present in recent testing that we did internally, it's been a huge pain in the butt. The symptoms were usually a crash in blk_mq_get_driver_tag() with hctx->tags == NULL, or a crash inside deadline request insert off requeue.
I'm still hitting some weird crashes even with this applied, like this one:
FYI, there are a number of Ubuntu users running 4.19, 4.19.1, and 4.19.2 which have been reporting file system corruption problems. They have a fix of configurations, but one of the things which is seem to be a common factor is they all have CONFIG_SCSI_MQ_DEFAULT disabled. (Which also happens to be how I happen to be running my laptop, and I've noticed no problems.)
One correction to the above --- the people who are having the problem have CONFIG_SCSI_MQ_DEFAULT *enabled* (at least for those who reported the kernel configs --- not all of them did).
I have CONFIG_SCSI_MQ_DEFAULT *disabled*, and things are running just fine on my laptop. Although that may be a red herring, since as you pointed out on the bug NVMe isn't affected by the SCSI_MQ_DEFAULT setting (sorry, I'm still used to a world where SCSI controls the whole world :-). And my laptop is an XPS 13 with an NVMe-attached 1T SSD. Fortunately I've not seen any corruption (or at least nothing visible yet).
Anyway, all of this is in the bug, and I'll see if I can find a way of repro'ing corruption in a KVM or GCE crash-and-burn environment...
- Ted