The libnvdimm subsystem uses async operations to parallelize device probing operations and to allow sysfs to trigger device_unregister() on deleted namepsaces. A multithreaded stress test of the libnvdimm sysfs interface uncovered a case where device_unregister() is triggered multiple times, and the subsequent investigation uncovered a broken locking scenario.
The lack of lockdep coverage for device_lock() stymied the debug. That is, until patch6 "driver-core, libnvdimm: Let device subsystems add local lockdep coverage" solved that with a shadow lock, with lockdep coverage, to mirror device_lock() operations. Given the time saved with shadow-lock debug-hack, patch6 attempts to generalize device_lock() debug facility that might be able to be carried upstream. Patch6 is staged at the end of this fix series in case it is contentious and needs to be dropped.
Patch1 "drivers/base: Introduce kill_device()" could be achieved with local libnvdimm infrastructure. However, the existing 'dead' flag in 'struct device_private' aims to solve similar async register/unregister races so the fix in patch2 "libnvdimm/bus: Prevent duplicate device_unregister() calls" can be implemented with existing driver-core infrastructure.
Patch3 is a rare lockdep warning that is intermittent based on namespaces racing ahead of the completion of probe of their parent region. It is not related to the other fixes, it just happened to trigger as a result of the async stress test.
Patch4 and patch5 address an ABBA deadlock tripped by the stress test.
These patches pass the failing stress test and the existing libnvdimm unit tests with CONFIG_PROVE_LOCKING=y and the new "dev->lockdep_mutex" shadow lock with no lockdep warnings.
---
Dan Williams (6): drivers/base: Introduce kill_device() libnvdimm/bus: Prevent duplicate device_unregister() calls libnvdimm/region: Register badblocks before namespaces libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl() libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock driver-core, libnvdimm: Let device subsystems add local lockdep coverage
drivers/acpi/nfit/core.c | 28 ++++--- drivers/acpi/nfit/nfit.h | 24 ++++++ drivers/base/core.c | 30 ++++++-- drivers/nvdimm/btt_devs.c | 16 ++-- drivers/nvdimm/bus.c | 154 +++++++++++++++++++++++++++------------ drivers/nvdimm/core.c | 10 +-- drivers/nvdimm/dimm_devs.c | 4 + drivers/nvdimm/namespace_devs.c | 36 +++++---- drivers/nvdimm/nd-core.h | 71 ++++++++++++++++++ drivers/nvdimm/pfn_devs.c | 24 +++--- drivers/nvdimm/pmem.c | 4 + drivers/nvdimm/region.c | 24 +++--- drivers/nvdimm/region_devs.c | 12 ++- include/linux/device.h | 6 ++ 14 files changed, 308 insertions(+), 135 deletions(-)
The libnvdimm subsystem arranges for devices to be destroyed as a result of a sysfs operation. Since device_unregister() cannot be called from an actively running sysfs attribute of the same device libnvdimm arranges for device_unregister() to be performed in an out-of-line async context.
The driver core maintains a 'dead' state for coordinating its own racing async registration / de-registration requests. Rather than add local 'dead' state tracking infrastructure to libnvdimm device objects, export the existing state tracking via a new kill_device() helper.
The kill_device() helper simply marks the device as dead, i.e. that it is on its way to device_del(), or returns that the device was already dead. This can be used in advance of calling device_unregister() for subsystems like libnvdimm that might need to handle multiple user threads racing to delete a device.
This refactoring does not change any behavior, but it is a pre-requisite for follow-on fixes and therefore marked for -stable.
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Rafael J. Wysocki" rafael@kernel.org Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/base/core.c | 27 +++++++++++++++++++-------- include/linux/device.h | 1 + 2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c index fd7511e04e62..eaf3aa0cb803 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -2211,6 +2211,24 @@ void put_device(struct device *dev) } EXPORT_SYMBOL_GPL(put_device);
+bool kill_device(struct device *dev) +{ + /* + * Require the device lock and set the "dead" flag to guarantee that + * the update behavior is consistent with the other bitfields near + * it and that we cannot have an asynchronous probe routine trying + * to run while we are tearing out the bus/class/sysfs from + * underneath the device. + */ + lockdep_assert_held(&dev->mutex); + + if (dev->p->dead) + return false; + dev->p->dead = true; + return true; +} +EXPORT_SYMBOL_GPL(kill_device); + /** * device_del - delete device from system. * @dev: device. @@ -2230,15 +2248,8 @@ void device_del(struct device *dev) struct kobject *glue_dir = NULL; struct class_interface *class_intf;
- /* - * Hold the device lock and set the "dead" flag to guarantee that - * the update behavior is consistent with the other bitfields near - * it and that we cannot have an asynchronous probe routine trying - * to run while we are tearing out the bus/class/sysfs from - * underneath the device. - */ device_lock(dev); - dev->p->dead = true; + kill_device(dev); device_unlock(dev);
/* Notify clients of device removal. This call must come diff --git a/include/linux/device.h b/include/linux/device.h index e85264fb6616..0da5c67f6be1 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1373,6 +1373,7 @@ extern int (*platform_notify_remove)(struct device *dev); */ extern struct device *get_device(struct device *dev); extern void put_device(struct device *dev); +extern bool kill_device(struct device *dev);
#ifdef CONFIG_DEVTMPFS extern int devtmpfs_create_node(struct device *dev);
A multithreaded namespace creation/destruction stress test currently fails with signatures like the following:
sysfs group 'power' not found for kobject 'dax1.1' RIP: 0010:sysfs_remove_group+0x76/0x80 Call Trace: device_del+0x73/0x370 device_unregister+0x16/0x50 nd_async_device_unregister+0x1e/0x30 [libnvdimm] async_run_entry_fn+0x39/0x160 process_one_work+0x23c/0x5e0 worker_thread+0x3c/0x390
BUG: kernel NULL pointer dereference, address: 0000000000000020 RIP: 0010:klist_put+0x1b/0x6c Call Trace: klist_del+0xe/0x10 device_del+0x8a/0x2c9 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 device_unregister+0x44/0x4f nd_async_device_unregister+0x22/0x2d [libnvdimm] async_run_entry_fn+0x47/0x15a process_one_work+0x1a2/0x2eb worker_thread+0x1b8/0x26e
Use the kill_device() helper to atomically resolve the race of multiple threads issuing kill, device_unregister(), requests.
Reported-by: Jane Chu jane.chu@oracle.com Reported-by: Erwin Tsaur erwin.tsaur@oracle.com Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...") Cc: stable@vger.kernel.org Link: https://github.com/pmem/ndctl/issues/96 Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/nvdimm/bus.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c index 2dca3034fee0..42713b210f51 100644 --- a/drivers/nvdimm/bus.c +++ b/drivers/nvdimm/bus.c @@ -547,13 +547,38 @@ EXPORT_SYMBOL(nd_device_register);
void nd_device_unregister(struct device *dev, enum nd_async_mode mode) { + bool killed; + switch (mode) { case ND_ASYNC: + /* + * In the async case this is being triggered with the + * device lock held and the unregistration work needs to + * be moved out of line iff this is thread has won the + * race to schedule the deletion. + */ + if (!kill_device(dev)) + return; + get_device(dev); async_schedule_domain(nd_async_device_unregister, dev, &nd_async_domain); break; case ND_SYNC: + /* + * In the sync case the device is being unregistered due + * to a state change of the parent. Claim the kill state + * to synchronize against other unregistration requests, + * or otherwise let the async path handle it if the + * unregistration was already queued. + */ + device_lock(dev); + killed = kill_device(dev); + device_unlock(dev); + + if (!killed) + return; + nd_synchronize(); device_unregister(dev); break;
Namespace activation expects to be able to reference region badblocks. The following warning sometimes triggers when asynchronous namespace activation races in front of the completion of namespace probing. Move all possible namespace probing after region badblocks initialization.
Otherwise, lockdep sometimes catches the uninitialized state of the badblocks seqlock with stack trace signatures like:
INFO: trying to register non-static key. pmem2: detected capacity change from 0 to 136365211648 the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted: G OE 5.2.0-rc4+ #3382 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: events_unbound async_run_entry_fn Call Trace: dump_stack+0x85/0xc0 pmem1.12: detected capacity change from 0 to 8589934592 register_lock_class+0x56a/0x570 ? check_object+0x140/0x270 __lock_acquire+0x80/0x1710 ? __mutex_lock+0x39d/0x910 lock_acquire+0x9e/0x180 ? nd_pfn_validate+0x28f/0x440 [libnvdimm] badblocks_check+0x93/0x1f0 ? nd_pfn_validate+0x28f/0x440 [libnvdimm] nd_pfn_validate+0x28f/0x440 [libnvdimm] ? lockdep_hardirqs_on+0xf0/0x180 nd_dax_probe+0x9a/0x120 [libnvdimm] nd_pmem_probe+0x6d/0x180 [nd_pmem] nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...") Cc: stable@vger.kernel.org Cc: Vishal Verma vishal.l.verma@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/nvdimm/region.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c index ef46cc3a71ae..488c47ac4c4a 100644 --- a/drivers/nvdimm/region.c +++ b/drivers/nvdimm/region.c @@ -34,17 +34,6 @@ static int nd_region_probe(struct device *dev) if (rc) return rc;
- rc = nd_region_register_namespaces(nd_region, &err); - if (rc < 0) - return rc; - - ndrd = dev_get_drvdata(dev); - ndrd->ns_active = rc; - ndrd->ns_count = rc + err; - - if (rc && err && rc == err) - return -ENODEV; - if (is_nd_pmem(&nd_region->dev)) { struct resource ndr_res;
@@ -60,6 +49,17 @@ static int nd_region_probe(struct device *dev) nvdimm_badblocks_populate(nd_region, &nd_region->bb, &ndr_res); }
+ rc = nd_region_register_namespaces(nd_region, &err); + if (rc < 0) + return rc; + + ndrd = dev_get_drvdata(dev); + ndrd->ns_active = rc; + ndrd->ns_count = rc + err; + + if (rc && err && rc == err) + return -ENODEV; + nd_region->btt_seed = nd_btt_create(nd_region); nd_region->pfn_seed = nd_pfn_create(nd_region); nd_region->dax_seed = nd_dax_create(nd_region);
On Tue, 2019-06-11 at 16:25 -0700, Dan Williams wrote:
Namespace activation expects to be able to reference region badblocks. The following warning sometimes triggers when asynchronous namespace activation races in front of the completion of namespace probing. Move all possible namespace probing after region badblocks initialization.
Otherwise, lockdep sometimes catches the uninitialized state of the badblocks seqlock with stack trace signatures like:
INFO: trying to register non-static key. pmem2: detected capacity change from 0 to 136365211648 the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted:
G OE 5.2.0-rc4+ #3382 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: events_unbound async_run_entry_fn Call Trace: dump_stack+0x85/0xc0 pmem1.12: detected capacity change from 0 to 8589934592 register_lock_class+0x56a/0x570 ? check_object+0x140/0x270 __lock_acquire+0x80/0x1710 ? __mutex_lock+0x39d/0x910 lock_acquire+0x9e/0x180 ? nd_pfn_validate+0x28f/0x440 [libnvdimm] badblocks_check+0x93/0x1f0 ? nd_pfn_validate+0x28f/0x440 [libnvdimm] nd_pfn_validate+0x28f/0x440 [libnvdimm] ? lockdep_hardirqs_on+0xf0/0x180 nd_dax_probe+0x9a/0x120 [libnvdimm] nd_pmem_probe+0x6d/0x180 [nd_pmem] nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...") Cc: stable@vger.kernel.org Cc: Vishal Verma vishal.l.verma@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com
drivers/nvdimm/region.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-)
This looks good to me, Reviewed-by: Vishal Verma vishal.l.verma@intel.com
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c index ef46cc3a71ae..488c47ac4c4a 100644 --- a/drivers/nvdimm/region.c +++ b/drivers/nvdimm/region.c @@ -34,17 +34,6 @@ static int nd_region_probe(struct device *dev) if (rc) return rc;
- rc = nd_region_register_namespaces(nd_region, &err);
- if (rc < 0)
return rc;
- ndrd = dev_get_drvdata(dev);
- ndrd->ns_active = rc;
- ndrd->ns_count = rc + err;
- if (rc && err && rc == err)
return -ENODEV;
- if (is_nd_pmem(&nd_region->dev)) { struct resource ndr_res;
@@ -60,6 +49,17 @@ static int nd_region_probe(struct device *dev) nvdimm_badblocks_populate(nd_region, &nd_region->bb, &ndr_res); }
- rc = nd_region_register_namespaces(nd_region, &err);
- if (rc < 0)
return rc;
- ndrd = dev_get_drvdata(dev);
- ndrd->ns_active = rc;
- ndrd->ns_count = rc + err;
- if (rc && err && rc == err)
return -ENODEV;
- nd_region->btt_seed = nd_btt_create(nd_region); nd_region->pfn_seed = nd_pfn_create(nd_region); nd_region->dax_seed = nd_dax_create(nd_region);
In preparation for fixing a deadlock between wait_for_bus_probe_idle() and the nvdimm_bus_list_mutex arrange for __nd_ioctl() without nvdimm_bus_list_mutex held. This also unifies the 'dimm' and 'bus' level ioctls into a common nd_ioctl() preamble implementation.
Marked for -stable as it is a pre-requisite for a follow-on fix.
Cc: stable@vger.kernel.org Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation") Cc: Vishal Verma vishal.l.verma@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/nvdimm/bus.c | 96 ++++++++++++++++++++++++++++------------------ drivers/nvdimm/nd-core.h | 3 + 2 files changed, 60 insertions(+), 39 deletions(-)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c index 42713b210f51..dfb93228d6a7 100644 --- a/drivers/nvdimm/bus.c +++ b/drivers/nvdimm/bus.c @@ -73,7 +73,7 @@ static void nvdimm_bus_probe_end(struct nvdimm_bus *nvdimm_bus) { nvdimm_bus_lock(&nvdimm_bus->dev); if (--nvdimm_bus->probe_active == 0) - wake_up(&nvdimm_bus->probe_wait); + wake_up(&nvdimm_bus->wait); nvdimm_bus_unlock(&nvdimm_bus->dev); }
@@ -341,7 +341,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device *parent, return NULL; INIT_LIST_HEAD(&nvdimm_bus->list); INIT_LIST_HEAD(&nvdimm_bus->mapping_list); - init_waitqueue_head(&nvdimm_bus->probe_wait); + init_waitqueue_head(&nvdimm_bus->wait); nvdimm_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL); if (nvdimm_bus->id < 0) { kfree(nvdimm_bus); @@ -426,6 +426,9 @@ static int nd_bus_remove(struct device *dev) list_del_init(&nvdimm_bus->list); mutex_unlock(&nvdimm_bus_list_mutex);
+ wait_event(nvdimm_bus->wait, + atomic_read(&nvdimm_bus->ioctl_active) == 0); + nd_synchronize(); device_for_each_child(&nvdimm_bus->dev, NULL, child_unregister);
@@ -885,7 +888,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev) if (nvdimm_bus->probe_active == 0) break; nvdimm_bus_unlock(&nvdimm_bus->dev); - wait_event(nvdimm_bus->probe_wait, + wait_event(nvdimm_bus->wait, nvdimm_bus->probe_active == 0); nvdimm_bus_lock(&nvdimm_bus->dev); } while (true); @@ -1115,24 +1118,10 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm, return rc; }
-static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) -{ - long id = (long) file->private_data; - int rc = -ENXIO, ro; - struct nvdimm_bus *nvdimm_bus; - - ro = ((file->f_flags & O_ACCMODE) == O_RDONLY); - mutex_lock(&nvdimm_bus_list_mutex); - list_for_each_entry(nvdimm_bus, &nvdimm_bus_list, list) { - if (nvdimm_bus->id == id) { - rc = __nd_ioctl(nvdimm_bus, NULL, ro, cmd, arg); - break; - } - } - mutex_unlock(&nvdimm_bus_list_mutex); - - return rc; -} +enum nd_ioctl_mode { + BUS_IOCTL, + DIMM_IOCTL, +};
static int match_dimm(struct device *dev, void *data) { @@ -1147,31 +1136,62 @@ static int match_dimm(struct device *dev, void *data) return 0; }
-static long nvdimm_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg, + enum nd_ioctl_mode mode) + { - int rc = -ENXIO, ro; - struct nvdimm_bus *nvdimm_bus; + struct nvdimm_bus *nvdimm_bus, *found = NULL; + long id = (long) file->private_data; + struct nvdimm *nvdimm = NULL; + int rc, ro;
ro = ((file->f_flags & O_ACCMODE) == O_RDONLY); mutex_lock(&nvdimm_bus_list_mutex); list_for_each_entry(nvdimm_bus, &nvdimm_bus_list, list) { - struct device *dev = device_find_child(&nvdimm_bus->dev, - file->private_data, match_dimm); - struct nvdimm *nvdimm; - - if (!dev) - continue; + if (mode == DIMM_IOCTL) { + struct device *dev; + + dev = device_find_child(&nvdimm_bus->dev, + file->private_data, match_dimm); + if (!dev) + continue; + nvdimm = to_nvdimm(dev); + found = nvdimm_bus; + } else if (nvdimm_bus->id == id) { + found = nvdimm_bus; + }
- nvdimm = to_nvdimm(dev); - rc = __nd_ioctl(nvdimm_bus, nvdimm, ro, cmd, arg); - put_device(dev); - break; + if (found) { + atomic_inc(&nvdimm_bus->ioctl_active); + break; + } } mutex_unlock(&nvdimm_bus_list_mutex);
+ if (!found) + return -ENXIO; + + nvdimm_bus = found; + rc = __nd_ioctl(nvdimm_bus, nvdimm, ro, cmd, arg); + + if (nvdimm) + put_device(&nvdimm->dev); + if (atomic_dec_and_test(&nvdimm_bus->ioctl_active)) + wake_up(&nvdimm_bus->wait); + return rc; }
+static long bus_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + return nd_ioctl(file, cmd, arg, BUS_IOCTL); +} + +static long dimm_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + return nd_ioctl(file, cmd, arg, DIMM_IOCTL); +} + static int nd_open(struct inode *inode, struct file *file) { long minor = iminor(inode); @@ -1183,16 +1203,16 @@ static int nd_open(struct inode *inode, struct file *file) static const struct file_operations nvdimm_bus_fops = { .owner = THIS_MODULE, .open = nd_open, - .unlocked_ioctl = nd_ioctl, - .compat_ioctl = nd_ioctl, + .unlocked_ioctl = bus_ioctl, + .compat_ioctl = bus_ioctl, .llseek = noop_llseek, };
static const struct file_operations nvdimm_fops = { .owner = THIS_MODULE, .open = nd_open, - .unlocked_ioctl = nvdimm_ioctl, - .compat_ioctl = nvdimm_ioctl, + .unlocked_ioctl = dimm_ioctl, + .compat_ioctl = dimm_ioctl, .llseek = noop_llseek, };
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h index 391e88de3a29..6cd470547106 100644 --- a/drivers/nvdimm/nd-core.h +++ b/drivers/nvdimm/nd-core.h @@ -17,10 +17,11 @@ extern struct workqueue_struct *nvdimm_wq;
struct nvdimm_bus { struct nvdimm_bus_descriptor *nd_desc; - wait_queue_head_t probe_wait; + wait_queue_head_t wait; struct list_head list; struct device dev; int id, probe_active; + atomic_t ioctl_active; struct list_head mapping_list; struct mutex reconfig_mutex; struct badrange badrange;
A multithreaded namespace creation/destruction stress test currently deadlocks with the following lockup signature:
INFO: task ndctl:2924 blocked for more than 122 seconds. Tainted: G OE 5.2.0-rc4+ #3382 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ndctl D 0 2924 1176 0x00000000 Call Trace: ? __schedule+0x27e/0x780 schedule+0x30/0xb0 wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm] ? finish_wait+0x80/0x80 uuid_store+0xe6/0x2e0 [libnvdimm] kernfs_fop_write+0xf0/0x1a0 vfs_write+0xb7/0x1b0 ksys_write+0x5c/0xd0 do_syscall_64+0x60/0x240
INFO: task ndctl:2923 blocked for more than 122 seconds. Tainted: G OE 5.2.0-rc4+ #3382 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ndctl D 0 2923 1175 0x00000000 Call Trace: ? __schedule+0x27e/0x780 ? __mutex_lock+0x489/0x910 schedule+0x30/0xb0 schedule_preempt_disabled+0x11/0x20 __mutex_lock+0x48e/0x910 ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] ? __lock_acquire+0x23f/0x1710 ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] __dax_pmem_probe+0x5e/0x210 [dax_pmem_core] ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm] dax_pmem_probe+0xc/0x20 [dax_pmem] nvdimm_bus_probe+0x90/0x2c0 [libnvdimm] really_probe+0xef/0x390 driver_probe_device+0xb4/0x100
In this sequence an 'nd_dax' device is being probed and trying to take the lock on its backing namespace to validate that the 'nd_dax' device indeed has exclusive access to the backing namespace. Meanwhile, another thread is trying to update the uuid property of that same backing namespace. So one thread is in the probe path trying to acquire the lock, and the other thread has acquired the lock and tries to flush the probe path.
Fix this deadlock by not holding the namespace device_lock over the wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and subsequently dropped internally to wait_nvdimm_bus_probe_idle().
Cc: stable@vger.kernel.org Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation") Cc: Vishal Verma vishal.l.verma@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/nvdimm/bus.c | 17 +++++++++++------ drivers/nvdimm/region_devs.c | 4 ++++ 2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c index dfb93228d6a7..c1d26fca9c4c 100644 --- a/drivers/nvdimm/bus.c +++ b/drivers/nvdimm/bus.c @@ -887,10 +887,12 @@ void wait_nvdimm_bus_probe_idle(struct device *dev) do { if (nvdimm_bus->probe_active == 0) break; - nvdimm_bus_unlock(&nvdimm_bus->dev); + nvdimm_bus_unlock(dev); + device_unlock(dev); wait_event(nvdimm_bus->wait, nvdimm_bus->probe_active == 0); - nvdimm_bus_lock(&nvdimm_bus->dev); + device_lock(dev); + nvdimm_bus_lock(dev); } while (true); }
@@ -1017,7 +1019,7 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm, case ND_CMD_ARS_START: case ND_CMD_CLEAR_ERROR: case ND_CMD_CALL: - dev_dbg(&nvdimm_bus->dev, "'%s' command while read-only.\n", + dev_dbg(dev, "'%s' command while read-only.\n", nvdimm ? nvdimm_cmd_name(cmd) : nvdimm_bus_cmd_name(cmd)); return -EPERM; @@ -1088,7 +1090,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm, goto out; }
- nvdimm_bus_lock(&nvdimm_bus->dev); + device_lock(dev); + nvdimm_bus_lock(dev); rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, func, buf); if (rc) goto out_unlock; @@ -1103,7 +1106,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm, nvdimm_account_cleared_poison(nvdimm_bus, clear_err->address, clear_err->cleared); } - nvdimm_bus_unlock(&nvdimm_bus->dev); + nvdimm_bus_unlock(dev); + device_unlock(dev);
if (copy_to_user(p, buf, buf_len)) rc = -EFAULT; @@ -1112,7 +1116,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm, return rc;
out_unlock: - nvdimm_bus_unlock(&nvdimm_bus->dev); + nvdimm_bus_unlock(dev); + device_unlock(dev); out: vfree(buf); return rc; diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index 4fed9ce9c2fe..a15276cdec7d 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -422,10 +422,12 @@ static ssize_t available_size_show(struct device *dev, * memory nvdimm_bus_lock() is dropped, but that's userspace's * problem to not race itself. */ + device_lock(dev); nvdimm_bus_lock(dev); wait_nvdimm_bus_probe_idle(dev); available = nd_region_available_dpa(nd_region); nvdimm_bus_unlock(dev); + device_unlock(dev);
return sprintf(buf, "%llu\n", available); } @@ -437,10 +439,12 @@ static ssize_t max_available_extent_show(struct device *dev, struct nd_region *nd_region = to_nd_region(dev); unsigned long long available = 0;
+ device_lock(dev); nvdimm_bus_lock(dev); wait_nvdimm_bus_probe_idle(dev); available = nd_region_allocatable_dpa(nd_region); nvdimm_bus_unlock(dev); + device_unlock(dev);
return sprintf(buf, "%llu\n", available); }
On 6/11/2019 4:25 PM, Dan Williams wrote:
The libnvdimm subsystem uses async operations to parallelize device probing operations and to allow sysfs to trigger device_unregister() on deleted namepsaces. A multithreaded stress test of the libnvdimm sysfs interface uncovered a case where device_unregister() is triggered multiple times, and the subsequent investigation uncovered a broken locking scenario.
The lack of lockdep coverage for device_lock() stymied the debug. That is, until patch6 "driver-core, libnvdimm: Let device subsystems add local lockdep coverage" solved that with a shadow lock, with lockdep coverage, to mirror device_lock() operations. Given the time saved with shadow-lock debug-hack, patch6 attempts to generalize device_lock() debug facility that might be able to be carried upstream. Patch6 is staged at the end of this fix series in case it is contentious and needs to be dropped.
Patch1 "drivers/base: Introduce kill_device()" could be achieved with local libnvdimm infrastructure. However, the existing 'dead' flag in 'struct device_private' aims to solve similar async register/unregister races so the fix in patch2 "libnvdimm/bus: Prevent duplicate device_unregister() calls" can be implemented with existing driver-core infrastructure.
Patch3 is a rare lockdep warning that is intermittent based on namespaces racing ahead of the completion of probe of their parent region. It is not related to the other fixes, it just happened to trigger as a result of the async stress test.
Patch4 and patch5 address an ABBA deadlock tripped by the stress test.
These patches pass the failing stress test and the existing libnvdimm unit tests with CONFIG_PROVE_LOCKING=y and the new "dev->lockdep_mutex" shadow lock with no lockdep warnings.
Dan Williams (6): drivers/base: Introduce kill_device() libnvdimm/bus: Prevent duplicate device_unregister() calls libnvdimm/region: Register badblocks before namespaces libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl() libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock driver-core, libnvdimm: Let device subsystems add local lockdep coverage
drivers/acpi/nfit/core.c | 28 ++++--- drivers/acpi/nfit/nfit.h | 24 ++++++ drivers/base/core.c | 30 ++++++-- drivers/nvdimm/btt_devs.c | 16 ++-- drivers/nvdimm/bus.c | 154 +++++++++++++++++++++++++++------------ drivers/nvdimm/core.c | 10 +-- drivers/nvdimm/dimm_devs.c | 4 + drivers/nvdimm/namespace_devs.c | 36 +++++---- drivers/nvdimm/nd-core.h | 71 ++++++++++++++++++ drivers/nvdimm/pfn_devs.c | 24 +++--- drivers/nvdimm/pmem.c | 4 + drivers/nvdimm/region.c | 24 +++--- drivers/nvdimm/region_devs.c | 12 ++- include/linux/device.h | 6 ++ 14 files changed, 308 insertions(+), 135 deletions(-)
Tested-by: Jane Chu jane.chu@oracle.com
Specifically, running parallel ndctls creating/destroying namespaces in multiple processes concurrently led to system panic, that has been verified fixed by this patch series.
Thanks! -jane
linux-stable-mirror@lists.linaro.org