The ATTRIBUTE_GROUPS is typically used to avoid boiler plate
code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was
long due on the zram driver, however a recent fix for sysfs allows
users of ATTRIBUTE_GROUPS to also associate a module to the group
attribute.
In zram's case this also means it allows us to fix a race which triggers
a deadlock on the zram driver. This deadlock happens when a sysfs attribute
use a lock also used on module removal. This happens when for instance a
sysfs file on a driver is used, then at the same time we have module
removal call trigger. The module removal call code holds a lock, and then
the sysfs file entry waits for the same lock. While holding the lock the
module removal tries to remove the sysfs entries, but these cannot be
removed yet as one is waiting for a lock. This won't complete as the lock
is already held. Likewise module removal cannot complete, and so we
deadlock.
Sysfs fixes this when the group attributes have a module associated to
it, sysfs will *try* to get a refcount to the module when a shared
lock is used, prior to mucking with a sysfs attribute. If this fails we
just give up right away.
This deadlock was first reported with the zram driver, a sketch of how
this can happen follows:
CPU A CPU B
whatever_store()
module_unload
mutex_lock(foo)
mutex_lock(foo)
del_gendisk(zram->disk);
device_del()
device_remove_groups()
In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.
This issue can be reproduced easily on the zram driver as follows:
Loop 1 on one terminal:
while true;
do modprobe zram;
modprobe -r zram;
done
Loop 2 on a second terminal:
while true; do
echo 1024 > /sys/block/zram0/disksize;
echo 1 > /sys/block/zram0/reset;
done
Without this patch we end up in a deadlock, and the following
stack trace is produced which hints to us what the issue was:
INFO: task bash:888 blocked for more than 120 seconds.
Tainted: G E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack: 0 pid: 888 ppid: 887 flags:<etc>
Call Trace:
__schedule+0x2e4/0x900
schedule+0x46/0xb0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.constprop.0+0x2c3/0x490
? _kstrtoull+0x35/0xd0
reset_store+0x6c/0x160 [zram]
kernfs_fop_write_iter+0x124/0x1b0
new_sync_write+0x11c/0x1b0
vfs_write+0x1c2/0x260
ksys_write+0x5f/0xe0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe state:D stack: 0 pid: 1104 ppid: 916 flags:<etc>
Call Trace:
__schedule+0x2e4/0x900
schedule+0x46/0xb0
__kernfs_remove.part.0+0x228/0x2b0
? finish_wait+0x80/0x80
kernfs_remove_by_name_ns+0x50/0x90
remove_files+0x2b/0x60
sysfs_remove_group+0x38/0x80
sysfs_remove_groups+0x29/0x40
device_remove_attrs+0x4a/0x80
device_del+0x183/0x3e0
? mutex_lock+0xe/0x30
del_gendisk+0x27a/0x2d0
zram_remove+0x8a/0xb0 [zram]
? hot_remove_store+0xf0/0xf0 [zram]
zram_remove_cb+0xd/0x10 [zram]
idr_for_each+0x5e/0xd0
destroy_devices+0x39/0x6f [zram]
__do_sys_delete_module+0x190/0x2a0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20
[0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
drivers/block/zram/zram_drv.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b26abcb955cc..60a55ae8cd91 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = {
NULL,
};
-static const struct attribute_group zram_disk_attr_group = {
- .attrs = zram_disk_attrs,
-};
-
-static const struct attribute_group *zram_disk_attr_groups[] = {
- &zram_disk_attr_group,
- NULL,
-};
+ATTRIBUTE_GROUPS(zram_disk);
/*
* Allocate and initialize new zram device. the function returns
@@ -1981,7 +1974,7 @@ static int zram_add(void)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
- device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+ device_add_disk(NULL, zram->disk, zram_disk_groups);
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
--
2.30.2
Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.
The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per
CPU. Although a compiled zram driver only supports a fixed set of
compression algorithms, each zram device gets a struct zcomp allocated
per CPU. The "multi" in CPU hotplug multstate refers to these per
cpu struct zcomp instances. Each of these will have the CPU hotplug
callback called for it on CPU plug / unplug. The kernel's CPU hotplug
multistate keeps a linked list of these different structures so that
it will iterate over them on CPU transitions.
By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure then set for the now default
lzo-rle comrpession algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset /
initialize existing zram devices, we can easily end up re-initializing
a struct zcomp for a zram device before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's CPU
hotplug will detect that at least one instance (struct zcomp for us)
exists. This can happen in the following situation:
CPU 1 CPU 2
disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case, that a struct zcom still exists, the driver allowed its
creation per CPU even though we could have just freed them per CPU
though a call on another CPU, and we are then later trying to remove the
hotplug callback.
Fix all this by providing a zram initialization boolean
protected the shared in the driver zram_index_mutex, which we
can use to annotate when sysfs attributes are safe to use or
not -- once the driver is properly initialized. When the driver
is going down we also are sure to not let userspace muck with
attributes which may affect each per cpu struct zcomp.
This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:
cd testcases/kernel/device-drivers/zram
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
You end up with a splat as follows:
kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0. <etc>
kernel: zram0: detected capacitky change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G \
E 5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS: 0010 DS: 0000 ES 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel: do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struct zcomp instances left.
[0] https://github.com/linux-test-project/ltp.git
Acked-by: Minchan Kim <minchan(a)kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+static bool zram_up;
+
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
comp = zram->comp;
disksize = zram->disksize;
zram->disksize = 0;
+ zram->comp = NULL;
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
struct zram *zram = dev_to_zram(dev);
int err;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ err = -ENODEV;
+ goto out;
+ }
+
disksize = memparse(buf, NULL);
- if (!disksize)
- return -EINVAL;
+ if (!disksize) {
+ err = -EINVAL;
+ goto out;
+ }
down_write(&zram->init_lock);
if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
+ mutex_unlock(&zram_index_mutex);
+
return len;
out_free_meta:
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
+out:
+ mutex_unlock(&zram_index_mutex);
return err;
}
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
if (ret)
return ret;
- if (!do_reset)
- return -EINVAL;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ len = -ENODEV;
+ goto out;
+ }
+
+ if (!do_reset) {
+ len = -EINVAL;
+ goto out;
+ }
zram = dev_to_zram(dev);
bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
/* Do not reset an active device or claimed device */
if (bdev->bd_openers || zram->claim) {
mutex_unlock(&bdev->bd_disk->open_mutex);
- return -EBUSY;
+ len = -EBUSY;
+ goto out;
}
/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
zram->claim = false;
mutex_unlock(&bdev->bd_disk->open_mutex);
+out:
+ mutex_unlock(&zram_index_mutex);
return len;
}
@@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
int ret;
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ mutex_unlock(&zram_index_mutex);
+ return -ENODEV;
+ }
ret = zram_add();
mutex_unlock(&zram_index_mutex);
@@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ ret = -ENODEV;
+ goto out;
+ }
+
zram = idr_find(&zram_index_idr, dev_id);
if (zram) {
ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
ret = -ENODEV;
}
+out:
mutex_unlock(&zram_index_mutex);
return ret ? ret : count;
}
@@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
static void destroy_devices(void)
{
+ mutex_lock(&zram_index_mutex);
+ zram_up = false;
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ mutex_unlock(&zram_index_mutex);
}
static int __init zram_init(void)
@@ -2105,15 +2146,21 @@ static int __init zram_init(void)
return -EBUSY;
}
+ mutex_lock(&zram_index_mutex);
+
while (num_devices != 0) {
- mutex_lock(&zram_index_mutex);
ret = zram_add();
- mutex_unlock(&zram_index_mutex);
- if (ret < 0)
+ if (ret < 0) {
+ mutex_unlock(&zram_index_mutex);
goto out_error;
+ }
num_devices--;
}
+ zram_up = true;
+
+ mutex_unlock(&zram_index_mutex);
+
return 0;
out_error:
--
2.30.2
Now that sysfs has the deadlock race fixed with module removal,
enable the deadlock tests module removal tests. They were left
disabled by default as otherwise you would deadlock your system
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
Running test: sysfs_test_0027 - run #0
Test for possible rmmod deadlock while writing x ... ok
./tools/testing/selftests/sysfs/sysfs.sh -t 0028
Running test: sysfs_test_0028 - run #0
Test for possible rmmod deadlock using rtnl_lock while writing x ... ok
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
tools/testing/selftests/sysfs/sysfs.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index f928635d0e35..4047ac48e764 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -60,8 +60,8 @@ ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
-ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
-ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0027:1:1:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:1:test_dev_x:block" # deadlock test with rntl_lock
ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
--
2.30.2
If one ends up expanding on this line checkpatch will complain that the
combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the
octal 0755. Do that.
This makes no functional changes.
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
fs/sysfs/dir.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..b6b6796e1616 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
kobject_get_ownership(kobj, &uid, &gid);
- kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
- S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+ kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
kobj, ns);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
--
2.30.2
If one ends up extending this line checkpatch will complain about the
use of S_IRWXUGO suggesting it is not preferred and that 0777
should be used instead. Take the tip from checkpatch and do that
change before we do our subsequent changes.
This makes no functional changes.
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
fs/kernfs/symlink.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index c8f8e41b8411..19a6c71c6ff5 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
gid = target->iattr->ia_gid;
}
- kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
- KERNFS_LINK);
+ kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
if (!kn)
return ERR_PTR(-ENOMEM);
--
2.30.2
There is quite a bit of tribal knowledge around proper use of
try_module_get() and that it must be used only in a context which
can ensure the module won't be gone during the operation. Document
this little bit of tribal knowledge.
I'm extending this tribal knowledge with new developments which it
seems some folks do not yet believe to be true: we can be sure a
module will exist during the lifetime of a sysfs file operation.
For proof, refer to test_sysfs test #32:
./tools/testing/selftests/sysfs/sysfs.sh -t 0032
Without this being true, the write would fail or worse,
a crash would happen, in this test. It does not.
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
to handle the error case (which only happens with rmmod --wait). */
extern void __module_get(struct module *module);
-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
+ *
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this later case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * gauranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
+ *
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures this the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
+ */
extern bool try_module_get(struct module *module);
+/**
+ * module_put() - release a reference count to a module
+ * @module: the module we should release a reference count for
+ *
+ * If you successfully bump a reference count to a module with try_module_get(),
+ * when you are finished you must call module_put() to release that reference
+ * count.
+ */
extern void module_put(struct module *module);
#else /*!CONFIG_MODULE_UNLOAD*/
--
2.30.2
This extends test_sysfs with support for using the failure injection
wait completion and knobs to force a few race conditions which
demonstrates that kernfs active reference protection is sufficient
for kobject / device protection at higher layers.
This adds 4 new tests which tries to remove the device attribute
store operation in 4 different situations:
1) at the start of kernfs_kernfs_fop_write_iter()
2) before the of->mutex is held in kernfs_kernfs_fop_write_iter()
3) after the of->mutex is held in kernfs_kernfs_fop_write_iter()
4) after the kernfs node active reference is taken
A write fails in call cases except the last one, test number #32. There
is a good explanation for this: *once* kernfs_get_active() gets called
we have a guarantee that the kernfs entry cannot be removed. If
kernfs_get_active() succeeds that entry cannot be removed and so
anything trying to remove that entry will have to wait. It is perhaps
not obvious but since a sysfs write will trigger eventually a
kernfs_get_active() call, and *only* if this succeeds will the sysfs
op be called, this and the fact that you cannot remove the kernfs
entry while the kenfs entry is active implies that a module that
created the respective sysfs / kernfs entry *cannot* possibly be
removed during a sysfs operation. And test number 32 provides us with
proof of this. If it were not true test #32 should crash.
No null dereferences are reproduced, even though this has been observed
in some complex testing cases [0]. If this issue really exists we should
have enough tools on the sysfs_test toolbox now to try to reproduce
this easily without having to poke around other drivers. It very likley
was the case that the issue reported [0] was possibly a side issue after
the first bug which was zram specific. This is why it is important to
isolate the issue and try to reproduce it in a generic form using the
test_sysfs driver.
[0] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
lib/Kconfig.debug | 3 +
lib/test_sysfs.c | 31 +++++
tools/testing/selftests/sysfs/config | 3 +
tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
4 files changed, 212 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a29b7d398c4e..176b822654e5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2358,6 +2358,9 @@ config TEST_SYSFS
depends on SYSFS
depends on NET
depends on BLOCK
+ select FAULT_INJECTION
+ select FAULT_INJECTION_DEBUG_FS
+ select FAIL_KERNFS_KNOBS
help
This builds the "test_sysfs" module. This driver enables to test the
sysfs file system safely without affecting production knobs which
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
index 273fc3f39740..391e0af2864a 100644
--- a/lib/test_sysfs.c
+++ b/lib/test_sysfs.c
@@ -38,6 +38,11 @@
#include <linux/rtnetlink.h>
#include <linux/genhd.h>
#include <linux/blkdev.h>
+#include <linux/kernfs.h>
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE);
+#endif
static bool enable_lock;
module_param(enable_lock, bool_enable_only, 0644);
@@ -82,6 +87,13 @@ static bool enable_verbose_rmmod;
module_param(enable_verbose_rmmod, bool_enable_only, 0644);
MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+static bool enable_completion_on_rmmod;
+module_param(enable_completion_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_completion_on_rmmod,
+ "enable sending a kernfs completion on rmmod");
+#endif
+
static int sysfs_test_major;
/**
@@ -289,6 +301,12 @@ static ssize_t config_show(struct device *dev,
"enable_verbose_writes:\t%s\n",
enable_verbose_writes ? "true" : "false");
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+ len += snprintf(buf+len, PAGE_SIZE - len,
+ "enable_completion_on_rmmod:\t%s\n",
+ enable_completion_on_rmmod ? "true" : "false");
+#endif
+
test_dev_config_unlock(test_dev);
return len;
@@ -926,10 +944,23 @@ static int __init test_sysfs_init(void)
}
module_init(test_sysfs_init);
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+/* The goal is to race our device removal with a pending kernfs -> store call */
+static void test_sysfs_kernfs_send_completion_rmmod(void)
+{
+ if (!enable_completion_on_rmmod)
+ return;
+ complete(&kernfs_debug_wait_completion);
+}
+#else
+static inline void test_sysfs_kernfs_send_completion_rmmod(void) {}
+#endif
+
static void __exit test_sysfs_exit(void)
{
if (enable_debugfs)
debugfs_remove(debugfs_dir);
+ test_sysfs_kernfs_send_completion_rmmod();
if (delay_rmmod_ms)
msleep(delay_rmmod_ms);
unregister_test_dev_sysfs(first_test_dev);
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
index 9196f452ecd5..2876a229f95b 100644
--- a/tools/testing/selftests/sysfs/config
+++ b/tools/testing/selftests/sysfs/config
@@ -1,2 +1,5 @@
CONFIG_SYSFS=m
CONFIG_TEST_SYSFS=m
+CONFIG_FAULT_INJECTION=y
+CONFIG_FAULT_INJECTION_DEBUG_FS=y
+CONFIG_FAIL_KERNFS_KNOBS=y
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index b3f4c2236c7f..f928635d0e35 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
+ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
+ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
+ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active
allow_user_defaults()
{
@@ -92,6 +96,9 @@ allow_user_defaults()
if [ -z $SYSFS_DEBUGFS_DIR ]; then
SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
fi
+ if [ -z $KERNFS_DEBUGFS_DIR ]; then
+ KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
+ fi
if [ -z $PAGE_SIZE ]; then
PAGE_SIZE=$(getconf PAGESIZE)
fi
@@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
unset FIRST_MODPROBE_ARGS
}
+modprobe_reset_enable_completion()
+{
+ FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
+ FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
+ modprobe_reset
+ unset FIRST_MODPROBE_ARGS
+}
+
load_req_mod()
{
modprobe_reset
@@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
}
+debugfs_kernfs_kernfs_fop_write_iter_exists()
+{
+ KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
+ if [[ ! -d $KNOB_DIR ]]; then
+ echo "kernfs debugfs does not exist $KNOB_DIR"
+ return 0;
+ fi
+ KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+ if [[ ! -d $KNOB_DEBUGFS ]]; then
+ echo -n "kernfs debugfs for coniguring fail_kernfs_fop_write_iter "
+ echo "does not exist $KNOB_DIR"
+ return 0;
+ fi
+ return 1
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
+{
+ KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+ echo 1 > $KNOB_DEBUGFS/interval
+ echo 100 > $KNOB_DEBUGFS/probability
+ echo 0 > $KNOB_DEBUGFS/space
+ # Disable verbose messages on the kernel ring buffer which may
+ # confuse developers with a kernel panic.
+ echo 0 > $KNOB_DEBUGFS/verbose
+
+ # Fail only once
+ echo 1 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
+{
+ KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+ echo 0 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_set_wait_ms()
+{
+ SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
+ echo $1 > $SLEEP_AFTER_WAIT_MS
+}
+
+debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
+{
+ ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
+ for KNOB in ${ENABLE_WAIT_KNOB}*; do
+ echo 0 > $KNOB
+ done
+}
+
+debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
+{
+ ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
+ echo -n "1" > $ENABLE_WAIT_KNOB
+ return $?
+}
+
set_orig()
{
if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
@@ -972,6 +1044,105 @@ sysfs_test_0028()
fi
}
+sysfs_race_kernfs_kernfs_fop_write_iter()
+{
+ TARGET="${DIR}/$(get_test_target $1)"
+ WAIT_AT=$2
+ EXPECT_WRITE_RETURNS=$3
+ MSDELAY=$4
+
+ modprobe_reset_enable_completion
+ ORIG=$(cat "${TARGET}")
+ TEST_STR=$(( $ORIG + 1 ))
+
+ echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
+
+ if debugfs_kernfs_kernfs_fop_write_iter_exists; then
+ echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
+ echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
+ return $ksft_skip
+ fi
+
+ # Allow for failing the kernfs_kernfs_fop_write_iter call once,
+ # we'll provide exact context shortly afterwards.
+ debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
+
+ # First disable all waits
+ debugfs_kernfs_disable_wait_kernfs_fop_write_iter
+
+ # Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
+ # specified location inside the kernfs_fop_write_iter() routine
+ debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
+
+ # Configure kernfs so that after its wait_for_completion() it
+ # will msleep() this amount of time and schedule(). We figure this
+ # will be sufficient time to allow for our module removal to complete.
+ debugfs_kernfs_set_wait_ms $MSDELAY
+
+ # Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
+ # but will wait until our driver sends a respective completion
+ set_test_ignore_errors &
+ write_pid=$!
+
+ # At this point kernfs_fop_write_iter() hasn't run our op, its
+ # waiting for our completion at the specified time $WAIT_AT.
+ # We now remove our module which will send a
+ # complete(&kernfs_debug_wait_completion) right before we deregister
+ # our device and the sysfs device attributes are removed.
+ #
+ # After the completion is sent, the test_sysfs driver races with
+ # kernfs to do the device deregistration with the kernfs msleep
+ # and schedule(). This should mean we've forced trying to remove the
+ # module prior to allowing kernfs to run our store operation. If the
+ # race did happen we'll panic with a null dereference on the store op.
+ #
+ # If no race happens we should see no write operation triggered.
+ modprobe -r $TEST_DRIVER > /dev/null 2>&1
+
+ debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
+
+ wait $write_pid
+ if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
+ echo "ok"
+ else
+ echo "FAIL" >&2
+ fi
+}
+
+sysfs_test_0029()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
+ done
+}
+
+sysfs_test_0030()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
+ done
+}
+
+sysfs_test_0031()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
+ done
+}
+
+# A write only succeeds *iff* a module removal happens *after* the
+# kernfs active reference is obtained with kernfs_get_active().
+sysfs_test_0032()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
+ done
+}
+
test_gen_desc()
{
echo -n "$1 x $(get_test_count $1)"
@@ -1013,6 +1184,10 @@ list_tests()
echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+ echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
+ echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
+ echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
+ echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
}
usage()
--
2.30.2
This adds initial failure injection support to kernfs. We start
off with debug knobs which when enabled allow test drivers, such as
test_sysfs, to then make use of these to try to force certain
difficult races to take place with a high degree of certainty.
This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
enabled in your kernel. If you don't have this enabled this provides
no new functional. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
routine kernfs_debug_should_wait() ends up being transformed to if
(false), and so the compiler should optimize these out as dead code
producing no new effective binary changes.
We start off with enabling failure injections in kernfs by allowing us to
alter the way kernfs_fop_write_iter() behaves. We allow for the routine
kernfs_fop_write_iter() to wait for a certain condition in the kernel to
occur, after which it will sleep a predefined amount of time. This lets
kernfs users to time exactly when it want kernfs_fop_write_iter() to
complete, allowing for developing race conditions and test for correctness
in kernfs.
You'd boot with this enabled on your kernel command line:
fail_kernfs_fop_write_iter=1,100,0,1
The values are <interval,probability,size,times>, we don't care for
size, so for now we ignore it. The above ensures a failure will trigger
only once.
*How* we allow for this routine to change behaviour is left to knobs we
expose under debugfs:
# ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
wait_after_active
wait_after_mutex
wait_at_start
wait_before_mutex
A debugfs entry also exists to allow us to sleep a configurabler amount
of time after the completion:
/sys/kernel/debug/kernfs/sleep_after_wait_ms
These two sets of knobs allow us to construct races and demonstrate
how the kernfs active reference should suffice to project against
races.
Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the
differnt fault injection parametres for the new fail_kernfs_fop_write_iter
fault injection at run time:
ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
interval
probability
space
task-filter
times
verbose
verbose_ratelimit_burst
verbose_ratelimit_interval_ms
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
.../fault-injection/fault-injection.rst | 22 +++++
MAINTAINERS | 2 +-
fs/kernfs/Makefile | 1 +
fs/kernfs/failure-injection.c | 91 +++++++++++++++++++
fs/kernfs/file.c | 13 +++
fs/kernfs/kernfs-internal.h | 72 +++++++++++++++
include/linux/kernfs.h | 5 +
lib/Kconfig.debug | 10 ++
8 files changed, 215 insertions(+), 1 deletion(-)
create mode 100644 fs/kernfs/failure-injection.c
diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
index 4a25c5eb6f07..d4d34b082f47 100644
--- a/Documentation/fault-injection/fault-injection.rst
+++ b/Documentation/fault-injection/fault-injection.rst
@@ -28,6 +28,28 @@ Available fault injection capabilities
injects kernel RPC client and server failures.
+- fail_kernfs_fop_write_iter
+
+ Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+ this does not immediately enable any errors to occur. You must configure
+ how you want this routine to fail or change behaviour by using the debugfs
+ knobs for it:
+
+ # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+ wait_after_active
+ wait_after_mutex
+ wait_at_start
+ wait_before_mutex
+
+ You can also configure how long to sleep after a wait under
+
+ /sys/kernel/debug/kernfs/sleep_after_wait_ms
+
+ If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_add_disk failure
+ injection parameters are placed under:
+
+ /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
+
- fail_make_request
injects disk IO errors on devices permitted by setting
diff --git a/MAINTAINERS b/MAINTAINERS
index 28a34384f541..acdbf91058d5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10341,7 +10341,7 @@ M: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
M: Tejun Heo <tj(a)kernel.org>
S: Supported
T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
-F: fs/kernfs/
+F: fs/kernfs/*
F: include/linux/kernfs.h
KEXEC
diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..bc5b32ca39f9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -4,3 +4,4 @@
#
obj-y := mount.o inode.o dir.o file.o symlink.o
+obj-$(CONFIG_FAIL_KERNFS_KNOBS) += failure-injection.o
diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
new file mode 100644
index 000000000000..4130d202c13b
--- /dev/null
+++ b/fs/kernfs/failure-injection.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/fault-inject.h>
+#include <linux/delay.h>
+
+#include "kernfs-internal.h"
+
+static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
+struct kernfs_config_fail kernfs_config_fail;
+
+#define kernfs_config_fail(when) \
+ kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+#define kernfs_config_fail(when) \
+ kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+static int __init setup_fail_kernfs_fop_write_iter(char *str)
+{
+ return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
+}
+
+__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
+
+struct dentry *kernfs_debugfs_root;
+struct dentry *config_fail_kernfs_fop_write_iter;
+
+static int __init kernfs_init_failure_injection(void)
+{
+ kernfs_config_fail.sleep_after_wait_ms = 100;
+ kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
+
+ fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
+ kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
+
+ config_fail_kernfs_fop_write_iter =
+ debugfs_create_dir("config_fail_kernfs_fop_write_iter",
+ kernfs_debugfs_root);
+
+ debugfs_create_u32("sleep_after_wait_ms", 0600,
+ kernfs_debugfs_root,
+ &kernfs_config_fail.sleep_after_wait_ms);
+
+ debugfs_create_bool("wait_at_start", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(at_start));
+ debugfs_create_bool("wait_before_mutex", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(before_mutex));
+ debugfs_create_bool("wait_after_mutex", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(after_mutex));
+ debugfs_create_bool("wait_after_active", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(after_active));
+ return 0;
+}
+late_initcall(kernfs_init_failure_injection);
+
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+ if (!evaluate)
+ return 0;
+
+ return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
+
+DECLARE_COMPLETION(kernfs_debug_wait_completion);
+EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+
+void kernfs_debug_wait(void)
+{
+ unsigned long timeout;
+
+ timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+ msecs_to_jiffies(3000));
+ if (!timeout)
+ pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+ __func__);
+ else
+ pr_info("%s received completion with time left on timeout %u ms\n",
+ __func__, jiffies_to_msecs(timeout));
+
+ /**
+ * The goal is wait for an event, and *then* once we have
+ * reached it, the other side will try to do something which
+ * it thinks will break. So we must give it some time to do
+ * that. The amount of time is configurable.
+ */
+ msleep(kernfs_config_fail.sleep_after_wait_ms);
+ pr_info("%s ended\n", __func__);
+}
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..4479c6580333 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
const struct kernfs_ops *ops;
char *buf;
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
+ kernfs_debug_wait();
+
if (of->atomic_write_len) {
if (len > of->atomic_write_len)
return -E2BIG;
@@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
}
buf[len] = '\0'; /* guarantee string termination */
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
+ kernfs_debug_wait();
+
/*
* @of->mutex nests outside active ref and is used both to ensure that
* the ops aren't called concurrently for the same open file.
*/
mutex_lock(&of->mutex);
+
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
+ kernfs_debug_wait();
+
if (!kernfs_get_active(of->kn)) {
mutex_unlock(&of->mutex);
len = -ENODEV;
goto out_free;
}
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
+ kernfs_debug_wait();
+
ops = kernfs_ops(of->kn);
if (ops->write)
len = ops->write(of, buf, len, iocb->ki_pos);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1..9e3abf597e2d 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -18,6 +18,7 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
+#include <linux/stringify.h>
struct kernfs_iattrs {
kuid_t ia_uid;
@@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
*/
extern const struct inode_operations kernfs_symlink_iops;
+/*
+ * failure-injection.c
+ */
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+
+/**
+ * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter_fail fails
+ *
+ * This lets you configure what part of kernfs_fop_write_iter() should behave
+ * in a specific way to allow userspace to capture possible failures in
+ * kernfs. The wait knobs are allowed to let you design capture possible
+ * race conditions which would otherwise be difficult to reproduce. A
+ * secondary driver would tell kernfs's wait completion when it is done.
+ *
+ * The point to the wait completion failure injection tests are to confirm
+ * that the kernfs active refcount suffice to ensure other objects in other
+ * layers are also gauranteed to exist, even they are opaque to kernfs. This
+ * includes kobjects, devices, and other objects built on top of this, like
+ * the block layer when using sysfs block device attributes.
+ *
+ * @wait_at_start: waits for completion from a third party at the start of
+ * the routine.
+ * @wait_before_mutex: waits for completion from a third party before we
+ * are allowed to continue before the of->mutex is held.
+ * @wait_after_mutex: waits for completion from a third party after we
+ * have held the of->mutex.
+ * @wait_after_active: waits for completion from a thid party after we
+ * have refcounted the struct kernfs_node.
+ */
+struct kernfs_fop_write_iter_fail {
+ bool wait_at_start;
+ bool wait_before_mutex;
+ bool wait_after_mutex;
+ bool wait_after_active;
+};
+
+/**
+ * struct kernfs_config_fail - kernfs configuration for failure injection
+ *
+ * You can kernfs failure injection on boot, and in particular we currently
+ * only support failures for kernfs_fop_write_iter(). However, we don't
+ * want to always enable errors on this call when failure injection is enabled
+ * as this routine is used by many parts of the kernel for proper functionality.
+ * The compromise we make is we let userspace start enabling which parts it
+ * wants to fail after boot, if and only if failure injection has been enabled.
+ *
+ * @kernfs_fop_write_iter_fail: configuration for how we want to allow
+ * for failure injection on kernfs_fop_write_iter()
+ * @sleep_after_wait_ms: how many ms to wait after completion is received.
+ */
+struct kernfs_config_fail {
+ struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
+ u32 sleep_after_wait_ms;
+};
+
+extern struct kernfs_config_fail kernfs_config_fail;
+
+#define __kernfs_config_wait_var(func, when) \
+ (kernfs_config_fail. func ## _fail.wait_ ## when)
+#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
+
+#define kernfs_debug_should_wait(func, when) \
+ __kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate);
+void kernfs_debug_wait(void);
+#else
+static inline void kernfs_init_failure_injection(void) {}
+#define kernfs_debug_should_wait(func, when) (false)
+static inline void kernfs_debug_wait(void) {}
+#endif
+
#endif /* __KERNFS_INTERNAL_H */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3ccce6f24548..cd968ee2b503 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -411,6 +411,11 @@ void kernfs_init(void);
struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
u64 id);
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+extern struct completion kernfs_debug_wait_completion;
+#endif
+
#else /* CONFIG_KERNFS */
static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ae19bf1a21b8..a29b7d398c4e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY
Provides fault-injection capability to inject failures
in usercopy functions (copy_from_user(), get_user(), ...).
+config FAIL_KERNFS_KNOBS
+ bool "Fault-injection support in kernfs"
+ depends on FAULT_INJECTION
+ help
+ Provide fault-injection capability for kernfs. This only enables
+ the error injection functionality. To use it you must configure which
+ which path you want to trigger on error on using debugfs under
+ /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
+ default all of these are disabled.
+
config FAIL_MAKE_REQUEST
bool "Fault-injection capability for disk IO"
depends on FAULT_INJECTION && BLOCK
--
2.30.2
Two selftests drivers exist under the copyleft-next license.
These drivers were added prior to SPDX practice taking full swing
in the kernel. Now that we have an SPDX tag for copylef-next-0.3.1
documented, embrace it and remove the boiler plate.
Cc: Goldwyn Rodrigues <rgoldwyn(a)suse.com>
Cc: Kuno Woudt <kuno(a)frob.nl>
Cc: Richard Fontana <fontana(a)sharpeleven.org>
Cc: copyleft-next(a)lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell(a)suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo(a)suse.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Thorsten Leemhuis <linux(a)leemhuis.info>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
lib/test_kmod.c | 12 +-----------
lib/test_sysctl.c | 12 +-----------
tools/testing/selftests/kmod/kmod.sh | 13 +------------
tools/testing/selftests/sysctl/sysctl.sh | 12 +-----------
4 files changed, 4 insertions(+), 45 deletions(-)
diff --git a/lib/test_kmod.c b/lib/test_kmod.c
index ce1589391413..d62afd89dc63 100644
--- a/lib/test_kmod.c
+++ b/lib/test_kmod.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
/*
* kmod stress test driver
*
* Copyright (C) 2017 Luis R. Rodriguez <mcgrof(a)kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3750323973f4..9e5bd10a930a 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
/*
* proc sysctl test driver
*
* Copyright (C) 2017 Luis R. Rodriguez <mcgrof(a)kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
*/
/*
diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh
index afd42387e8b2..7189715d7960 100755
--- a/tools/testing/selftests/kmod/kmod.sh
+++ b/tools/testing/selftests/kmod/kmod.sh
@@ -1,18 +1,7 @@
#!/bin/bash
-#
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
# Copyright (C) 2017 Luis R. Rodriguez <mcgrof(a)kernel.org>
#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
-
# This is a stress test script for kmod, the kernel module loader. It uses
# test_kmod which exposes a series of knobs for the API for us so we can
# tweak each test in userspace rather than in kernelspace.
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index 19515dcb7d04..2046c603a4d4 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -1,16 +1,6 @@
#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
# Copyright (C) 2017 Luis R. Rodriguez <mcgrof(a)kernel.org>
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
# This performs a series tests against the proc sysctl interface.
--
2.30.2