From: David Woodhouse <dwmw@amazon.co.uk>
[ Upstream commit 6d712b9b3a58018259fb40ddd498d1f7dfa1f4ec ]
Commit dce1ca0525bf ("sched/scs: Reset task stack state in bringup_cpu()") ensured that the shadow call stack and KASAN poisoning were removed from a CPU's stack each time that CPU is brought up, not just once.
This is not incorrect. However, with parallel bringup the idle thread setup will happen at a different step. As a consequence the cleanup in bringup_cpu() would be too late.
Move the SCS/KASAN cleanup to the generic _cpu_up() function instead, which already ensures that the new CPU's stack is available, purely to allow for early failure. This occurs when the CPU to be brought up is in the CPUHP_OFFLINE state, which correctly covers the cleanup any time the CPU has been taken down far enough that it is needed.
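For orientation, a condensed sketch of where the reset now lives in _cpu_up() (abridged and paraphrased from the commit message and the diff below, not verbatim upstream code):

	if (st->state == CPUHP_OFFLINE) {
		/* Fetch the idle thread early so bringup can fail cleanly. */
		idle = idle_thread_get(cpu);
		if (IS_ERR(idle)) {
			ret = PTR_ERR(idle);
			goto out;
		}

		/* Reset stale stack state from the last time this CPU was online. */
		scs_task_reset(idle);
		kasan_unpoison_task_stack(idle);
	}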
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.027075560@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/cpu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 008b50da22246..705d330600485 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -553,12 +553,6 @@ static int bringup_cpu(unsigned int cpu)
 	struct task_struct *idle = idle_thread_get(cpu);
 	int ret;
 
-	/*
-	 * Reset stale stack state from the last time this CPU was online.
-	 */
-	scs_task_reset(idle);
-	kasan_unpoison_task_stack(idle);
-
 	/*
 	 * Some architectures have to walk the irq descriptors to
 	 * setup the vector space for the cpu which comes online.
@@ -1274,6 +1268,12 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
 			ret = PTR_ERR(idle);
 			goto out;
 		}
+
+		/*
+		 * Reset stale stack state from the last time this CPU was online.
+		 */
+		scs_task_reset(idle);
+		kasan_unpoison_task_stack(idle);
 	}
 
 	cpuhp_tasks_frozen = tasks_frozen;
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[ Upstream commit 8b64d420fe2450f82848178506d3e3a0bd195539 ]
syzbot is reporting a false positive ODEBUG message immediately after ODEBUG was disabled due to OOM.
[ 1062.309646][T22911] ODEBUG: Out of memory. ODEBUG disabled
[ 1062.886755][ T5171] ------------[ cut here ]------------
[ 1062.892770][ T5171] ODEBUG: assert_init not available (active state 0) object: ffffc900056afb20 object type: timer_list hint: process_timeout+0x0/0x40
CPU 0 [ T5171]                          CPU 1 [T22911]
--------------                          --------------
debug_object_assert_init() {
  if (!debug_objects_enabled)
    return;
  db = get_bucket(addr);
                                        lookup_object_or_alloc() {
                                          debug_objects_enabled = 0;
                                          return NULL;
                                        }
                                        debug_objects_oom() {
                                          pr_warn("Out of memory. ODEBUG disabled\n");
                                          // all buckets get emptied here, and
                                        }
  lookup_object_or_alloc(addr, db, descr, false, true) {
    // this bucket is already empty.
    return ERR_PTR(-ENOENT);
  }
  // Emits false positive warning.
  debug_print_object(&o, "assert_init");
}
Recheck debug_objects_enabled in debug_print_object() to avoid that.
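Condensed from the diff below, the added guard at the top of debug_print_object() is roughly:

	static void debug_print_object(struct debug_obj *obj, char *msg)
	{
		/*
		 * A concurrent debug_objects_oom() may have disabled ODEBUG and
		 * emptied the hash buckets, so a failed lookup by this thread is
		 * expected and must not be reported.
		 */
		if (!debug_objects_enabled)
			return;

		/* ... existing rate-limited reporting follows ... */
	}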
Reported-by: syzbot <syzbot+7937ba6a50bdd00fffdf@syzkaller.appspotmail.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/492fe2ae-5141-d548-ebd5-62f5fe2e57f7@I-love.SAKURA...
Closes: https://syzkaller.appspot.com/bug?extid=7937ba6a50bdd00fffdf
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 lib/debugobjects.c | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index 4c39678c03ee5..4dd9283f6fea0 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -501,6 +501,15 @@ static void debug_print_object(struct debug_obj *obj, char *msg)
 	const struct debug_obj_descr *descr = obj->descr;
 	static int limit;
 
+	/*
+	 * Don't report if lookup_object_or_alloc() by the current thread
+	 * failed because lookup_object_or_alloc()/debug_objects_oom() by a
+	 * concurrent thread turned off debug_objects_enabled and cleared
+	 * the hash buckets.
+	 */
+	if (!debug_objects_enabled)
+		return;
+
 	if (limit < 5 && descr != descr_test) {
 		void *hint = descr->debug_hint ?
 			descr->debug_hint(obj->object) : NULL;
From: Zhong Jinghua <zhongjinghua@huawei.com>
[ Upstream commit f12bc113ce904777fd6ca003b473b427782b3dde ]
If the index allocated by idr_alloc is greater than MINORMASK >> part_shift, the device number will overflow, resulting in a failure to create the block device.
Fix it by limiting the size of the max allocation.
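To illustrate the overflow, a minimal userspace sketch of the minor-number arithmetic, assuming the usual MINORBITS = 20 from include/linux/kdev_t.h and nbd's first_minor = index << part_shift mapping (the part_shift value below is hypothetical):

	#include <stdio.h>

	#define MINORBITS 20
	#define MINORMASK ((1U << MINORBITS) - 1)

	int main(void)
	{
		unsigned int part_shift = 5;	/* hypothetical partition shift */
		unsigned int max_index = MINORMASK >> part_shift;

		/* first_minor = index << part_shift must stay within MINORMASK */
		printf("index %u -> first_minor %u (ok)\n",
		       max_index, max_index << part_shift);
		printf("index %u -> first_minor %u (> MINORMASK %u, overflow)\n",
		       max_index + 1, (max_index + 1) << part_shift, MINORMASK);
		return 0;
	}

With the fix, idr_alloc() is bounded by (MINORMASK >> part_shift) + 1 (the end argument is exclusive), so such an index can no longer be handed out.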
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230605122159.2134384-1-zhongjinghua@huaweicloud....
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/block/nbd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b6940f0a9c905..e0f805ca0e727 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1723,7 +1723,8 @@ static int nbd_dev_add(int index)
 		if (err == -ENOSPC)
 			err = -EEXIST;
 	} else {
-		err = idr_alloc(&nbd_index_idr, nbd, 0, 0, GFP_KERNEL);
+		err = idr_alloc(&nbd_index_idr, nbd, 0,
+				(MINORMASK >> part_shift) + 1, GFP_KERNEL);
 		if (err >= 0)
 			index = err;
 	}
From: Yu Kuai <yukuai3@huawei.com>
[ Upstream commit 873f50ece41aad5c4f788a340960c53774b5526e ]
Currently, if reshape is interrupted, echoing "reshape" to sync_action will restart reshape from scratch, for example:
echo frozen > sync_action
echo reshape > sync_action
This will corrupt data before reshape_position if the array is growing. Fix the problem by continuing reshape from reshape_position.
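Condensed from the diff below, the resulting decision in action_store() is roughly:

	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
		err = -EBUSY;
	} else if (mddev->reshape_position == MaxSector ||
		   mddev->pers->check_reshape == NULL ||
		   mddev->pers->check_reshape(mddev)) {
		/* No interrupted reshape to resume: start a new one. */
		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
		err = mddev->pers->start_reshape(mddev);
	} else {
		/* Reshape in progress: let md_check_recovery() continue it. */
		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
	}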
Reported-by: Peter Neuwirth <reddunur@online.de>
Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@onli...
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-3-yukuai1@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/md/md.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1553c2495841b..9e0060278a9c3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4887,11 +4887,21 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 			return -EINVAL;
 		err = mddev_lock(mddev);
 		if (!err) {
-			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
 				err = -EBUSY;
-			else {
+			} else if (mddev->reshape_position == MaxSector ||
+				   mddev->pers->check_reshape == NULL ||
+				   mddev->pers->check_reshape(mddev)) {
 				clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 				err = mddev->pers->start_reshape(mddev);
+			} else {
+				/*
+				 * If reshape is still in progress, and
+				 * md_check_recovery() can continue to reshape,
+				 * don't restart reshape because data can be
+				 * corrupted for raid456.
+				 */
+				clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 			}
 			mddev_unlock(mddev);
 		}
From: Yu Kuai <yukuai3@huawei.com>
[ Upstream commit 010444623e7f4da6b4a4dd603a7da7469981e293 ]
Currently, there is no limit on raid1/raid10 plugged bios. While flushing writes, raid1 has cond_resched() while raid10 doesn't, and too many writes can cause a soft lockup.
The following soft lockup can be triggered easily with a writeback test for raid10 on ramdisks:
watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293]
Call Trace:
 <TASK>
 call_rcu+0x16/0x20
 put_object+0x41/0x80
 __delete_object+0x50/0x90
 delete_object_full+0x2b/0x40
 kmemleak_free+0x46/0xa0
 slab_free_freelist_hook.constprop.0+0xed/0x1a0
 kmem_cache_free+0xfd/0x300
 mempool_free_slab+0x1f/0x30
 mempool_free+0x3a/0x100
 bio_free+0x59/0x80
 bio_put+0xcf/0x2c0
 free_r10bio+0xbf/0xf0
 raid_end_bio_io+0x78/0xb0
 one_write_done+0x8a/0xa0
 raid10_end_write_request+0x1b4/0x430
 bio_endio+0x175/0x320
 brd_submit_bio+0x3b9/0x9b7 [brd]
 __submit_bio+0x69/0xe0
 submit_bio_noacct_nocheck+0x1e6/0x5a0
 submit_bio_noacct+0x38c/0x7e0
 flush_pending_writes+0xf0/0x240
 raid10d+0xac/0x1ed0
Fix the problem by adding cond_resched() to raid10, like raid1 does.
Note that unlimited plugged bios still need to be optimized: for example, when a lot of dirty pages are written back, this takes a lot of memory and I/O spends a long time in the plug, hence I/O latency is bad.
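Condensed from the diff below (the surrounding conditionals omitted), both flush loops now yield between submissions:

	while (bio) {	/* plugged writes in flush_pending_writes()/raid10_unplug() */
		struct bio *next = bio->bi_next;

		bio->bi_next = NULL;
		submit_bio_noacct(bio);
		bio = next;
		cond_resched();	/* yield so a long chain cannot soft lockup the CPU */
	}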
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-2-yukuai1@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/md/raid10.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6a0459f9fafbc..bb25f63ae6964 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -894,6 +894,7 @@ static void flush_pending_writes(struct r10conf *conf)
 			else
 				submit_bio_noacct(bio);
 			bio = next;
+			cond_resched();
 		}
 		blk_finish_plug(&plug);
 	} else
@@ -1087,6 +1088,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 		else
 			submit_bio_noacct(bio);
 		bio = next;
+		cond_resched();
 	}
 	kfree(plug);
 }
From: Thomas Gleixner <tglx@linutronix.de>
[ Upstream commit 8ce8849dd1e78dadcee0ec9acbd259d239b7069f ]
posix_timer_add() tries to allocate a posix timer ID by starting from the cached ID which was stored by the last successful allocation.
This is done in a loop searching the ID space for a free slot one by one. The loop has to terminate when the search wrapped around to the starting point.
But that's racy vs. establishing the starting point. That is read out lockless, which leads to the following problem:
CPU0                                    CPU1
posix_timer_add()
  start = sig->posix_timer_id;
  lock(hash_lock);
  ...                                   posix_timer_add()
  if (++sig->posix_timer_id < 0)
                                          start = sig->posix_timer_id;
      sig->posix_timer_id = 0;
So CPU1 can observe a negative start value, i.e. -1, and the loop break never happens because the condition can never be true:
	if (sig->posix_timer_id == start)
		break;
While this is unlikely to ever turn into an endless loop as the ID space is huge (INT_MAX), the racy read of the start value caught the attention of KCSAN and Dmitry unearthed that incorrectness.
Rewrite it so that all id operations are under the hash lock.
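The new allocator keeps the cached next ID in the non-negative range by clamping with (id + 1) & INT_MAX, so the wrap from INT_MAX back to 0 happens without ever producing a negative value. A tiny standalone check of that arithmetic (illustration only, not kernel code):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int id = INT_MAX;

		/* Same clamp as sig->next_posix_timer_id = (id + 1) & INT_MAX */
		printf("%u -> %u\n", id, (id + 1) & INT_MAX);	/* 2147483647 -> 0 */
		printf("%u -> %u\n", 41u, (41u + 1) & INT_MAX);	/* 41 -> 42 */
		return 0;
	}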
Reported-by: syzbot <syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87bkhzdn6g.ffs@tglx
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/sched/signal.h |  2 +-
 kernel/time/posix-timers.c   | 31 ++++++++++++++++++-------------
 2 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index ae60f838ebb92..2c634010cc7bd 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -125,7 +125,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
-	int			posix_timer_id;
+	unsigned int		next_posix_timer_id;
 	struct list_head	posix_timers;
 
 	/* ITIMER_REAL timer for the process */
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index d089627f2f2b4..1bbc4b513c793 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -140,25 +140,30 @@ static struct k_itimer *posix_timer_by_id(timer_t id)
 static int posix_timer_add(struct k_itimer *timer)
 {
 	struct signal_struct *sig = current->signal;
-	int first_free_id = sig->posix_timer_id;
 	struct hlist_head *head;
-	int ret = -ENOENT;
+	unsigned int cnt, id;
 
-	do {
+	/*
+	 * FIXME: Replace this by a per signal struct xarray once there is
+	 * a plan to handle the resulting CRIU regression gracefully.
+	 */
+	for (cnt = 0; cnt <= INT_MAX; cnt++) {
 		spin_lock(&hash_lock);
-		head = &posix_timers_hashtable[hash(sig, sig->posix_timer_id)];
-		if (!__posix_timers_find(head, sig, sig->posix_timer_id)) {
+		id = sig->next_posix_timer_id;
+
+		/* Write the next ID back. Clamp it to the positive space */
+		sig->next_posix_timer_id = (id + 1) & INT_MAX;
+
+		head = &posix_timers_hashtable[hash(sig, id)];
+		if (!__posix_timers_find(head, sig, id)) {
 			hlist_add_head_rcu(&timer->t_hash, head);
-			ret = sig->posix_timer_id;
+			spin_unlock(&hash_lock);
+			return id;
 		}
-		if (++sig->posix_timer_id < 0)
-			sig->posix_timer_id = 0;
-		if ((sig->posix_timer_id == first_free_id) && (ret == -ENOENT))
-			/* Loop over all possible ids completed */
-			ret = -EAGAIN;
 		spin_unlock(&hash_lock);
-	} while (ret == -ENOENT);
-	return ret;
+	}
+	/* POSIX return code when no timer ID could be allocated */
+	return -EAGAIN;
 }
 
static inline void unlock_timer(struct k_itimer *timr, unsigned long flags)
From: David Sterba <dsterba@suse.com>
[ Upstream commit efcfcbc6a36195c42d98e0ee697baba36da94dc8 ]
The implementation of XXHASH is now CPU only but still fast enough to be considered for the synchronous checksumming, like non-generic crc32c.
A userspace benchmark comparing it to various implementations (patched hash-speedtest from btrfs-progs):
Block size:     4096
Iterations:     1000000
Implementation: builtin
Units:          CPU cycles
NULL-NOP:       cycles:    73384294, cycles/i    73
NULL-MEMCPY:    cycles:   228033868, cycles/i   228,  61664.320 MiB/s
CRC32C-ref:     cycles: 24758559416, cycles/i 24758,    567.950 MiB/s
CRC32C-NI:      cycles:  1194350470, cycles/i  1194,  11773.433 MiB/s
CRC32C-ADLERSW: cycles:  6150186216, cycles/i  6150,   2286.372 MiB/s
CRC32C-ADLERHW: cycles:   626979180, cycles/i   626,  22427.453 MiB/s
CRC32C-PCL:     cycles:   466746732, cycles/i   466,  30126.699 MiB/s
XXHASH:         cycles:   860656400, cycles/i   860,  16338.188 MiB/s
This compares the purely software implementation (ref), the current but outdated acceleration using the crc32q instruction (NI), optimized implementations by M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in...), and the best one, taken from the kernel, using the PCLMULQDQ instruction (PCL).
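Condensed from the diff below, btrfs_init_csum_hash() now flags xxhash as a fast implementation unconditionally, while crc32c remains gated on a non-generic (accelerated) driver:

	switch (csum_type) {
	case BTRFS_CSUM_TYPE_CRC32C:
		if (!strstr(crypto_shash_driver_name(csum_shash), "generic"))
			set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);
		break;
	case BTRFS_CSUM_TYPE_XXHASH:
		/* Software only, but fast enough for synchronous checksumming. */
		set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);
		break;
	default:
		break;
	}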
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/disk-io.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5a114cad988a6..608b939a4d287 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2256,6 +2256,9 @@ static int btrfs_init_csum_hash(struct btrfs_fs_info *fs_info, u16 csum_type)
 		if (!strstr(crypto_shash_driver_name(csum_shash), "generic"))
 			set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);
 		break;
+	case BTRFS_CSUM_TYPE_XXHASH:
+		set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);
+		break;
 	default:
 		break;
 	}