Hello,
I have been profiling the 5.10 kernel against 4.14. On a system with 64 virtual CPUs and 256 GiB of RAM, I observe a significant drop in I/O performance. Running the attached fio script as "sudo ftest_write.sh <dev_name>", fio IOPS dropped from 22K to less than 1K.
The script simply mounts a 16 GiB ext4 volume (max IOPS 64000K) with the options "-o noatime,nodiratime,data=ordered", then runs fio with 2048 writer threads and a file size of 28800000 bytes: { --name=16kb_rand_write_only_2048_jobs --directory=/rdsdbdata1 --rw=randwrite --ioengine=sync --buffered=1 --bs=16k --max-jobs=2048 --numjobs=2048 --runtime=60 --time_based --thread --filesize=28800000 --fsync=1 --group_reporting }.
My analysis is that the degradation was introduced by commit {244adf6426ee31a83f397b700d964cff12a247d3} and that the issue is contention on rsv_conversion_wq. The simplest option is to increase the journal size, but that introduces more operational complexity. Another option is to apply the attached change, "allow more ext4-rsv-conversion workqueue.patch":
From 27e1b0e14275a281b3529f6a60c7b23a81356751 Mon Sep 17 00:00:00 2001
From: davinalu <davinalu(a)amazon.com>
Date: Fri, 23 Sep 2022 00:43:53 +0000
Subject: [PATCH] allow more ext4-rsv-conversion workqueue to speedup fio writing
---
fs/ext4/super.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a0af833f7da7..6b34298cdc3b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4963,7 +4963,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
* concurrency isn't really necessary. Limit it to 1.
*/
EXT4_SB(sb)->rsv_conversion_wq =
- alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+ alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM |
+ WQ_UNBOUND | __WQ_ORDERED, 0);
if (!EXT4_SB(sb)->rsv_conversion_wq) {
printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
ret = -ENOMEM;
My reasoning: based on alloc_workqueue(), max_active == 1 combined with WQ_UNBOUND implies "__WQ_ORDERED", so I added the flag explicitly.
I am not sure whether we need "__WQ_ORDERED" at all. Without it the change also appears to work on my testbed; I kept it because fio throughput on my testbed shows little difference with or without "__WQ_ORDERED".
From my understanding and observation: with dioread_nolock and delalloc both enabled, bio_endio() and ext4_writepages() queue work on this workqueue, which then runs ext4_do_flush_completed_IO(). The workqueue appears to process updates one at a time: in ext4's extents.c, the io_end->list_vec list holds only one io_end_vec each time. So if the block device is fast, having only one thread do the ext4 flush becomes a bottleneck. If I understand correctly, the "ext4-rsv-conversion" workqueue mainly updates EXT4_IO_END_UNWRITTEN extent blocks (which exist only when dioread_nolock and delalloc are set) and the extent status. Am I correct?
This works on my test system and passes xfstests, but could it cause corruption in ext4 extent block updates? I am also unsure about its effect on journal transaction updates.
Can you tell me what I would break if this change were made?
Thanks
Davina
commit 573ae4f13f630d6660008f1974c0a8a29c30e18a upstream.
With special lengths supplied by user space, tee_shm_register() has
an integer overflow when calculating the number of pages covered by a
supplied user space memory region.
This may cause pin_user_pages_fast() to do a NULL pointer dereference.
Fix this by adding an explicit call to access_ok() in
tee_ioctl_shm_register() to catch an invalid user space address early.
Fixes: 033ddf12bcf5 ("tee: add register user memory")
Cc: stable(a)vger.kernel.org # 5.4
Cc: stable(a)vger.kernel.org # 5.10
Reported-by: Nimish Mishra <neelam.nimish(a)gmail.com>
Reported-by: Anirban Chakraborty <ch.anirban00727(a)gmail.com>
Reported-by: Debdeep Mukhopadhyay <debdeep.mukhopadhyay(a)gmail.com>
Suggested-by: Jerome Forissier <jerome.forissier(a)linaro.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
[JW: backport to stable 5.4 and 5.10 + update commit message]
Signed-off-by: Jens Wiklander <jens.wiklander(a)linaro.org>
---
drivers/tee/tee_core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/tee/tee_core.c b/drivers/tee/tee_core.c
index a7ccd4d2bd10..2db144d2d26f 100644
--- a/drivers/tee/tee_core.c
+++ b/drivers/tee/tee_core.c
@@ -182,6 +182,9 @@ tee_ioctl_shm_register(struct tee_context *ctx,
if (data.flags)
return -EINVAL;
+ if (!access_ok((void __user *)(unsigned long)data.addr, data.length))
+ return -EFAULT;
+
shm = tee_shm_register(ctx, data.addr, data.length,
TEE_SHM_DMA_BUF | TEE_SHM_USER_MAPPED);
if (IS_ERR(shm))
--
2.31.1
If the starting position of the insert range falls in the hole between
two ext4_extent_idx entries, the lblk of every ext4_extent under the
previous ext4_extent_idx is less than start, so the "extent" variable
is advanced past the end of the leaf and the following UAF is
triggered:
==================================================================
BUG: KASAN: use-after-free in ext4_ext_shift_extents+0x257/0x790
Read of size 4 at addr ffff88819807a008 by task fallocate/8010
CPU: 3 PID: 8010 Comm: fallocate Tainted: G E 5.10.0+ #492
Call Trace:
dump_stack+0x7d/0xa3
print_address_description.constprop.0+0x1e/0x220
kasan_report.cold+0x67/0x7f
ext4_ext_shift_extents+0x257/0x790
ext4_insert_range+0x5b6/0x700
ext4_fallocate+0x39e/0x3d0
vfs_fallocate+0x26f/0x470
ksys_fallocate+0x3a/0x70
__x64_sys_fallocate+0x4f/0x60
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
==================================================================
For right shifts, we can divide them into the following situations:
 1. When the first ee_block of ext4_extent_idx is greater than or equal to
    start, make right shifts directly from the first ee_block.
    1) If it is greater than start, we need to continue searching in the
       previous ext4_extent_idx.
    2) If it is equal to start, we can exit the loop (iterator=NULL).
 2. When the first ee_block of ext4_extent_idx is less than start, then
    traverse from the last extent to find the first extent whose ee_block
    is less than start.
    1) If extent is still the last extent after traversal, it means that
       the last ee_block of ext4_extent_idx is less than start, that is,
       start is located in the hole between idx and (idx+1), so we can
       exit the loop directly (break) without right shifts.
    2) Otherwise, make right shifts at the corresponding position of the
       found extent, and then exit the loop (iterator=NULL).
Fixes: 331573febb6a ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: stable(a)vger.kernel.org # v4.2+
Signed-off-by: Zhihao Cheng <chengzhihao1(a)huawei.com>
Signed-off-by: Baokun Li <libaokun1(a)huawei.com>
---
V1->V2:
Initialize "ret" after the "again:" label to avoid return value mismatch.
Refactoring reduces cycles and makes code more readable.
fs/ext4/extents.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c148bb97b527..39c9f87de0be 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5179,6 +5179,7 @@ ext4_ext_shift_extents(struct inode *inode, handle_t *handle,
* and it is decreased till we reach start.
*/
again:
+ ret = 0;
if (SHIFT == SHIFT_LEFT)
iterator = &start;
else
@@ -5222,14 +5223,21 @@ ext4_ext_shift_extents(struct inode *inode, handle_t *handle,
ext4_ext_get_actual_len(extent);
} else {
extent = EXT_FIRST_EXTENT(path[depth].p_hdr);
- if (le32_to_cpu(extent->ee_block) > 0)
+ if (le32_to_cpu(extent->ee_block) > start)
*iterator = le32_to_cpu(extent->ee_block) - 1;
- else
- /* Beginning is reached, end of the loop */
+ else if (le32_to_cpu(extent->ee_block) == start)
iterator = NULL;
- /* Update path extent in case we need to stop */
- while (le32_to_cpu(extent->ee_block) < start)
+ else {
+ extent = EXT_LAST_EXTENT(path[depth].p_hdr);
+ while (le32_to_cpu(extent->ee_block) >= start)
+ extent--;
+
+ if (extent == EXT_LAST_EXTENT(path[depth].p_hdr))
+ break;
+
extent++;
+ iterator = NULL;
+ }
path[depth].p_ext = extent;
}
ret = ext4_ext_shift_path_extents(path, shift, inode,
--
2.31.1
Here are backports of the three patches that failed to apply to 5.15 due
to trivial context conflicts.
Hopefully they apply to the older stable trees as well as-is.
Note that the last patch depends on features that were not added until
5.9 as mentioned in the commit message. Note that the author of that
patch did not add a stable tag for this one, but backporting shouldn't
hurt.
Johan
Johan Hovold (3):
usb: dwc3: fix PHY disable sequence
usb: dwc3: qcom: fix use-after-free on runtime-PM wakeup
usb: dwc3: disable USB core PHY management
drivers/usb/dwc3/core.c | 19 ++++++++++---------
drivers/usb/dwc3/dwc3-qcom.c | 14 +++++++++++++-
drivers/usb/dwc3/host.c | 11 +++++++++++
3 files changed, 34 insertions(+), 10 deletions(-)
--
2.35.1
Hi
Here I'm submitting backports of patches
8238b4579866b7c1bb99883cfe102a43db5506ff and
d6ffe6067a54972564552ea45d320fb98db1ac5e to the stable branches.
Mikulas
This patch series backports a bunch of patches related to IRQ handling,
specifically freeing the IRQ line while an IRQ is in flight at the CPU
or at the hardware level.
Recently we saw this issue in the serial 8250 driver, where the IRQ was
being freed while it was in flight or not yet delivered to the CPU. As a
result the irqchip went into a wedged state and the IRQ was no longer
delivered to the CPU. These patches helped fix the issue in the 4.14
kernel.
Let us know if more patches need backporting.
Lukas Wunner (2):
genirq: Update code comments wrt recycled thread_mask
genirq: Synchronize only with single thread on free_irq()
Thomas Gleixner (4):
genirq: Delay deactivation in free_irq()
genirq: Fix misleading synchronize_irq() documentation
genirq: Add optional hardware synchronization for shutdown
x86/ioapic: Implement irq_get_irqchip_state() callback
arch/x86/kernel/apic/io_apic.c | 46 ++++++++++++++
kernel/irq/autoprobe.c | 6 +-
kernel/irq/chip.c | 6 ++
kernel/irq/cpuhotplug.c | 2 +-
kernel/irq/internals.h | 5 ++
kernel/irq/manage.c | 106 ++++++++++++++++++++++-----------
6 files changed, 133 insertions(+), 38 deletions(-)
--
2.37.1