The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7143b5ac5750f404ff3a594b34fdf3fc2f99f828 Mon Sep 17 00:00:00 2001
From: Stefano Garzarella <sgarzare@redhat.com>
Date: Fri, 21 Feb 2020 16:42:16 +0100
Subject: [PATCH] io_uring: prevent sq_thread from spinning when it should stop
This patch drops 'cur_mm' before calling cond_resched(), to prevent
the sq_thread from spinning even when the user process is finished.
Before this patch, if the user process ended without closing the
io_uring fd, the sq_thread kept spinning until the 'sq_thread_idle'
timeout expired.
In the worst case, where the 'sq_thread_idle' parameter is larger than
INT_MAX, the sq_thread will spin forever.
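For context, a minimal userspace sketch of the setup under which this
matters, assuming liburing is installed and the kernel permits
IORING_SETUP_SQPOLL for the caller; the queue depth and idle value are
illustrative only and not part of the patch:

  /* Illustrative reproducer sketch, not part of the patch. */
  #include <liburing.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_params p;
          int ret;

          memset(&p, 0, sizeof(p));
          p.flags = IORING_SETUP_SQPOLL;
          p.sq_thread_idle = 0xffffffff;  /* milliseconds; > INT_MAX, the worst case above */

          ret = io_uring_queue_init_params(8, &ring, &p);
          if (ret < 0) {
                  fprintf(stderr, "io_uring_queue_init_params: %s\n", strerror(-ret));
                  return 1;
          }

          /* Exit without io_uring_queue_exit(&ring); before this patch the
           * kernel-side sq_thread could keep spinning after the process is
           * gone, for the whole sq_thread_idle period. */
          return 0;
  }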
Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6e249aa97ba3..b43467b3a8dc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5142,6 +5142,18 @@ static int io_sq_thread(void *data)
* to enter the kernel to reap and flush events.
*/
if (!to_submit || ret == -EBUSY) {
+ /*
+ * Drop cur_mm before scheduling, we can't hold it for
+ * long periods (or over schedule()). Do this before
+ * adding ourselves to the waitqueue, as the unuse/drop
+ * may sleep.
+ */
+ if (cur_mm) {
+ unuse_mm(cur_mm);
+ mmput(cur_mm);
+ cur_mm = NULL;
+ }
+
/*
* We're polling. If we're within the defined idle
* period, then let us spin without work before going
@@ -5156,18 +5168,6 @@ static int io_sq_thread(void *data)
continue;
}
- /*
- * Drop cur_mm before scheduling, we can't hold it for
- * long periods (or over schedule()). Do this before
- * adding ourselves to the waitqueue, as the unuse/drop
- * may sleep.
- */
- if (cur_mm) {
- unuse_mm(cur_mm);
- mmput(cur_mm);
- cur_mm = NULL;
- }
-
prepare_to_wait(&ctx->sqo_wait, &wait,
TASK_INTERRUPTIBLE);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 305e519ce48e935702c32241f07d393c3c8fed3e Mon Sep 17 00:00:00 2001
From: Alexander Potapenko <glider@google.com>
Date: Thu, 20 Feb 2020 20:04:30 -0800
Subject: [PATCH] lib/stackdepot.c: fix global out-of-bounds in stack_slabs
Walter Wu has reported a potential case in which init_stack_slab() is
called after stack_slabs[STACK_ALLOC_MAX_SLABS - 1] has already been
initialized. In that case init_stack_slab() will overwrite
stack_slabs[STACK_ALLOC_MAX_SLABS], which may result in memory
corruption.
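A small standalone model of the fixed logic, with MAX_SLABS standing in
for STACK_ALLOC_MAX_SLABS; this is only an illustration of the bounds
check added below, not kernel code:

  /* Simplified model of init_stack_slab() after the fix; not kernel code. */
  #include <stddef.h>
  #include <stdio.h>

  #define MAX_SLABS 4                    /* stands in for STACK_ALLOC_MAX_SLABS */

  static void *slabs[MAX_SLABS];
  static int depot_index;
  static int next_slab_inited;

  static int init_slab(void **prealloc)
  {
          if (next_slab_inited)
                  return 1;
          if (slabs[depot_index] == NULL) {
                  slabs[depot_index] = *prealloc;
                  *prealloc = NULL;
          } else {
                  /* The fix: if this is the last slab, do not touch slabs[MAX_SLABS]. */
                  if (depot_index + 1 < MAX_SLABS) {
                          slabs[depot_index + 1] = *prealloc;
                          *prealloc = NULL;
                  }
                  next_slab_inited = 1;
          }
          return 1;
  }

  int main(void)
  {
          static char buf[32];
          void *prealloc = buf;

          depot_index = MAX_SLABS - 1;   /* last slab already in use ... */
          slabs[depot_index] = buf;
          init_slab(&prealloc);          /* ... so this must not write past the array */
          printf("prealloc kept for later: %s\n", prealloc ? "yes" : "no");
          return 0;
  }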
Link: http://lkml.kernel.org/r/20200218102950.260263-1-glider@google.com
Fixes: cd11016e5f521 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index ed717dd08ff3..81c69c08d1d1 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -83,15 +83,19 @@ static bool init_stack_slab(void **prealloc)
return true;
if (stack_slabs[depot_index] == NULL) {
stack_slabs[depot_index] = *prealloc;
+ *prealloc = NULL;
} else {
- stack_slabs[depot_index + 1] = *prealloc;
+ /* If this is the last depot slab, do not touch the next one. */
+ if (depot_index + 1 < STACK_ALLOC_MAX_SLABS) {
+ stack_slabs[depot_index + 1] = *prealloc;
+ *prealloc = NULL;
+ }
/*
* This smp_store_release pairs with smp_load_acquire() from
* |next_slab_inited| above and in stack_depot_save().
*/
smp_store_release(&next_slab_inited, 1);
}
- *prealloc = NULL;
return true;
}
Upstream: 82c69bf58650 ("drm/i915/gt: Detect if we miss WaIdleLiteRestore")
In order to avoid confusing the HW, we must never submit an empty ring
during lite-restore; that is, we should always advance the RING_TAIL
before submitting, to stay ahead of the RING_HEAD.
Normally this is prevented by keeping a couple of spare NOPs in the
request->wa_tail so that on resubmission we can advance the tail. This
relies on the request only being resubmitted once, which is the normal
condition as it is seen once for ELSP[1] and then later in ELSP[0]. On
preemption, the requests are unwound and the tail reset back to the
normal end point (as we know the request is incomplete and therefore its
RING_HEAD is even earlier).
However, if this w/a should fail, we would try to resubmit the request
with the RING_TAIL already set to the location of this request's wa_tail,
potentially causing a GPU hang. We can spot when we incorrectly try to
resubmit without advancing the RING_TAIL, and spare any embarrassment by
forcing the context restore.
In the case of preempt-to-busy, we leave the requests running on the HW
while we unwind. As the ring is still live, we cannot rewind our
rq->tail without forcing a reload, so leave it set to rq->wa_tail and
only force a reload if we resubmit after a lite-restore. (Normally, the
forced reload will be a part of the preemption event.)
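A rough standalone model of the check this patch adds in
execlists_update_context(); the struct, the bit value and the tail
numbers are simplified stand-ins for illustration, not the driver's
definitions:

  /* Simplified model of the "same RING_TAIL twice" detection; not driver code. */
  #include <stdint.h>
  #include <stdio.h>

  #define CTX_DESC_FORCE_RESTORE (1ull << 2)   /* stand-in bit, value illustrative */

  struct ctx_model {
          uint64_t lrc_desc;
          uint32_t last_tail;                  /* stands in for the RING_TAIL register state */
  };

  static uint64_t update_context(struct ctx_model *ce, uint32_t tail)
  {
          uint64_t desc = ce->lrc_desc;

          /* Resubmitting without advancing the tail would submit an empty
           * ring during lite-restore, so force a full context restore. */
          if (ce->last_tail == tail)
                  desc |= CTX_DESC_FORCE_RESTORE;
          ce->last_tail = tail;
          return desc;
  }

  int main(void)
  {
          struct ctx_model ce = { .lrc_desc = 0x1000, .last_tail = 0x40 };

          printf("advanced tail:   %#llx\n", (unsigned long long)update_context(&ce, 0x80));
          printf("same tail again: %#llx\n", (unsigned long long)update_context(&ce, 0x80));
          return 0;
  }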
Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/673
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: stable@vger.kernel.org #5.3+
Link: https://patchwork.freedesktop.org/patch/msgid/20191209023215.3519970-1-chri…
---
drivers/gpu/drm/i915/gt/intel_lrc.c | 42 ++++++++++++++---------------
1 file changed, 20 insertions(+), 22 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 4949b5ad860f..e0e4f3deb2da 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -471,12 +471,6 @@ lrc_descriptor(struct intel_context *ce, struct intel_engine_cs *engine)
return desc;
}
-static void unwind_wa_tail(struct i915_request *rq)
-{
- rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
- assert_ring_tail_valid(rq->ring, rq->tail);
-}
-
static struct i915_request *
__unwind_incomplete_requests(struct intel_engine_cs *engine)
{
@@ -495,7 +489,6 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
continue; /* XXX */
__i915_request_unsubmit(rq);
- unwind_wa_tail(rq);
/*
* Push the request back into the queue for later resubmission.
@@ -650,13 +643,29 @@ execlists_schedule_out(struct i915_request *rq)
i915_request_put(rq);
}
-static u64 execlists_update_context(const struct i915_request *rq)
+static u64 execlists_update_context(struct i915_request *rq)
{
struct intel_context *ce = rq->hw_context;
- u64 desc;
+ u64 desc = ce->lrc_desc;
+ u32 tail;
- ce->lrc_reg_state[CTX_RING_TAIL + 1] =
- intel_ring_set_tail(rq->ring, rq->tail);
+ /*
+ * WaIdleLiteRestore:bdw,skl
+ *
+ * We should never submit the context with the same RING_TAIL twice
+ * just in case we submit an empty ring, which confuses the HW.
+ *
+ * We append a couple of NOOPs (gen8_emit_wa_tail) after the end of
+ * the normal request to be able to always advance the RING_TAIL on
+ * subsequent resubmissions (for lite restore). Should that fail us,
+ * and we try and submit the same tail again, force the context
+ * reload.
+ */
+ tail = intel_ring_set_tail(rq->ring, rq->tail);
+ if (unlikely(ce->lrc_reg_state[CTX_RING_TAIL + 1] == tail))
+ desc |= CTX_DESC_FORCE_RESTORE;
+ ce->lrc_reg_state[CTX_RING_TAIL + 1] = tail;
+ rq->tail = rq->wa_tail;
/*
* Make sure the context image is complete before we submit it to HW.
@@ -675,7 +684,6 @@ static u64 execlists_update_context(const struct i915_request *rq)
*/
mb();
- desc = ce->lrc_desc;
ce->lrc_desc &= ~CTX_DESC_FORCE_RESTORE;
return desc;
@@ -1150,16 +1158,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
if (!list_is_last(&last->sched.link,
&engine->active.requests))
return;
-
- /*
- * WaIdleLiteRestore:bdw,skl
- * Apply the wa NOOPs to prevent
- * ring:HEAD == rq:TAIL as we resubmit the
- * request. See gen8_emit_fini_breadcrumb() for
- * where we prepare the padding after the
- * end of the request.
- */
- last->tail = last->wa_tail;
}
}
--
2.25.1
Hi,
I was just looking through the current 5.4-stable queue and saw that
28553fa992cb28 ('Btrfs: fix race between shrinking truncate and fiemap')
is queued. Upstream has a follow-up fix for this: 52e29e331070cd aka
'btrfs: don't set path->leave_spinning for truncate'.
Would be nice to get those in together. I only looked at 5.4, don't
know about other queues.
thanks,
Holger
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a5ae50dea9111db63d30d700766dd5509602f7ad Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Thu, 20 Feb 2020 13:29:49 +0000
Subject: [PATCH] Btrfs: fix deadlock during fast fsync when logging prealloc
extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
This is because, when finishing an ordered extent, we first lock a file
range and then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
          CPU 1                                          CPU 2

 btrfs_sync_file()
   --> for range [0, 1MiB)
   --> inode has a size of
       1MiB and has 1 prealloc
       extent beyond the
       i_size, starting at offset
       4MiB

 flushes all delalloc for the
 range [0MiB, 1MiB) and waits
 for the respective ordered
 extents to complete

                                             --> before task at CPU 1 locks the
                                                 inode, a write into file range
                                                 [1MiB, 2MiB + 1KiB) is made

                                             --> i_size is updated to 2MiB + 1KiB

                                             --> writeback is started for that
                                                 range, [1MiB, 2MiB + 4KiB)
                                                 --> end offset rounded up to
                                                     be sector size aligned

 btrfs_log_dentry_safe()
   btrfs_log_inode_parent()
     btrfs_log_inode()
       btrfs_log_changed_extents()
         btrfs_log_prealloc_extents()
           --> does a search on the
               inode's root
           --> holds a read lock on
               leaf X

                                             btrfs_finish_ordered_io()
                                               --> locks range [1MiB, 2MiB + 4KiB)
                                                   --> end offset rounded up
                                                       to be sector size aligned
                                               --> tries to cow leaf X, through
                                                   insert_reserved_file_extent()
                                                   --> already locked by the
                                                       task at CPU 1

         btrfs_truncate_inode_items()
           --> gets an i_size of
               2MiB + 1KiB, which is
               not sector size
               aligned
           --> tries to lock file
               range [2MiB, (u64)-1)
               --> the start range
                   is rounded down
                   from 2MiB + 1K
                   to 2MiB to be sector
                   size aligned
               --> but the subrange
                   [2MiB, 2MiB + 4KiB) is
                   already locked by
                   task at CPU 2 which
                   is waiting to get a
                   write lock on leaf X
                   for which we are
                   holding a read lock

                          *** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root. Locking the range
is unnecessary for log roots, since the inode's lock and the log_mutex
already provide all the protection needed.
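A simplified standalone model of the resulting locking pattern;
TREE_LOG_OBJECTID and the helper functions here are stand-ins for
illustration only, not btrfs code:

  /* Standalone model of the conditional range locking; not btrfs code. */
  #include <stdbool.h>
  #include <stdio.h>

  #define TREE_LOG_OBJECTID 1ULL   /* stand-in for BTRFS_TREE_LOG_OBJECTID */

  static void lock_range(void)   { puts("io_tree range locked"); }
  static void unlock_range(void) { puts("io_tree range unlocked"); }

  static void truncate_inode_items(unsigned long long root_objectid)
  {
          bool locked = false;

          /* Log roots are already serialized by the inode lock and the
           * log_mutex; taking the io_tree range lock there is what
           * deadlocked against btrfs_finish_ordered_io(). */
          if (root_objectid != TREE_LOG_OBJECTID) {
                  lock_range();
                  locked = true;
          }

          /* ... drop items and update i_size here ... */

          if (locked)
                  unlock_range();
  }

  int main(void)
  {
          truncate_inode_items(TREE_LOG_OBJECTID);   /* fsync/log path: no range lock */
          truncate_inode_items(5ULL);                /* regular truncate: lock + unlock */
          return 0;
  }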
Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f47ba652b31..1ccb3f8d528d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4103,8 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
- lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
+ if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
/*
* We want to drop from the next block forward in case this new size is
@@ -4368,11 +4369,10 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
if (!ret && last_size > new_size)
last_size = new_size;
btrfs_ordered_update_i_size(inode, last_size, NULL);
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start,
+ (u64)-1, &cached_state);
}
- unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
-
btrfs_free_path(path);
return ret;
}