blk_mq_freeze_queue() never terminates if one or more bios are on the plug
list and if the block device driver defines a .submit_bio() method.
This is the case for device mapper drivers. The deadlock happens because
blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because
a queue reference is held by bios on the plug list and because the
__bio_queue_enter() call in __submit_bio() waits for the queue to be
unfrozen.
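
The circular wait described above can be illustrated with a small user-space model. This is only a sketch: struct toy_queue and every toy_* helper below are hypothetical stand-ins, not kernel APIs, and sleeping is modelled as a failed call.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a request queue; not the kernel's struct request_queue. */
struct toy_queue {
	int usage;	/* models q_usage_counter */
	bool frozen;	/* models a freeze in progress */
};

/* Models bio_queue_enter(): only succeeds while the queue is not frozen. */
static bool toy_queue_enter(struct toy_queue *q)
{
	if (q->frozen)
		return false;	/* the real code would sleep here */
	q->usage++;
	return true;
}

/* Models blk_queue_exit(): drops one reference. */
static void toy_queue_exit(struct toy_queue *q)
{
	assert(q->usage > 0);
	q->usage--;
}

/* Models blk_mq_freeze_queue_wait(): the freeze completes only at zero. */
static bool toy_freeze_done(const struct toy_queue *q)
{
	return q->frozen && q->usage == 0;
}

/*
 * Models submitting a bio that already holds a queue reference because it
 * sat on the plug list. Returns true if the bio could be submitted.
 */
static bool toy_submit_plugged(struct toy_queue *q, bool skip_enter)
{
	if (!skip_enter) {
		if (!toy_queue_enter(q))
			return false;	/* stuck: the plug reference is never dropped */
		toy_queue_exit(q);
	}
	/* Submission is done, so the reference taken while plugged is released. */
	toy_queue_exit(q);
	return true;
}
```

With one plugged bio and a freeze in progress, toy_submit_plugged(&q, false) cannot make progress and toy_freeze_done() stays false. Passing skip_enter = true, which is the behavior this patch introduces for bios with BIO_ZONE_WRITE_PLUGGING set, lets the counter reach zero.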
This patch fixes the following deadlock:
Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work
Call trace:
__schedule+0xb08/0x1160
schedule+0x48/0xc8
__bio_queue_enter+0xcc/0x1d0
__submit_bio+0x100/0x1b0
submit_bio_noacct_nocheck+0x230/0x49c
blk_zone_wplug_bio_work+0x168/0x250
process_one_work+0x26c/0x65c
worker_thread+0x33c/0x498
kthread+0x110/0x134
ret_from_fork+0x10/0x20
Call trace:
__switch_to+0x230/0x410
__schedule+0xb08/0x1160
schedule+0x48/0xc8
blk_mq_freeze_queue_wait+0x78/0xb8
blk_mq_freeze_queue+0x90/0xa4
queue_attr_store+0x7c/0xf0
sysfs_kf_write+0x98/0xc8
kernfs_fop_write_iter+0x12c/0x1d4
vfs_write+0x340/0x3ac
ksys_write+0x78/0xe8
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Damien Le Moal <dlemoal(a)kernel.org>
Cc: Yu Kuai <yukuai1(a)huaweicloud.com>
Cc: Ming Lei <ming.lei(a)redhat.com>
Cc: stable(a)vger.kernel.org
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche(a)acm.org>
---
Changes compared to v1: fixed a race condition. Call bio_zone_write_plugging()
only before submitting the bio and not after it has been submitted.
block/blk-core.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index b862c66018f2..713fb3865260 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
return BLK_STS_OK;
}
+/*
+ * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
+ * set because this causes blk_mq_freeze_queue() to deadlock if
+ * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
+ * on the plug list is not necessary since a q_usage_counter reference is held
+ * while a bio is on the plug list.
+ */
static void __submit_bio(struct bio *bio)
{
/* If plug is not used, add new plug here to cache nsecs time. */
@@ -633,8 +640,12 @@ static void __submit_bio(struct bio *bio)
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
+ } else {
struct gendisk *disk = bio->bi_bdev->bd_disk;
+ bool zwp = bio_zone_write_plugging(bio);
+
+ if (unlikely(!zwp && bio_queue_enter(bio) != 0))
+ goto finish_plug;
if ((bio->bi_opf & REQ_POLLED) &&
!(disk->queue->limits.features & BLK_FEAT_POLL)) {
@@ -643,9 +654,12 @@ static void __submit_bio(struct bio *bio)
} else {
disk->fops->submit_bio(bio);
}
- blk_queue_exit(disk->queue);
+
+ if (!zwp)
+ blk_queue_exit(disk->queue);
}
+finish_plug:
blk_finish_plug(&plug);
}
Customer is reporting a really subtle issue where we get random DMAR
faults, hangs and other nasties for kernel migration jobs when stressing
stuff like s2idle/s3/s4. The explosions seem to happen somewhere
after resuming the system with splats looking something like:
PM: suspend exit
rfkill: input handler disabled
xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0
xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=24496, lrc_seqno=24496, guc_id=0, flags=0x13 in no process [-1]
xe 0000:00:02.0: [drm] GT0: Kernel-submitted job timed out
The likely cause appears to be a race between suspend cancelling the
worker that processes the free_job()'s, such that we still have pending
jobs to be freed after the cancel. Following from this, on resume the
pending_list will now contain at least one already complete job, but it
looks like we call drm_sched_resubmit_jobs(), which will then call
run_job() on everything still on the pending_list. But if the job was
already complete, then all the resources tied to the job, like the bb
itself, any memory that is being accessed, the iommu mappings etc. might
be long gone since those are usually tied to the fence signalling.
This scenario can be seen in ftrace when running a slightly modified
xe_pm (kernel was only modified to inject artificial latency into
free_job to make the race easier to hit):
xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
xe_exec_queue_stop: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
xe_exec_queue_stop: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=1, guc_state=0x0, flags=0x4
xe_exec_queue_stop: dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=0, guc_state=0x0, flags=0x3
xe_exec_queue_stop: dev=0000:00:02.0, 1:0x1, gt=1, width=1, guc_id=1, guc_state=0x0, flags=0x3
xe_exec_queue_stop: dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=2, guc_state=0x0, flags=0x3
xe_exec_queue_resubmit: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
.....
xe_exec_queue_memory_cat_error: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x3, flags=0x13
So the job_run() is clearly triggered twice for the same job, even
though the first must have already signalled to completion during
suspend. We can also see a CAT error after the re-submit.
To prevent this, try to call xe_sched_stop() to forcefully remove
anything on the pending_list that has already signalled, before we
re-submit.
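
The intended ordering can be sketched with a toy pending list. This is a user-space illustration only: struct toy_job and the toy_* helpers are hypothetical, and the real drm_sched_stop() does more work than this (for example handling fence callbacks).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a scheduler job; not struct drm_sched_job. */
struct toy_job {
	bool signalled;	/* fence already signalled, i.e. the job completed */
	int run_count;	/* how many times the job has been run */
};

/*
 * Models the stop step: drop every job whose fence has already signalled
 * and compact the array in place. Returns the new number of pending jobs.
 */
static size_t toy_sched_stop(struct toy_job **pending, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++)
		if (!pending[i]->signalled)
			pending[kept++] = pending[i];
	return kept;
}

/* Models the resubmit step: run everything still on the pending list. */
static void toy_resubmit(struct toy_job **pending, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* Running a completed job would touch resources already freed. */
		assert(!pending[i]->signalled);
		pending[i]->run_count++;
	}
}
```

Calling toy_sched_stop() before toy_resubmit() means a job that completed during suspend is never run a second time, while a job that is still in flight is run again.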
v2:
- Make sure to re-arm the fence callbacks with sched_start().
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4856
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld(a)intel.com>
Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: William Tseng <william.tseng(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.8+
---
drivers/gpu/drm/xe/xe_gpu_scheduler.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
index c250ea773491..0c8fe0461df9 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
@@ -51,7 +51,9 @@ static inline void xe_sched_tdr_queue_imm(struct xe_gpu_scheduler *sched)
static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
{
+ drm_sched_stop(&sched->base, NULL); /* remove completed jobs */
drm_sched_resubmit_jobs(&sched->base);
+ drm_sched_start(&sched->base, 0); /* re-add fence callback for pending jobs */
}
static inline bool
--
2.49.0
Hi Greg and Sasha,
Please find attached backports for 6.14 and 6.12 (which have -Wextra
enabled by default) to turn off a new warning from -Wextra in GCC 15 and
Clang 21, -Wunterminated-string-initialization, which is fatal when
CONFIG_WERROR is enabled. Please let me know if there are any issues or
questions.
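
For context, a minimal sketch of the kind of initializer this warning is about. The names toy_magic and toy_magic_matches() are made up for illustration and are not taken from the backported patches.

```c
#include <assert.h>
#include <string.h>

/*
 * Four characters in a four-byte array: valid C, but there is no room for
 * the terminating NUL, so the array is not a C string. This is the pattern
 * that -Wunterminated-string-initialization reports.
 */
static const char toy_magic[4] = "ABCD";

/* Compare with an explicit length; never treat toy_magic as a string. */
static int toy_magic_matches(const char *buf)
{
	return memcmp(buf, toy_magic, sizeof(toy_magic)) == 0;
}
```

Such fixed-size tags are normally compared with memcmp() and an explicit length, as above, so the missing NUL is intentional rather than a bug.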
Cheers,
Nathan
From: Dominique Martinet <dominique.martinet(a)atmark-techno.com>
BugLink: https://bugs.launchpad.net/bugs/2111592
commit 8a7d12d674ac ("net: usb: usbnet: fix name regression") assumed
that local addresses always came from the kernel, but some devices hand
out local mac addresses so we ended up with point-to-point devices with
a mac set by the driver, renaming to eth%d when they used to be named
usb%d.
Userspace should not rely on device name, but for the sake of stability
restore the local mac address check portion of the naming exception:
point to point devices which either have no mac set by the driver or
have a local mac handed out by the driver will keep the usb%d name.
(some USB LTE modems are known to hand out a stable mac from the locally
administered range; that mac appears to be random (different for
multiple devices) and can be reset with device-specific commands, so
while such devices would benefit from getting an OUI reserved, we have
to deal with these and might as well preserve the existing behavior
to avoid breaking fragile openwrt configurations and such on upgrade.)
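
The resulting naming rule can be sketched in user space as follows. This is an illustration under assumptions: the TOY_FLAG_* values are placeholders rather than the kernel's flag values, the toy_* helpers only mimic is_zero_ether_addr() and is_local_ether_addr(), and the later WLAN/WWAN overrides in usbnet_probe() are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Placeholder flag values for illustration; not the kernel's definitions. */
#define TOY_FLAG_ETHER		0x1u
#define TOY_FLAG_POINTTOPOINT	0x2u

/* True if all six octets are zero (no MAC set by the driver). */
static bool toy_is_zero_addr(const uint8_t mac[6])
{
	static const uint8_t zero[6];

	return memcmp(mac, zero, sizeof(zero)) == 0;
}

/* True if the locally administered bit (0x02 of the first octet) is set. */
static bool toy_is_local_addr(const uint8_t mac[6])
{
	return (mac[0] & 0x02) != 0;
}

/* Returns the name template a device would get under the restored rule. */
static const char *toy_name_format(unsigned int flags, const uint8_t mac[6])
{
	bool keep_usb;

	assert(mac != NULL);
	keep_usb = (flags & TOY_FLAG_POINTTOPOINT) != 0 &&
		   (toy_is_zero_addr(mac) || toy_is_local_addr(mac));

	if ((flags & TOY_FLAG_ETHER) != 0 && !keep_usb)
		return "eth%d";
	return "usb%d";	/* the default name */
}
```

A point-to-point device with a zero or locally administered MAC keeps usb%d; the same device with a globally unique MAC, or any non-point-to-point Ethernet device, is renamed to eth%d.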
Link: https://lkml.kernel.org/r/20241203130457.904325-1-asmadeus@codewreck.org
Fixes: 8a7d12d674ac ("net: usb: usbnet: fix name regression")
Cc: stable(a)vger.kernel.org
Tested-by: Ahmed Naseef <naseefkm(a)gmail.com>
Signed-off-by: Dominique Martinet <dominique.martinet(a)atmark-techno.com>
Acked-by: Oliver Neukum <oneukum(a)suse.com>
Link: https://patch.msgid.link/20250326-usbnet_rename-v2-1-57eb21fcff26@atmark-te…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
(cherry picked from commit 2ea396448f26d0d7d66224cb56500a6789c7ed07)
Signed-off-by: Jianlin Lv <iecedge(a)gmail.com>
---
drivers/net/usb/usbnet.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 9f66c47dc58b..08cbc8e4b361 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -178,6 +178,17 @@ int usbnet_get_ethernet_addr(struct usbnet *dev, int iMACAddress)
}
EXPORT_SYMBOL_GPL(usbnet_get_ethernet_addr);
+static bool usbnet_needs_usb_name_format(struct usbnet *dev, struct net_device *net)
+{
+ /* Point to point devices which don't have a real MAC address
+ * (or report a fake local one) have historically used the usb%d
+ * naming. Preserve this..
+ */
+ return (dev->driver_info->flags & FLAG_POINTTOPOINT) != 0 &&
+ (is_zero_ether_addr(net->dev_addr) ||
+ is_local_ether_addr(net->dev_addr));
+}
+
static void intr_complete (struct urb *urb)
{
struct usbnet *dev = urb->context;
@@ -1766,13 +1777,11 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
if (status < 0)
goto out1;
- // heuristic: "usb%d" for links we know are two-host,
- // else "eth%d" when there's reasonable doubt. userspace
- // can rename the link if it knows better.
+ /* heuristic: rename to "eth%d" if we are not sure this link
+ * is two-host (these links keep "usb%d")
+ */
if ((dev->driver_info->flags & FLAG_ETHER) != 0 &&
- ((dev->driver_info->flags & FLAG_POINTTOPOINT) == 0 ||
- /* somebody touched it*/
- !is_zero_ether_addr(net->dev_addr)))
+ !usbnet_needs_usb_name_format(dev, net))
strscpy(net->name, "eth%d", sizeof(net->name));
/* WLAN devices should always be named "wlan%d" */
if ((dev->driver_info->flags & FLAG_WLAN) != 0)
--
2.34.1
On 2025/5/23 07:31, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> btrfs: properly limit inline data extent according to block size
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> btrfs-properly-limit-inline-data-extent-according-to.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
Please drop this patch from all stable trees.
This is only for a debug feature (2K block size) that will never be
exposed to end users; it exists only to let people without a 64K page
size system test the subpage routines on x86_64.
Thanks,
Qu
>
>
>
> commit a5afc96d757771c992eb3af4629a562ec52ba1dc
> Author: Qu Wenruo <wqu(a)suse.com>
> Date: Tue Feb 25 14:30:44 2025 +1030
>
> btrfs: properly limit inline data extent according to block size
>
> [ Upstream commit 23019d3e6617a8ec99a8d2f5947aa3dd8a74a1b8 ]
>
> Btrfs utilizes inline data extent for the following cases:
>
> - Regular small files
> - Symlinks
>
> And "btrfs check" detects any file extents that are too large as an
> error.
>
> It's not a problem for 4K block size, but for the incoming smaller
> block sizes (2K), it can cause problems due to bad limits:
>
> - Non-compressed inline data extents
> We do not allow a non-compressed inline data extent to be as large as
> block size.
>
> - Symlinks
> Currently the only real limit on symlinks is 4K, which can be larger
> than 2K block size.
>
> These will cause btrfs-check to report too large file extents.
>
> Fix it by adding proper size checks for the above cases.
>
> Signed-off-by: Qu Wenruo <wqu(a)suse.com>
> Reviewed-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9ce1270addb04..0da2611fb9c85 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -623,6 +623,10 @@ static bool can_cow_file_range_inline(struct btrfs_inode *inode,
> if (size > fs_info->sectorsize)
> return false;
>
> + /* We do not allow a non-compressed extent to be as large as block size. */
> + if (data_len >= fs_info->sectorsize)
> + return false;
> +
> /* We cannot exceed the maximum inline data size. */
> if (data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info))
> return false;
> @@ -8691,7 +8695,12 @@ static int btrfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
> struct extent_buffer *leaf;
>
> name_len = strlen(symname);
> - if (name_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info))
> + /*
> + * Symlinks utilize uncompressed inline extent data, which should not
> + * reach block size.
> + */
> + if (name_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) ||
> + name_len >= fs_info->sectorsize)
> return -ENAMETOOLONG;
>
> inode = new_inode(dir->i_sb);
On 22.05.25 23:06, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> btrfs: zoned: exit btrfs_can_activate_zone if BTRFS_FS_NEED_ZONE_FINISH is set
>
> to the 6.14-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> btrfs-zoned-exit-btrfs_can_activate_zone-if-btrfs_fs.patch
> and it can be found in the queue-6.14 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
Hey Sasha,
this patch is just a readability cleanup, no reason to backport it.
Thanks,
Johannes
>
>
> commit 1136d333d91088ecf2d5189367540a84e60449a0
> Author: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
> Date: Wed Feb 12 15:05:00 2025 +0100
>
> btrfs: zoned: exit btrfs_can_activate_zone if BTRFS_FS_NEED_ZONE_FINISH is set
>
> [ Upstream commit 26b38e28162ef4ceb1e0482299820fbbd7dbcd92 ]
>
> If BTRFS_FS_NEED_ZONE_FINISH is already set for the whole filesystem, exit
> early in btrfs_can_activate_zone(). There's no need to check if
> BTRFS_FS_NEED_ZONE_FINISH needs to be set if it is already set.
>
> Reviewed-by: Naohiro Aota <naohiro.aota(a)wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
> Reviewed-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index f39656668967c..4a3e02b49f295 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -2344,6 +2344,9 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
> if (!btrfs_is_zoned(fs_info))
> return true;
>
> + if (test_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags))
> + return false;
> +
> /* Check if there is a device with active zones left */
> mutex_lock(&fs_info->chunk_mutex);
> spin_lock(&fs_info->zone_active_bgs_lock);
>