Some recent Lenovo and Inspur machines with Zhaoxin CPUs fail to create
/sys/class/backlight/acpi_video0 on v6.6 kernels, while the same hardware
works correctly on v5.4.
Our analysis shows that the current implementation assumes the presence of a
GPU. The backlight registration is only triggered if a GPU is detected, but on
these platforms the backlight is handled purely by the EC without any GPU.
As a result, the detection path does not create the expected backlight node.
To fix this, move the following check:
    /* Use ACPI video if available, except when native should be preferred. */
    if ((video_caps & ACPI_VIDEO_BACKLIGHT) &&
        !(native_available && prefer_native_over_acpi_video()))
        return acpi_backlight_video;
above the "if (auto_detect) *auto_detect = true;" statement, so that it runs
before auto-detection takes over.
This ensures that the ACPI video backlight node is created even when no GPU is
present, restoring the correct behavior observed on older kernels.
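For readability, here is a condensed (non-verbatim) sketch of the resulting
order of checks in __acpi_video_get_backlight_type() after this change:

	if (acpi_backlight_cmdline != acpi_backlight_undef)
		return acpi_backlight_cmdline;

	if (acpi_backlight_dmi != acpi_backlight_undef)
		return acpi_backlight_dmi;

	/* moved up: runs even when no native GPU driver ever registers */
	if ((video_caps & ACPI_VIDEO_BACKLIGHT) &&
	    !(native_available && prefer_native_over_acpi_video()))
		return acpi_backlight_video;

	if (auto_detect)
		*auto_detect = true;

	/* vendor/quirk and native fallbacks follow */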
Fixes: 78dfc9d1d1ab ("ACPI: video: Add auto_detect arg to __acpi_video_get_backlight_type()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Zihuan Zhang <zhangzihuan(a)kylinos.cn>
---
drivers/acpi/video_detect.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/video_detect.c b/drivers/acpi/video_detect.c
index d507d5e08435..c1bb22b57f56 100644
--- a/drivers/acpi/video_detect.c
+++ b/drivers/acpi/video_detect.c
@@ -1011,6 +1011,11 @@ enum acpi_backlight_type __acpi_video_get_backlight_type(bool native, bool *auto
if (acpi_backlight_dmi != acpi_backlight_undef)
return acpi_backlight_dmi;
+ /* Use ACPI video if available, except when native should be preferred. */
+ if ((video_caps & ACPI_VIDEO_BACKLIGHT) &&
+ !(native_available && prefer_native_over_acpi_video()))
+ return acpi_backlight_video;
+
if (auto_detect)
*auto_detect = true;
@@ -1024,11 +1029,6 @@ enum acpi_backlight_type __acpi_video_get_backlight_type(bool native, bool *auto
if (dell_uart_present)
return acpi_backlight_dell_uart;
- /* Use ACPI video if available, except when native should be preferred. */
- if ((video_caps & ACPI_VIDEO_BACKLIGHT) &&
- !(native_available && prefer_native_over_acpi_video()))
- return acpi_backlight_video;
-
/* Use native if available */
if (native_available)
return acpi_backlight_native;
--
2.25.1
Hi Team,
We are observing an intermittent regression in UDP fragmentation
handling between Linux kernel versions v5.10.35 and v5.15.71.
Problem description:
Our application sends UDP packets that exceed the path MTU. An
intermediate hop returns an ICMP Type 3, Code 4 (Fragmentation Needed)
message.
On v5.10.35, the kernel correctly updates the Path MTU cache, and
subsequent packets are fragmented as expected.
On v5.15.71, although the ICMP message is received by the kernel,
subsequent UDP packets are sometimes not fragmented and continue to be
dropped.
System details:
Egress interface MTU: 9192 bytes
Path MTU at intermediate hop: 1500 bytes
Kernel parameter: ip_no_pmtu_disc=0 (default)
Questions / request for feedback:
Is this a known regression in the 5.15 kernel series?
We have verified that the Path MTU cache is usually updated correctly.
Is there a way to detect or log cases where the cache is not updated?
If this issue has already been addressed, could you please point us to
the relevant fix commit so we can backport and test it?
We have reviewed several patches between v5.10.35 and v5.15.71 related
to PMTU and ICMP handling and examined the code flow,
but have not been able to pinpoint the root cause.
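For reference, the cached PMTU can be inspected from userspace (a minimal
example; the addresses are placeholders and the exact output fields vary
by kernel version):

	$ ip route get 192.0.2.1
	192.0.2.1 via 198.51.100.1 dev eth0 src 198.51.100.10
	    cache expires 598sec mtu 1500

If the ICMP Fragmentation Needed message was processed, the cache entry
should show the reduced mtu (1500 here).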
Any guidance, insights, or pointers would be greatly appreciated.
Best regards,
Chandrasekharreddy C
The iput() function is a dangerous one - if the reference counter goes
to zero, the function may block for a long time due to:
- inode_wait_for_writeback() waits until writeback on this inode
completes
- the filesystem-specific "evict_inode" callback can do similar
things; e.g. all netfs-based filesystems will call
netfs_wait_for_outstanding_io() which is similar to
inode_wait_for_writeback()
Therefore, callers must carefully evaluate the context they're in and
check whether invoking iput() is a good idea at all.
Most of the time, this is not a problem because the dcache holds
references to all inodes, and the dcache is usually the one to release
the last reference. But this assumption is fragile. For example,
under (memcg) memory pressure, the dcache shrinker is more likely to
release inode references, moving the inode eviction to contexts where
that was extremely unlikely to occur.
Our production servers "found" at least two deadlock bugs in the Ceph
filesystem that were caused by this iput() behavior:
1. Writeback may lead to iput() calls in Ceph (e.g. from
ceph_put_wrbuffer_cap_refs()) which deadlocks in
inode_wait_for_writeback(). Waiting for writeback completion from
within writeback will obviously never be able to make any progress.
This leads to blocked kworkers like this:
INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
Not tainted 6.16.7-i1-es #773
task:kworker/u777:6 state:D stack:0 pid:1270802 tgid:1270802 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: writeback wb_workfn (flush-ceph-3)
Call Trace:
<TASK>
__schedule+0x4ea/0x17d0
schedule+0x1c/0xc0
inode_wait_for_writeback+0x71/0xb0
evict+0xcf/0x200
ceph_put_wrbuffer_cap_refs+0xdd/0x220
ceph_invalidate_folio+0x97/0xc0
ceph_writepages_start+0x127b/0x14d0
do_writepages+0xba/0x150
__writeback_single_inode+0x34/0x290
writeback_sb_inodes+0x203/0x470
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x189/0x2b0
wb_workfn+0x30b/0x3d0
process_one_work+0x143/0x2b0
worker_thread+0x30a/0x450
2. In the Ceph messenger thread (net/ceph/messenger*.c), any iput()
call may invoke ceph_evict_inode() which will deadlock in
netfs_wait_for_outstanding_io(); since this blocks the messenger
thread, completions from the Ceph servers will not ever be received
and handled.
It looks like these deadlock bugs have been in the Ceph filesystem
code since forever (therefore no "Fixes" tag in this patch). There
may be various ways to solve this:
- make iput() asynchronous and defer the actual eviction like fput()
(may add overhead)
- make iput() only asynchronous if I_SYNC is set (doesn't solve random
things happening inside the "evict_inode" callback)
- add iput_deferred() to make this asynchronous behavior/overhead
optional and explicit
- refactor Ceph to avoid iput() calls from within writeback and
messenger (if that is even possible)
- add a Ceph-specific workaround
After advice from Mateusz Guzik, I decided on the last option. The
implementation is simple because it piggybacks on the existing
work_struct for ceph_queue_inode_work() - ceph_inode_work() calls
iput() at the end which means we can donate the last reference to it.
This patch adds ceph_iput_async() and converts lots of iput() calls to
it - at least those that may come through writeback and the messenger.
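For context, an abridged sketch of the existing worker that ends up
consuming the donated reference (simplified from fs/ceph/inode.c; the
CEPH_I_WORK_* handlers are elided):

	static void ceph_inode_work(struct work_struct *work)
	{
		struct ceph_inode_info *ci = container_of(work,
				struct ceph_inode_info, i_work);
		struct inode *inode = &ci->netfs.inode;

		/* ... handle the CEPH_I_WORK_* bits in i_work_mask ... */

		iput(inode);	/* consumes the donated reference */
	}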
Signed-off-by: Max Kellermann <max.kellermann(a)ionos.com>
Cc: Mateusz Guzik <mjguzik(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
v1->v2: use atomic_add_unless() instead of unconditionally decrementing
i_count, to avoid letting i_count drop to zero which may cause races
(thanks Mateusz Guzik)
Signed-off-by: Max Kellermann <max.kellermann(a)ionos.com>
---
fs/ceph/addr.c | 2 +-
fs/ceph/caps.c | 16 ++++++++--------
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 34 ++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.c | 30 +++++++++++++++---------------
fs/ceph/quota.c | 4 ++--
fs/ceph/snap.c | 10 +++++-----
fs/ceph/super.h | 2 ++
8 files changed, 68 insertions(+), 32 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 322ed268f14a..fc497c91530e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -265,7 +265,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
subreq->error = err;
trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
netfs_read_subreq_terminated(subreq);
- iput(req->r_inode);
+ ceph_iput_async(req->r_inode);
ceph_dec_osd_stopping_blocker(fsc->mdsc);
}
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index b1a8ff612c41..af9e3ae9ab7e 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1771,7 +1771,7 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
spin_unlock(&mdsc->snap_flush_lock);
if (need_put)
- iput(inode);
+ ceph_iput_async(inode);
}
/*
@@ -3319,7 +3319,7 @@ static void __ceph_put_cap_refs(struct ceph_inode_info *ci, int had,
if (wake)
wake_up_all(&ci->i_cap_wq);
while (put-- > 0)
- iput(inode);
+ ceph_iput_async(inode);
}
void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
@@ -3419,7 +3419,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
if (complete_capsnap)
wake_up_all(&ci->i_cap_wq);
while (put-- > 0) {
- iput(inode);
+ ceph_iput_async(inode);
}
}
@@ -3917,7 +3917,7 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
if (wake_mdsc)
wake_up_all(&mdsc->cap_flushing_wq);
if (drop)
- iput(inode);
+ ceph_iput_async(inode);
}
void __ceph_remove_capsnap(struct inode *inode, struct ceph_cap_snap *capsnap,
@@ -4008,7 +4008,7 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
wake_up_all(&ci->i_cap_wq);
if (wake_mdsc)
wake_up_all(&mdsc->cap_flushing_wq);
- iput(inode);
+ ceph_iput_async(inode);
}
}
@@ -4557,7 +4557,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
done:
mutex_unlock(&session->s_mutex);
done_unlocked:
- iput(inode);
+ ceph_iput_async(inode);
out:
ceph_dec_mds_stopping_blocker(mdsc);
@@ -4636,7 +4636,7 @@ unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
doutc(cl, "on %p %llx.%llx\n", inode,
ceph_vinop(inode));
ceph_check_caps(ci, 0);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_delay_lock);
}
@@ -4675,7 +4675,7 @@ static void flush_dirty_session_caps(struct ceph_mds_session *s)
spin_unlock(&mdsc->cap_dirty_lock);
ceph_wait_on_async_create(inode);
ceph_check_caps(ci, CHECK_CAPS_FLUSH);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_dirty_lock);
}
spin_unlock(&mdsc->cap_dirty_lock);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 32973c62c1a2..ec73ed52a227 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1290,7 +1290,7 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
ceph_mdsc_free_path_info(&path_info);
}
out:
- iput(req->r_old_inode);
+ ceph_iput_async(req->r_old_inode);
ceph_mdsc_release_dir_caps(req);
}
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f67025465de0..d7c0ed82bf62 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2191,6 +2191,40 @@ void ceph_queue_inode_work(struct inode *inode, int work_bit)
}
}
+/**
+ * Queue an asynchronous iput() call in a worker thread. Use this
+ * instead of iput() in contexts where evicting the inode is unsafe.
+ * For example, inode eviction may cause deadlocks in
+ * inode_wait_for_writeback() (when called from within writeback) or
+ * in netfs_wait_for_outstanding_io() (when called from within the
+ * Ceph messenger).
+ */
+void ceph_iput_async(struct inode *inode)
+{
+ if (unlikely(!inode))
+ return;
+
+ if (likely(atomic_add_unless(&inode->i_count, -1, 1)))
+ /* somebody else is holding another reference -
+ * nothing left to do for us
+ */
+ return;
+
+ doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
+
+ /* simply queue a ceph_inode_work() (donating the remaining
+ * reference) without setting i_work_mask bit; other than
+ * putting the reference, there is nothing to do
+ */
+ WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
+ &ceph_inode(inode)->i_work));
+
+ /* note: queue_work() cannot fail; if i_work were already
+ * queued, then it would be holding another reference, but no
+ * such reference exists
+ */
+}
+
static void ceph_do_invalidate_pages(struct inode *inode)
{
struct ceph_client *cl = ceph_inode_to_client(inode);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 3bc72b47fe4d..22d75c3be5a8 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1097,14 +1097,14 @@ void ceph_mdsc_release_request(struct kref *kref)
ceph_msg_put(req->r_reply);
if (req->r_inode) {
ceph_put_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
- iput(req->r_inode);
+ ceph_iput_async(req->r_inode);
}
if (req->r_parent) {
ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
- iput(req->r_parent);
+ ceph_iput_async(req->r_parent);
}
- iput(req->r_target_inode);
- iput(req->r_new_inode);
+ ceph_iput_async(req->r_target_inode);
+ ceph_iput_async(req->r_new_inode);
if (req->r_dentry)
dput(req->r_dentry);
if (req->r_old_dentry)
@@ -1118,7 +1118,7 @@ void ceph_mdsc_release_request(struct kref *kref)
*/
ceph_put_cap_refs(ceph_inode(req->r_old_dentry_dir),
CEPH_CAP_PIN);
- iput(req->r_old_dentry_dir);
+ ceph_iput_async(req->r_old_dentry_dir);
}
kfree(req->r_path1);
kfree(req->r_path2);
@@ -1240,7 +1240,7 @@ static void __unregister_request(struct ceph_mds_client *mdsc,
}
if (req->r_unsafe_dir) {
- iput(req->r_unsafe_dir);
+ ceph_iput_async(req->r_unsafe_dir);
req->r_unsafe_dir = NULL;
}
@@ -1413,7 +1413,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
cap = rb_entry(rb_first(&ci->i_caps), struct ceph_cap, ci_node);
if (!cap) {
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
goto random;
}
mds = cap->session->s_mds;
@@ -1422,7 +1422,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
cap == ci->i_auth_cap ? "auth " : "", cap);
spin_unlock(&ci->i_ceph_lock);
out:
- iput(inode);
+ ceph_iput_async(inode);
return mds;
random:
@@ -1841,7 +1841,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
spin_unlock(&session->s_cap_lock);
if (last_inode) {
- iput(last_inode);
+ ceph_iput_async(last_inode);
last_inode = NULL;
}
if (old_cap) {
@@ -1874,7 +1874,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
session->s_cap_iterator = NULL;
spin_unlock(&session->s_cap_lock);
- iput(last_inode);
+ ceph_iput_async(last_inode);
if (old_cap)
ceph_put_cap(session->s_mdsc, old_cap);
@@ -1904,7 +1904,7 @@ static int remove_session_caps_cb(struct inode *inode, int mds, void *arg)
if (invalidate)
ceph_queue_invalidate(inode);
while (iputs--)
- iput(inode);
+ ceph_iput_async(inode);
return 0;
}
@@ -1944,7 +1944,7 @@ static void remove_session_caps(struct ceph_mds_session *session)
spin_unlock(&session->s_cap_lock);
inode = ceph_find_inode(sb, vino);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&session->s_cap_lock);
}
@@ -2512,7 +2512,7 @@ static void ceph_cap_unlink_work(struct work_struct *work)
doutc(cl, "on %p %llx.%llx\n", inode,
ceph_vinop(inode));
ceph_check_caps(ci, CHECK_CAPS_FLUSH);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_delay_lock);
}
}
@@ -3933,7 +3933,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
!req->r_reply_info.has_create_ino) {
/* This should never happen on an async create */
WARN_ON_ONCE(req->r_deleg_ino);
- iput(in);
+ ceph_iput_async(in);
in = NULL;
}
@@ -5313,7 +5313,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
out:
mutex_unlock(&session->s_mutex);
- iput(inode);
+ ceph_iput_async(inode);
ceph_dec_mds_stopping_blocker(mdsc);
return;
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index d90eda19bcc4..bba00f8926e6 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -76,7 +76,7 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
le64_to_cpu(h->max_files));
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
out:
ceph_dec_mds_stopping_blocker(mdsc);
}
@@ -190,7 +190,7 @@ void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc)
node = rb_first(&mdsc->quotarealms_inodes);
qri = rb_entry(node, struct ceph_quotarealm_inode, node);
rb_erase(node, &mdsc->quotarealms_inodes);
- iput(qri->inode);
+ ceph_iput_async(qri->inode);
kfree(qri);
}
mutex_unlock(&mdsc->quotarealms_inodes_mutex);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index c65f2b202b2b..19f097e79b3c 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -735,7 +735,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
if (!inode)
continue;
spin_unlock(&realm->inodes_with_caps_lock);
- iput(lastinode);
+ ceph_iput_async(lastinode);
lastinode = inode;
/*
@@ -762,7 +762,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
spin_lock(&realm->inodes_with_caps_lock);
}
spin_unlock(&realm->inodes_with_caps_lock);
- iput(lastinode);
+ ceph_iput_async(lastinode);
if (capsnap)
kmem_cache_free(ceph_cap_snap_cachep, capsnap);
@@ -955,7 +955,7 @@ static void flush_snaps(struct ceph_mds_client *mdsc)
ihold(inode);
spin_unlock(&mdsc->snap_flush_lock);
ceph_flush_snaps(ci, &session);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->snap_flush_lock);
}
spin_unlock(&mdsc->snap_flush_lock);
@@ -1116,12 +1116,12 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
ceph_get_snap_realm(mdsc, realm);
ceph_change_snap_realm(inode, realm);
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
continue;
skip_inode:
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
}
/* we may have taken some of the old realm's children. */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index cf176aab0f82..361a72a67bb8 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1085,6 +1085,8 @@ static inline void ceph_queue_flush_snaps(struct inode *inode)
ceph_queue_inode_work(inode, CEPH_I_WORK_FLUSH_SNAPS);
}
+void ceph_iput_async(struct inode *inode);
+
extern int ceph_try_to_choose_auth_mds(struct inode *inode, int mask);
extern int __ceph_do_getattr(struct inode *inode, struct page *locked_page,
int mask, bool force);
--
2.47.3
The iput() function is a dangerous one - if the reference counter goes
to zero, the function may block for a long time due to:
- inode_wait_for_writeback() waits until writeback on this inode
completes
- the filesystem-specific "evict_inode" callback can do similar
things; e.g. all netfs-based filesystems will call
netfs_wait_for_outstanding_io() which is similar to
inode_wait_for_writeback()
Therefore, callers must carefully evaluate the context they're in and
check whether invoking iput() is a good idea at all.
Most of the time, this is not a problem because the dcache holds
references to all inodes, and the dcache is usually the one to release
the last reference. But this assumption is fragile. For example,
under (memcg) memory pressure, the dcache shrinker is more likely to
release inode references, moving the inode eviction to contexts where
that was extremely unlikely to occur.
Our production servers "found" at least two deadlock bugs in the Ceph
filesystem that were caused by this iput() behavior:
1. Writeback may lead to iput() calls in Ceph (e.g. from
ceph_put_wrbuffer_cap_refs()) which deadlocks in
inode_wait_for_writeback(). Waiting for writeback completion from
within writeback will obviously never be able to make any progress.
This leads to blocked kworkers like this:
INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
Not tainted 6.16.7-i1-es #773
task:kworker/u777:6 state:D stack:0 pid:1270802 tgid:1270802 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: writeback wb_workfn (flush-ceph-3)
Call Trace:
<TASK>
__schedule+0x4ea/0x17d0
schedule+0x1c/0xc0
inode_wait_for_writeback+0x71/0xb0
evict+0xcf/0x200
ceph_put_wrbuffer_cap_refs+0xdd/0x220
ceph_invalidate_folio+0x97/0xc0
ceph_writepages_start+0x127b/0x14d0
do_writepages+0xba/0x150
__writeback_single_inode+0x34/0x290
writeback_sb_inodes+0x203/0x470
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x189/0x2b0
wb_workfn+0x30b/0x3d0
process_one_work+0x143/0x2b0
worker_thread+0x30a/0x450
2. In the Ceph messenger thread (net/ceph/messenger*.c), any iput()
call may invoke ceph_evict_inode() which will deadlock in
netfs_wait_for_outstanding_io(); since this blocks the messenger
thread, completions from the Ceph servers will not ever be received
and handled.
It looks like these deadlock bugs have been in the Ceph filesystem
code since forever (therefore no "Fixes" tag in this patch). There
may be various ways to solve this:
- make iput() asynchronous and defer the actual eviction like fput()
(may add overhead)
- make iput() only asynchronous if I_SYNC is set (doesn't solve random
things happening inside the "evict_inode" callback)
- add iput_deferred() to make this asynchronous behavior/overhead
optional and explicit
- refactor Ceph to avoid iput() calls from within writeback and
messenger (if that is even possible)
- add a Ceph-specific workaround
After advice from Mateusz Guzik, I decided on the last option. The
implementation is simple because it piggybacks on the existing
work_struct for ceph_queue_inode_work() - ceph_inode_work() calls
iput() at the end which means we can donate the last reference to it.
Since Ceph has a few iput() callers in a loop, it seemed simple enough
to pass this counter and use atomic_sub() instead of atomic_dec().
This patch adds ceph_iput_n_async() and converts lots of iput() calls
to it - at least those that may come through writeback and the
messenger.
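As a usage sketch (this is the pattern used in the hunks below), a
caller that accumulated several references in a loop drops them in a
single call instead of queueing the work item once per reference:

	if (put > 0)
		ceph_iput_n_async(inode, put);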
Signed-off-by: Max Kellermann <max.kellermann(a)ionos.com>
Cc: Mateusz Guzik <mjguzik(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
fs/ceph/addr.c | 2 +-
fs/ceph/caps.c | 21 ++++++++++-----------
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.c | 32 ++++++++++++++++----------------
fs/ceph/quota.c | 4 ++--
fs/ceph/snap.c | 10 +++++-----
fs/ceph/super.h | 7 +++++++
8 files changed, 84 insertions(+), 36 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 322ed268f14a..fc497c91530e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -265,7 +265,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
subreq->error = err;
trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
netfs_read_subreq_terminated(subreq);
- iput(req->r_inode);
+ ceph_iput_async(req->r_inode);
ceph_dec_osd_stopping_blocker(fsc->mdsc);
}
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index b1a8ff612c41..bd88b5287a2b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1771,7 +1771,7 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
spin_unlock(&mdsc->snap_flush_lock);
if (need_put)
- iput(inode);
+ ceph_iput_async(inode);
}
/*
@@ -3318,8 +3318,8 @@ static void __ceph_put_cap_refs(struct ceph_inode_info *ci, int had,
}
if (wake)
wake_up_all(&ci->i_cap_wq);
- while (put-- > 0)
- iput(inode);
+ if (put > 0)
+ ceph_iput_n_async(inode, put);
}
void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
@@ -3418,9 +3418,8 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
}
if (complete_capsnap)
wake_up_all(&ci->i_cap_wq);
- while (put-- > 0) {
- iput(inode);
- }
+ if (put > 0)
+ ceph_iput_n_async(inode, put);
}
/*
@@ -3917,7 +3916,7 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
if (wake_mdsc)
wake_up_all(&mdsc->cap_flushing_wq);
if (drop)
- iput(inode);
+ ceph_iput_async(inode);
}
void __ceph_remove_capsnap(struct inode *inode, struct ceph_cap_snap *capsnap,
@@ -4008,7 +4007,7 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
wake_up_all(&ci->i_cap_wq);
if (wake_mdsc)
wake_up_all(&mdsc->cap_flushing_wq);
- iput(inode);
+ ceph_iput_async(inode);
}
}
@@ -4557,7 +4556,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
done:
mutex_unlock(&session->s_mutex);
done_unlocked:
- iput(inode);
+ ceph_iput_async(inode);
out:
ceph_dec_mds_stopping_blocker(mdsc);
@@ -4636,7 +4635,7 @@ unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
doutc(cl, "on %p %llx.%llx\n", inode,
ceph_vinop(inode));
ceph_check_caps(ci, 0);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_delay_lock);
}
@@ -4675,7 +4674,7 @@ static void flush_dirty_session_caps(struct ceph_mds_session *s)
spin_unlock(&mdsc->cap_dirty_lock);
ceph_wait_on_async_create(inode);
ceph_check_caps(ci, CHECK_CAPS_FLUSH);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_dirty_lock);
}
spin_unlock(&mdsc->cap_dirty_lock);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 32973c62c1a2..ec73ed52a227 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1290,7 +1290,7 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
ceph_mdsc_free_path_info(&path_info);
}
out:
- iput(req->r_old_inode);
+ ceph_iput_async(req->r_old_inode);
ceph_mdsc_release_dir_caps(req);
}
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f67025465de0..385d5261632d 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2191,6 +2191,48 @@ void ceph_queue_inode_work(struct inode *inode, int work_bit)
}
}
+/**
+ * Queue an asynchronous iput() call in a worker thread. Use this
+ * instead of iput() in contexts where evicting the inode is unsafe.
+ * For example, inode eviction may cause deadlocks in
+ * inode_wait_for_writeback() (when called from within writeback) or
+ * in netfs_wait_for_outstanding_io() (when called from within the
+ * Ceph messenger).
+ *
+ * @n: how many references to put
+ */
+void ceph_iput_n_async(struct inode *inode, int n)
+{
+ if (unlikely(!inode))
+ return;
+
+ if (likely(atomic_sub_return(n, &inode->i_count) > 0))
+ /* somebody else is holding another reference -
+ * nothing left to do for us
+ */
+ return;
+
+ doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
+
+ /* the reference counter is now 0, i.e. nobody else is holding
+ * a reference to this inode; restore it to 1 and donate it to
+ * ceph_inode_work() which will call iput() at the end
+ */
+ atomic_set(&inode->i_count, 1);
+
+ /* simply queue a ceph_inode_work() without setting
+ * i_work_mask bit; other than putting the reference, there is
+ * nothing to do
+ */
+ WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
+ &ceph_inode(inode)->i_work));
+
+ /* note: queue_work() cannot fail; if i_work were already
+ * queued, then it would be holding another reference, but no
+ * such reference exists
+ */
+}
+
static void ceph_do_invalidate_pages(struct inode *inode)
{
struct ceph_client *cl = ceph_inode_to_client(inode);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 3bc72b47fe4d..d7fce1ad8073 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1097,14 +1097,14 @@ void ceph_mdsc_release_request(struct kref *kref)
ceph_msg_put(req->r_reply);
if (req->r_inode) {
ceph_put_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
- iput(req->r_inode);
+ ceph_iput_async(req->r_inode);
}
if (req->r_parent) {
ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
- iput(req->r_parent);
+ ceph_iput_async(req->r_parent);
}
- iput(req->r_target_inode);
- iput(req->r_new_inode);
+ ceph_iput_async(req->r_target_inode);
+ ceph_iput_async(req->r_new_inode);
if (req->r_dentry)
dput(req->r_dentry);
if (req->r_old_dentry)
@@ -1118,7 +1118,7 @@ void ceph_mdsc_release_request(struct kref *kref)
*/
ceph_put_cap_refs(ceph_inode(req->r_old_dentry_dir),
CEPH_CAP_PIN);
- iput(req->r_old_dentry_dir);
+ ceph_iput_async(req->r_old_dentry_dir);
}
kfree(req->r_path1);
kfree(req->r_path2);
@@ -1240,7 +1240,7 @@ static void __unregister_request(struct ceph_mds_client *mdsc,
}
if (req->r_unsafe_dir) {
- iput(req->r_unsafe_dir);
+ ceph_iput_async(req->r_unsafe_dir);
req->r_unsafe_dir = NULL;
}
@@ -1413,7 +1413,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
cap = rb_entry(rb_first(&ci->i_caps), struct ceph_cap, ci_node);
if (!cap) {
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
goto random;
}
mds = cap->session->s_mds;
@@ -1422,7 +1422,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
cap == ci->i_auth_cap ? "auth " : "", cap);
spin_unlock(&ci->i_ceph_lock);
out:
- iput(inode);
+ ceph_iput_async(inode);
return mds;
random:
@@ -1841,7 +1841,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
spin_unlock(&session->s_cap_lock);
if (last_inode) {
- iput(last_inode);
+ ceph_iput_async(last_inode);
last_inode = NULL;
}
if (old_cap) {
@@ -1874,7 +1874,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
session->s_cap_iterator = NULL;
spin_unlock(&session->s_cap_lock);
- iput(last_inode);
+ ceph_iput_async(last_inode);
if (old_cap)
ceph_put_cap(session->s_mdsc, old_cap);
@@ -1903,8 +1903,8 @@ static int remove_session_caps_cb(struct inode *inode, int mds, void *arg)
wake_up_all(&ci->i_cap_wq);
if (invalidate)
ceph_queue_invalidate(inode);
- while (iputs--)
- iput(inode);
+ if (iputs > 0)
+ ceph_iput_n_async(inode, iputs);
return 0;
}
@@ -1944,7 +1944,7 @@ static void remove_session_caps(struct ceph_mds_session *session)
spin_unlock(&session->s_cap_lock);
inode = ceph_find_inode(sb, vino);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&session->s_cap_lock);
}
@@ -2512,7 +2512,7 @@ static void ceph_cap_unlink_work(struct work_struct *work)
doutc(cl, "on %p %llx.%llx\n", inode,
ceph_vinop(inode));
ceph_check_caps(ci, CHECK_CAPS_FLUSH);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_delay_lock);
}
}
@@ -3933,7 +3933,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
!req->r_reply_info.has_create_ino) {
/* This should never happen on an async create */
WARN_ON_ONCE(req->r_deleg_ino);
- iput(in);
+ ceph_iput_async(in);
in = NULL;
}
@@ -5313,7 +5313,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
out:
mutex_unlock(&session->s_mutex);
- iput(inode);
+ ceph_iput_async(inode);
ceph_dec_mds_stopping_blocker(mdsc);
return;
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index d90eda19bcc4..bba00f8926e6 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -76,7 +76,7 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
le64_to_cpu(h->max_files));
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
out:
ceph_dec_mds_stopping_blocker(mdsc);
}
@@ -190,7 +190,7 @@ void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc)
node = rb_first(&mdsc->quotarealms_inodes);
qri = rb_entry(node, struct ceph_quotarealm_inode, node);
rb_erase(node, &mdsc->quotarealms_inodes);
- iput(qri->inode);
+ ceph_iput_async(qri->inode);
kfree(qri);
}
mutex_unlock(&mdsc->quotarealms_inodes_mutex);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index c65f2b202b2b..19f097e79b3c 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -735,7 +735,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
if (!inode)
continue;
spin_unlock(&realm->inodes_with_caps_lock);
- iput(lastinode);
+ ceph_iput_async(lastinode);
lastinode = inode;
/*
@@ -762,7 +762,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
spin_lock(&realm->inodes_with_caps_lock);
}
spin_unlock(&realm->inodes_with_caps_lock);
- iput(lastinode);
+ ceph_iput_async(lastinode);
if (capsnap)
kmem_cache_free(ceph_cap_snap_cachep, capsnap);
@@ -955,7 +955,7 @@ static void flush_snaps(struct ceph_mds_client *mdsc)
ihold(inode);
spin_unlock(&mdsc->snap_flush_lock);
ceph_flush_snaps(ci, &session);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->snap_flush_lock);
}
spin_unlock(&mdsc->snap_flush_lock);
@@ -1116,12 +1116,12 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
ceph_get_snap_realm(mdsc, realm);
ceph_change_snap_realm(inode, realm);
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
continue;
skip_inode:
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
}
/* we may have taken some of the old realm's children. */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index cf176aab0f82..15c09b6c94aa 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1085,6 +1085,13 @@ static inline void ceph_queue_flush_snaps(struct inode *inode)
ceph_queue_inode_work(inode, CEPH_I_WORK_FLUSH_SNAPS);
}
+void ceph_iput_n_async(struct inode *inode, int n);
+
+static inline void ceph_iput_async(struct inode *inode)
+{
+ ceph_iput_n_async(inode, 1);
+}
+
extern int ceph_try_to_choose_auth_mds(struct inode *inode, int mask);
extern int __ceph_do_getattr(struct inode *inode, struct page *locked_page,
int mask, bool force);
--
2.47.3
The VT-d specification, Section 7.10, "Software Steps to Drain Page Requests &
Responses," requires software to submit an Invalidation Wait Descriptor
(inv_wait_dsc) with the Page-request Drain (PD=1) flag set, along with
the Invalidation Wait Completion Status Write flag (SW=1). It then waits
for the Invalidation Wait Descriptor's completion.
However, the PD field in the Invalidation Wait Descriptor is optional, as
stated in Section 6.5.2.9, "Invalidation Wait Descriptor":
"Page-request Drain (PD): Remapping hardware implementations reporting
Page-request draining as not supported (PDS = 0 in ECAP_REG) treat this
field as reserved."
This implies that if the IOMMU doesn't support the PDS capability, software
can't drain page requests and group responses as expected.
Do not enable PCI/PRI if the IOMMU doesn't support PDS.
Reported-by: Joel Granados <joel.granados(a)kernel.org>
Closes: https://lore.kernel.org/r/20250909-jag-pds-v1-1-ad8cba0e494e@kernel.org
Fixes: 66ac4db36f4c ("iommu/vt-d: Add page request draining support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com>
---
drivers/iommu/intel/iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9c3ab9d9f69a..92759a8f8330 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3812,7 +3812,7 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
}
if (info->ats_supported && ecap_prs(iommu->ecap) &&
- pci_pri_supported(pdev))
+ ecap_pds(iommu->ecap) && pci_pri_supported(pdev))
info->pri_supported = 1;
}
}
--
2.43.0
ep_events_available() checks for available events by looking at ep->rdllist
and ep->ovflist. However, this is done without a lock, so the returned
value is not reliable: both checks on ep->rdllist and ep->ovflist can be
false while ep_start_scan() or ep_done_scan() is executing on another CPU,
even though events are available.
This bug can be observed by:
1. Create an eventpoll with at least one ready level-triggered event
2. Create multiple threads who do epoll_wait() with zero timeout. The
threads do not consume the events, therefore all epoll_wait() should
return at least one event.
If one thread is executing ep_events_available() while another thread is
executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
return no event for the former thread.
This reproducer is implemented as TEST(epoll65) in
tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
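A minimal userspace sketch of the race (not the selftest itself; it
uses an eventfd as the level-triggered, permanently-ready source):

	#include <sys/epoll.h>
	#include <sys/eventfd.h>
	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	static int epfd;

	static void *waiter(void *arg)
	{
		struct epoll_event ev;

		/* zero timeout: the ready event must still be reported */
		if (epoll_wait(epfd, &ev, 1, 0) == 0)
			printf("missed a ready event\n");	/* the bug */
		return NULL;
	}

	int main(void)
	{
		struct epoll_event ev = { .events = EPOLLIN };
		uint64_t one = 1;
		pthread_t t[8];
		int i, efd = eventfd(0, 0);

		epfd = epoll_create1(0);
		epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
		write(efd, &one, sizeof(one));	/* stays ready (LT) */

		for (i = 0; i < 8; i++)
			pthread_create(&t[i], NULL, waiter, NULL);
		for (i = 0; i < 8; i++)
			pthread_join(t[i], NULL);
		return 0;
	}

Without the fix, an epoll_wait() call can return 0 here when it races
with another thread running inside ep_start_scan()/ep_done_scan().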
Fix it by skipping ep_events_available() and calling ep_try_send_events()
directly.
epoll_sendevents() (io_uring) suffers the same problem, fix that as well.
ep_busy_loop() still uses ep_events_available() without the lock, but
that is probably okay (?) for busy-polling.
Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout")
Fixes: ae3a4f1fdc2c ("eventpoll: add epoll_sendevents() helper")
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
Cc: stable(a)vger.kernel.org
---
fs/eventpoll.c | 16 ++--------------
1 file changed, 2 insertions(+), 14 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0fbf5dfedb24..541481eafc20 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
int maxevents, struct timespec64 *timeout)
{
- int res, eavail, timed_out = 0;
+ int res, eavail = 1, timed_out = 0;
u64 slack = 0;
wait_queue_entry_t wait;
ktime_t expires, *to = NULL;
@@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
timed_out = 1;
}
- /*
- * This call is racy: We may or may not see events that are being added
- * to the ready list under the lock (e.g., in IRQ callbacks). For cases
- * with a non-zero timeout, this thread will check the ready list under
- * lock and will add to the wait queue. For cases with a zero
- * timeout, the user by definition should not care and will have to
- * recheck again.
- */
- eavail = ep_events_available(ep);
-
while (1) {
if (eavail) {
res = ep_try_send_events(ep, events, maxevents);
@@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
* Racy call, but that's ok - it should get retried based on
* poll readiness anyway.
*/
- if (ep_events_available(ep))
- return ep_try_send_events(ep, events, maxevents);
- return 0;
+ return ep_try_send_events(ep, events, maxevents);
}
/*
--
2.39.5
The patch titled
Subject: selftests/mm: skip soft-dirty tests when CONFIG_MEM_SOFT_DIRTY is disabled
has been added to the -mm mm-new branch. Its filename is
selftests-mm-skip-soft-dirty-tests-when-config_mem_soft_dirty-is-disabled.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lance Yang <lance.yang(a)linux.dev>
Subject: selftests/mm: skip soft-dirty tests when CONFIG_MEM_SOFT_DIRTY is disabled
Date: Wed, 17 Sep 2025 21:31:37 +0800
The madv_populate and soft-dirty kselftests currently fail on systems
where CONFIG_MEM_SOFT_DIRTY is disabled.
Introduce a new helper softdirty_supported() into vm_util.c/h to ensure
tests are properly skipped when the feature is not enabled.
Link: https://lkml.kernel.org/r/20250917133137.62802-1-lance.yang@linux.dev
Fixes: 9f3265db6ae8 ("selftests: vm: add test for Soft-Dirty PTE bit")
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
Acked-by: David Hildenbrand <david(a)redhat.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Gabriel Krisman Bertazi <krisman(a)collabora.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/madv_populate.c | 21 +------------------
tools/testing/selftests/mm/soft-dirty.c | 5 +++-
tools/testing/selftests/mm/vm_util.c | 17 +++++++++++++++
tools/testing/selftests/mm/vm_util.h | 1
4 files changed, 24 insertions(+), 20 deletions(-)
--- a/tools/testing/selftests/mm/madv_populate.c~selftests-mm-skip-soft-dirty-tests-when-config_mem_soft_dirty-is-disabled
+++ a/tools/testing/selftests/mm/madv_populate.c
@@ -264,23 +264,6 @@ static void test_softdirty(void)
munmap(addr, SIZE);
}
-static int system_has_softdirty(void)
-{
- /*
- * There is no way to check if the kernel supports soft-dirty, other
- * than by writing to a page and seeing if the bit was set. But the
- * tests are intended to check that the bit gets set when it should, so
- * doing that check would turn a potentially legitimate fail into a
- * skip. Fortunately, we know for sure that arm64 does not support
- * soft-dirty. So for now, let's just use the arch as a corse guide.
- */
-#if defined(__aarch64__)
- return 0;
-#else
- return 1;
-#endif
-}
-
int main(int argc, char **argv)
{
int nr_tests = 16;
@@ -288,7 +271,7 @@ int main(int argc, char **argv)
pagesize = getpagesize();
- if (system_has_softdirty())
+ if (softdirty_supported())
nr_tests += 5;
ksft_print_header();
@@ -300,7 +283,7 @@ int main(int argc, char **argv)
test_holes();
test_populate_read();
test_populate_write();
- if (system_has_softdirty())
+ if (softdirty_supported())
test_softdirty();
err = ksft_get_fail_cnt();
--- a/tools/testing/selftests/mm/soft-dirty.c~selftests-mm-skip-soft-dirty-tests-when-config_mem_soft_dirty-is-disabled
+++ a/tools/testing/selftests/mm/soft-dirty.c
@@ -200,8 +200,11 @@ int main(int argc, char **argv)
int pagesize;
ksft_print_header();
- ksft_set_plan(15);
+ if (!softdirty_supported())
+ ksft_exit_skip("soft-dirty is not supported\n");
+
+ ksft_set_plan(15);
pagemap_fd = open(PAGEMAP_FILE_PATH, O_RDONLY);
if (pagemap_fd < 0)
ksft_exit_fail_msg("Failed to open %s\n", PAGEMAP_FILE_PATH);
--- a/tools/testing/selftests/mm/vm_util.c~selftests-mm-skip-soft-dirty-tests-when-config_mem_soft_dirty-is-disabled
+++ a/tools/testing/selftests/mm/vm_util.c
@@ -449,6 +449,23 @@ bool check_vmflag_pfnmap(void *addr)
return check_vmflag(addr, "pf");
}
+bool softdirty_supported(void)
+{
+ char *addr;
+ bool supported = false;
+ const size_t pagesize = getpagesize();
+
+ /* New mappings are expected to be marked with VM_SOFTDIRTY (sd). */
+ addr = mmap(0, pagesize, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (!addr)
+ ksft_exit_fail_msg("mmap failed\n");
+
+ supported = check_vmflag(addr, "sd");
+ munmap(addr, pagesize);
+ return supported;
+}
+
/*
* Open an fd at /proc/$pid/maps and configure procmap_out ready for
* PROCMAP_QUERY query. Returns 0 on success, or an error code otherwise.
--- a/tools/testing/selftests/mm/vm_util.h~selftests-mm-skip-soft-dirty-tests-when-config_mem_soft_dirty-is-disabled
+++ a/tools/testing/selftests/mm/vm_util.h
@@ -104,6 +104,7 @@ bool find_vma_procmap(struct procmap_fd
int close_procmap(struct procmap_fd *procmap);
int write_sysfs(const char *file_path, unsigned long val);
int read_sysfs(const char *file_path, unsigned long *val);
+bool softdirty_supported(void);
static inline int open_self_procmap(struct procmap_fd *procmap_out)
{
_
Patches currently in -mm which might be from lance.yang(a)linux.dev are
hung_task-fix-warnings-caused-by-unaligned-lock-pointers.patch
mm-skip-mlocked-thps-that-are-underused-early-in-deferred_split_scan.patch
selftests-mm-skip-soft-dirty-tests-when-config_mem_soft_dirty-is-disabled.patch
From: Frédéric Danis <frederic.danis(a)collabora.com>
OBEX download from iPhone is currently slow due to small packet size
used to transfer data which doesn't follow the MTU negotiated during
L2CAP connection, i.e. 672 bytes instead of 32767:
< ACL Data TX: Handle 11 flags 0x00 dlen 12
L2CAP: Connection Request (0x02) ident 18 len 4
PSM: 4103 (0x1007)
Source CID: 72
> ACL Data RX: Handle 11 flags 0x02 dlen 16
L2CAP: Connection Response (0x03) ident 18 len 8
Destination CID: 14608
Source CID: 72
Result: Connection successful (0x0000)
Status: No further information available (0x0000)
< ACL Data TX: Handle 11 flags 0x00 dlen 27
L2CAP: Configure Request (0x04) ident 20 len 19
Destination CID: 14608
Flags: 0x0000
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 26
L2CAP: Configure Request (0x04) ident 72 len 18
Destination CID: 72
Flags: 0x0000
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 0
Monitor timeout: 0
Maximum PDU size: 65527
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
< ACL Data TX: Handle 11 flags 0x00 dlen 29
L2CAP: Configure Response (0x05) ident 72 len 21
Source CID: 14608
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 672
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 32
L2CAP: Configure Response (0x05) ident 20 len 24
Source CID: 72
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
...
> ACL Data RX: Handle 11 flags 0x02 dlen 680
Channel: 72 len 676 ctrl 0x0202 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Unsegmented TxSeq 1 ReqSeq 2
< ACL Data TX: Handle 11 flags 0x00 dlen 13
Channel: 14608 len 9 ctrl 0x0204 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Unsegmented TxSeq 2 ReqSeq 2
> ACL Data RX: Handle 11 flags 0x02 dlen 680
Channel: 72 len 676 ctrl 0x0304 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Unsegmented TxSeq 2 ReqSeq 3
The MTUs are negotiated for each direction. In these traces, 32767 for
iPhone->localhost and no MTU for localhost->iPhone, which, based on
'4.4 L2CAP_CONFIGURATION_REQ' (Core specification v5.4, Vol. 3, Part
A):
The only parameters that should be included in the
L2CAP_CONFIGURATION_REQ packet are those that require different
values than the default or previously agreed values.
...
Any missing configuration parameters are assumed to have their
most recently explicitly or implicitly accepted values.
and '5.1 Maximum transmission unit (MTU)':
If the remote device sends a positive L2CAP_CONFIGURATION_RSP
packet it should include the actual MTU to be used on this channel
for traffic flowing into the local device.
...
The default value is 672 octets.
is set by BlueZ to 672 bytes.
It seems that the iPhone uses the lowest negotiated value to transfer
data to the localhost instead of the one negotiated for the incoming
direction.
This can be fixed by using the MTU negotiated for the other direction,
if it exists, in the L2CAP_CONFIGURATION_RSP.
This allows the use of segmented packets, as in the following traces:
< ACL Data TX: Handle 11 flags 0x00 dlen 12
L2CAP: Connection Request (0x02) ident 22 len 4
PSM: 4103 (0x1007)
Source CID: 72
< ACL Data TX: Handle 11 flags 0x00 dlen 27
L2CAP: Configure Request (0x04) ident 24 len 19
Destination CID: 2832
Flags: 0x0000
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 26
L2CAP: Configure Request (0x04) ident 15 len 18
Destination CID: 72
Flags: 0x0000
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 0
Monitor timeout: 0
Maximum PDU size: 65527
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
< ACL Data TX: Handle 11 flags 0x00 dlen 29
L2CAP: Configure Response (0x05) ident 15 len 21
Source CID: 2832
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 32
L2CAP: Configure Response (0x05) ident 24 len 24
Source CID: 72
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
...
> ACL Data RX: Handle 11 flags 0x02 dlen 1009
Channel: 72 len 1005 ctrl 0x4202 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Start (len 21884) TxSeq 1 ReqSeq 2
> ACL Data RX: Handle 11 flags 0x02 dlen 1009
Channel: 72 len 1005 ctrl 0xc204 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Continuation TxSeq 2 ReqSeq 2
This has been tested with kernel 5.4 and BlueZ 5.77.
Cc: stable(a)vger.kernel.org
Signed-off-by: Frédéric Danis <frederic.danis(a)collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
---
net/bluetooth/l2cap_core.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
index a5bde5db58ef..40daa38276f3 100644
--- a/net/bluetooth/l2cap_core.c
+++ b/net/bluetooth/l2cap_core.c
@@ -3415,7 +3415,7 @@ static int l2cap_parse_conf_req(struct l2cap_chan *chan, void *data, size_t data
struct l2cap_conf_rfc rfc = { .mode = L2CAP_MODE_BASIC };
struct l2cap_conf_efs efs;
u8 remote_efs = 0;
- u16 mtu = L2CAP_DEFAULT_MTU;
+ u16 mtu = 0;
u16 result = L2CAP_CONF_SUCCESS;
u16 size;
@@ -3520,6 +3520,13 @@ static int l2cap_parse_conf_req(struct l2cap_chan *chan, void *data, size_t data
/* Configure output options and let the other side know
* which ones we don't like. */
+ /* If MTU is not provided in configure request, use the most recently
+ * explicitly or implicitly accepted value for the other direction,
+ * or the default value.
+ */
+ if (mtu == 0)
+ mtu = chan->imtu ? chan->imtu : L2CAP_DEFAULT_MTU;
+
if (mtu < L2CAP_DEFAULT_MIN_MTU)
result = L2CAP_CONF_UNACCEPT;
else {
--
2.45.2
Forbid USB runtime PM (autosuspend) for AX88772* in bind.
usbnet enables runtime PM by default in probe, so disabling it via the
usb_driver flag is ineffective. For AX88772B, autosuspend shows no
measurable power saving in my tests (no link partner, admin up/down).
The ~0.453 W -> ~0.248 W reduction on 6.1 comes from phylib powering
the PHY off on admin-down, not from USB autosuspend.
With autosuspend active, resume paths may require calling phylink/phylib
(caller must hold RTNL) and doing MDIO I/O. Taking RTNL from a USB PM
resume can deadlock (RTNL may already be held), and MDIO can attempt a
runtime-wake while the USB PM lock is held. Given the lack of benefit
and poor test coverage (autosuspend is usually disabled by default in
distros), forbid runtime PM here to avoid these hazards.
This affects only AX88772* devices (per-interface in bind). System
sleep/resume is unchanged.
Fixes: 4a2c7217cd5a ("net: usb: asix: ax88772: manage PHY PM from MAC")
Reported-by: Hubert Wiśniewski <hubert.wisniewski.25632(a)gmail.com>
Closes: https://lore.kernel.org/all/20220622141638.GE930160@montezuma.acc.umu.se
Reported-by: Marek Szyprowski <m.szyprowski(a)samsung.com>
Closes: https://lore.kernel.org/all/b5ea8296-f981-445d-a09a-2f389d7f6fdd@samsung.com
Cc: stable(a)vger.kernel.org
Signed-off-by: Oleksij Rempel <o.rempel(a)pengutronix.de>
---
Link to the measurement results:
https://lore.kernel.org/all/aMkPMa650kfKfmF4@pengutronix.de/
---
drivers/net/usb/asix_devices.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 792ddda1ad49..0d341d7e6154 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -625,6 +625,22 @@ static void ax88772_suspend(struct usbnet *dev)
asix_read_medium_status(dev, 1));
}
+/*
+ * Notes on PM callbacks and locking context:
+ *
+ * - asix_suspend()/asix_resume() are invoked for both runtime PM and
+ * system-wide suspend/resume. For struct usb_driver the ->resume()
+ * callback does not receive pm_message_t, so the resume type cannot
+ * be distinguished here.
+ *
+ * - The MAC driver must hold RTNL when calling phylink interfaces such as
+ * phylink_suspend()/resume(). Those calls will also perform MDIO I/O.
+ *
+ * - Taking RTNL and doing MDIO from a runtime-PM resume callback (while
+ * the USB PM lock is held) is fragile. Since autosuspend brings no
+ * measurable power saving for this device with current driver version, it is
+ * disabled below.
+ */
static int asix_suspend(struct usb_interface *intf, pm_message_t message)
{
struct usbnet *dev = usb_get_intfdata(intf);
@@ -919,6 +935,16 @@ static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf)
if (ret)
goto initphy_err;
+ /* Disable USB runtime PM (autosuspend) for this interface.
+ * Rationale:
+ * - No measurable power saving from autosuspend for this device.
+ * - phylink/phylib calls require caller-held RTNL and do MDIO I/O,
+ * which is unsafe from USB PM resume paths (possible RTNL already
+ * held, USB PM lock held).
+ * System suspend/resume is unaffected.
+ */
+ pm_runtime_forbid(&intf->dev);
+
return 0;
initphy_err:
@@ -948,6 +974,10 @@ static void ax88772_unbind(struct usbnet *dev, struct usb_interface *intf)
phylink_destroy(priv->phylink);
ax88772_mdio_unregister(priv);
asix_rx_fixup_common_free(dev->driver_priv);
+ /* Re-allow runtime PM on disconnect for tidiness. The interface
+ * goes away anyway, but this balances the earlier forbid for debug sanity.
+ */
+ pm_runtime_allow(&intf->dev);
}
static void ax88178_unbind(struct usbnet *dev, struct usb_interface *intf)
@@ -1600,6 +1630,10 @@ static struct usb_driver asix_driver = {
.resume = asix_resume,
.reset_resume = asix_resume,
.disconnect = usbnet_disconnect,
+ /* usbnet will force supports_autosuspend=1; we explicitly forbid RPM
+ * per-interface in bind to keep autosuspend disabled for this driver
+ * by using pm_runtime_forbid().
+ */
.supports_autosuspend = 1,
.disable_hub_initiated_lpm = 1,
};
--
2.47.3
On Wed, Sep 17, 2025 at 10:03 AM Andrei Vagin <avagin(a)google.com> wrote:
> On Wed, Sep 17, 2025 at 8:59 AM Eric Dumazet <edumazet(a)google.com> wrote:
> >
> > On Wed, Sep 17, 2025 at 8:39 AM Andrei Vagin <avagin(a)google.com> wrote:
> > >
> > > On Wed, Sep 17, 2025 at 6:53 AM Eric Dumazet <edumazet(a)google.com> wrote:
> > > >
> > > > Andrei Vagin reported that blamed commit broke CRIU.
> > > >
> > > > Indeed, while we want to keep sk_uid unchanged when a socket
> > > > is cloned, we want to clear sk->sk_ino.
> > > >
> > > > Otherwise, sock_diag might report multiple sockets sharing
> > > > the same inode number.
> > > >
> > > > Move the clearing part from sock_orphan() to sk_set_socket(sk, NULL),
> > > > called both from sock_orphan() and sk_clone_lock().
> > > >
> > > > Fixes: 5d6b58c932ec ("net: lockless sock_i_ino()")
> > > > Closes: https://lore.kernel.org/netdev/aMhX-VnXkYDpKd9V@google.com/
> > > > Closes: https://github.com/checkpoint-restore/criu/issues/2744
> > > > Reported-by: Andrei Vagin <avagin(a)google.com>
> > > > Signed-off-by: Eric Dumazet <edumazet(a)google.com>
> > >
> > > Acked-by: Andrei Vagin <avagin(a)google.com>
> > > I think we need to add `Cc: stable(a)vger.kernel.org`.
> >
> > I never do this. Note that the prior patch had no such CC.
>
> The original patch has been ported to the v6.16 kernels. According to the
> kernel documentation
> (https://www.kernel.org/doc/html/v6.5/process/stable-kernel-rules.html),
> adding Cc: stable(a)vger.kernel.org is required for automatic porting into
> stable trees. Without this tag, someone will likely need to manually request
> that this patch be ported. This is my understanding of how the stable
> branch process works, sorry if I missed something.
Andrei, I think I know pretty well what I am doing. You do not have to
explain anything to me.
Thank you.
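For context, the change discussed in this thread can be sketched as follows (a hypothetical simplification based on the commit message; the real helper lives in include/net/sock.h):

	static inline void sk_set_socket(struct sock *sk, struct socket *sock)
	{
		sk->sk_socket = sock;
		/* Detaching (sock == NULL) happens in both sock_orphan() and
		 * sk_clone_lock(); clearing the cached inode number here keeps
		 * sock_diag from reporting two sockets with the same inode.
		 */
		if (!sock)
			sk->sk_ino = 0;
	}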
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():
BUG: unable to handle page fault for address: ffffbc3840291000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary)
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
[...]
Call Trace:
<TASK>
__msan_memset+0xee/0x1a0
sha224_final+0x9e/0x350
test_hash_buffer_overruns+0x46f/0x5f0
? kmsan_get_shadow_origin_ptr+0x46/0xa0
? __pfx_test_hash_buffer_overruns+0x10/0x10
kunit_try_run_case+0x198/0xa00
This occurs when memset() is called on a buffer that is not 4-byte
aligned and extends to the end of a guard page, i.e. the next page is
unmapped.
The bug is that the loop at the end of
kmsan_internal_set_shadow_origin() accesses the wrong shadow memory
bytes when the address is not 4-byte aligned. Since each 4 bytes are
associated with an origin, it rounds the address and size so that it can
access all the origins that contain the buffer. However, when it checks
the corresponding shadow bytes for a particular origin, it incorrectly
uses the original unrounded shadow address. This results in reads from
shadow memory beyond the end of the buffer's shadow memory, which
crashes when that memory is not mapped.
To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.
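A worked example of the rounding, with made-up values (KMSAN_ORIGIN_SIZE is 4; an illustration, not the kernel's exact code):

	/* unaligned 7-byte buffer at 0x1005, shadow at s */
	pad            = 0x1005 % 4;        /* 1 */
	address        = 0x1005 - pad;      /* 0x1004 */
	aligned_shadow = s - pad;           /* shadow rounded down the same way */
	size           = ALIGN(7 + pad, 4); /* 8 bytes -> 2 origin slots */

	/*
	 * The buffer's shadow occupies [s, s + 7). The old loop read
	 * shadow_start[0] = [s, s + 4) and shadow_start[1] = [s + 4, s + 8),
	 * i.e. one byte past the buffer's shadow; if the buffer ends at a
	 * guard page, that byte is unmapped and the read faults. With
	 * aligned_shadow the reads cover [s - 1, s + 7), which stays inside
	 * mapped shadow memory.
	 */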
Fixes: 2ef3cec44c60 ("kmsan: do not wipe out origin when doing partial unpoisoning")
Cc: stable(a)vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers(a)kernel.org>
---
mm/kmsan/core.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c
index 1ea711786c522..8bca7fece47f0 100644
--- a/mm/kmsan/core.c
+++ b/mm/kmsan/core.c
@@ -193,11 +193,12 @@ depot_stack_handle_t kmsan_internal_chain_origin(depot_stack_handle_t id)
void kmsan_internal_set_shadow_origin(void *addr, size_t size, int b,
u32 origin, bool checked)
{
u64 address = (u64)addr;
- u32 *shadow_start, *origin_start;
+ void *shadow_start;
+ u32 *aligned_shadow, *origin_start;
size_t pad = 0;
KMSAN_WARN_ON(!kmsan_metadata_is_contiguous(addr, size));
shadow_start = kmsan_get_metadata(addr, KMSAN_META_SHADOW);
if (!shadow_start) {
@@ -212,13 +213,16 @@ void kmsan_internal_set_shadow_origin(void *addr, size_t size, int b,
}
return;
}
__memset(shadow_start, b, size);
- if (!IS_ALIGNED(address, KMSAN_ORIGIN_SIZE)) {
+ if (IS_ALIGNED(address, KMSAN_ORIGIN_SIZE)) {
+ aligned_shadow = shadow_start;
+ } else {
pad = address % KMSAN_ORIGIN_SIZE;
address -= pad;
+ aligned_shadow = shadow_start - pad;
size += pad;
}
size = ALIGN(size, KMSAN_ORIGIN_SIZE);
origin_start =
(u32 *)kmsan_get_metadata((void *)address, KMSAN_META_ORIGIN);
@@ -228,11 +232,11 @@ void kmsan_internal_set_shadow_origin(void *addr, size_t size, int b,
* and unconditionally overwrite the old origin slot.
* If the new origin is zero, overwrite the old origin slot iff the
* corresponding shadow slot is zero.
*/
for (int i = 0; i < size / KMSAN_ORIGIN_SIZE; i++) {
- if (origin || !shadow_start[i])
+ if (origin || !aligned_shadow[i])
origin_start[i] = origin;
}
}
struct page *kmsan_vmalloc_to_page_or_null(void *vaddr)
base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
--
2.50.1
On Wed 17-09-25 11:18:50, Eric Hagberg wrote:
> I stumbled across a problem where the 6.6.103 kernel fails the
> ioctl_loop06 test from the LTP test suite... and worse than failing
> the test, it leaves the system in a state where you can't run
> "losetup -a" again, because the /dev/loopN device the test created
> (and failed on)... hangs in a LOOP_GET_STATUS64 ioctl.
>
> It also leaves the system in a state where you can't re-kexec into a
> copy of the kernel as it gets completely hung at the point where it
> says "starting Reboot via kexec"...
Thanks for the report! Please report issues with stable kernels to
stable(a)vger.kernel.org (CCed now) because they can act on them.
> If I revert just that patch from 6.6.103 (or newer) kernels, then the
> test succeeds and doesn't leave the host in a bad state. The patch
> applied to 6.12 doesn't cause this problem, but I also see that there
> are quite a few other changes to the loop subsystem in 6.12 that never
> made it to 6.6.
>
> For now, I'll probably just revert your patch in my 6.6 kernel builds,
> but I wouldn't be surprised if others stumble across this issue as
> well, so maybe it should be reverted or fixed some other way.
Yes, I think a revert from the 6.6 stable kernel is warranted (unless
somebody has time to figure out what else is missing to make the patch
work with that stable branch).
Honza
--
Jan Kara <jack(a)suse.com>
SUSE Labs, CR
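For anyone wanting to reproduce the report, the test can be run on its own from a standard LTP install (a sketch; paths assume /opt/ltp and a runltp that supports -s PATTERN):

	cd /opt/ltp
	./runltp -s ioctl_loop06    # run only the matching test case
	losetup -a                  # hangs on affected 6.6.y kernels after the failure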
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 535fd4c98452c87537a40610abba45daf5761ec6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025091722-chatter-dyslexia-7db3@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 535fd4c98452c87537a40610abba45daf5761ec6 Mon Sep 17 00:00:00 2001
From: Hugo Villeneuve <hvilleneuve(a)dimonoff.com>
Date: Thu, 31 Jul 2025 08:44:50 -0400
Subject: [PATCH] serial: sc16is7xx: fix bug in flow control levels init
When trying to set MCR[2], XON1 is incorrectly accessed instead. And when
writing to the TCR register to configure flow control levels, we are
incorrectly writing to the MSR register. The default value of $00 is then
used for TCR, which means that selectable trigger levels in FCR are used
in place of TCR.
TCR/TLR access requires EFR[4] (enable enhanced functions) and MCR[2]
to be set. EFR[4] is already set in probe().
MCR access requires LCR[7] to be zero.
Since LCR is set to $BF when trying to set MCR[2], XON1 is incorrectly
accessed instead because MCR shares the same address space as XON1.
Since MCR[2] is unmodified and still zero, when writing to TCR we are in
fact writing to MSR because TCR/TLR registers share the same address space
as MSR/SPR.
Fix by first removing useless reconfiguration of EFR[4] (enable enhanced
functions), as it is already enabled in sc16is7xx_probe() since commit
43c51bb573aa ("sc16is7xx: make sure device is in suspend once probed").
Now LCR is $00, which means that MCR access is enabled.
Also remove regcache_cache_bypass() calls since we no longer access the
enhanced registers set, and TCR is already declared as volatile (in fact
by declaring MSR as volatile, which shares the same address).
Finally disable access to TCR/TLR registers after modifying them by
clearing MCR[2].
Note: the comment about "... and internal clock div" is wrong and can be
ignored/removed as access to internal clock div registers (DLL/DLH)
is permitted only when LCR[7] is logic 1, not when enhanced features
is enabled. And DLL/DLH access is not needed in sc16is7xx_startup().
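Put differently, the access rules the fix relies on can be sketched like this (register and helper names as in the driver; illustrative only):

	/* LCR[7] must be 0 so this hits MCR rather than XON1 (conf mode B).
	 * With EFR[4] set (done once in probe) and MCR[2] set, TCR/TLR are
	 * mapped over the MSR/SPR addresses.
	 */
	sc16is7xx_port_update(port, SC16IS7XX_MCR_REG,
			      SC16IS7XX_MCR_TCRTLR_BIT, SC16IS7XX_MCR_TCRTLR_BIT);
	/* ... program SC16IS7XX_TCR_REG ... */
	sc16is7xx_port_update(port, SC16IS7XX_MCR_REG, SC16IS7XX_MCR_TCRTLR_BIT, 0);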
Fixes: dfeae619d781 ("serial: sc16is7xx")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hugo Villeneuve <hvilleneuve(a)dimonoff.com>
Link: https://lore.kernel.org/r/20250731124451.1108864-1-hugo@hugovil.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/sc16is7xx.c b/drivers/tty/serial/sc16is7xx.c
index 3f38fba8f6ea..a668e0bb26b3 100644
--- a/drivers/tty/serial/sc16is7xx.c
+++ b/drivers/tty/serial/sc16is7xx.c
@@ -1177,17 +1177,6 @@ static int sc16is7xx_startup(struct uart_port *port)
sc16is7xx_port_write(port, SC16IS7XX_FCR_REG,
SC16IS7XX_FCR_FIFO_BIT);
- /* Enable EFR */
- sc16is7xx_port_write(port, SC16IS7XX_LCR_REG,
- SC16IS7XX_LCR_CONF_MODE_B);
-
- regcache_cache_bypass(one->regmap, true);
-
- /* Enable write access to enhanced features and internal clock div */
- sc16is7xx_port_update(port, SC16IS7XX_EFR_REG,
- SC16IS7XX_EFR_ENABLE_BIT,
- SC16IS7XX_EFR_ENABLE_BIT);
-
/* Enable TCR/TLR */
sc16is7xx_port_update(port, SC16IS7XX_MCR_REG,
SC16IS7XX_MCR_TCRTLR_BIT,
@@ -1199,7 +1188,8 @@ static int sc16is7xx_startup(struct uart_port *port)
SC16IS7XX_TCR_RX_RESUME(24) |
SC16IS7XX_TCR_RX_HALT(48));
- regcache_cache_bypass(one->regmap, false);
+ /* Disable TCR/TLR access */
+ sc16is7xx_port_update(port, SC16IS7XX_MCR_REG, SC16IS7XX_MCR_TCRTLR_BIT, 0);
/* Now, initialize the UART */
sc16is7xx_port_write(port, SC16IS7XX_LCR_REG, SC16IS7XX_LCR_WORD_LEN_8);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 535fd4c98452c87537a40610abba45daf5761ec6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025091721-speak-detoxify-e6fe@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
[commit 535fd4c98452 ("serial: sc16is7xx: fix bug in flow control levels init"), quoted in full in the 5.4 notice above]
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 535fd4c98452c87537a40610abba45daf5761ec6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025091721-moonwalk-barrack-423e@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
[commit 535fd4c98452 ("serial: sc16is7xx: fix bug in flow control levels init"), quoted in full in the 5.4 notice above]