This patch series fixes delayed hw_error handling during SSR.
Patch 1 adds a wakeup to ensure hw_error is processed promptly after coredump collection.
Patch 2 corrects the timeout unit from jiffies to ms.
Changes v3:
- patch2 add Fixes tag
- Link to v2
https://lore.kernel.org/all/20251106140103.1406081-1-quic_shuaz@quicinc.com/
Changes v2:
- Split timeout conversion into a separate patch.
- Clarified commit messages and added test case description.
- Link to v1
https://lore.kernel.org/all/20251104112601.2670019-1-quic_shuaz@quicinc.com/
Shuai Zhang (2):
Bluetooth: qca: Fix delayed hw_error handling due to missing wakeup
during SSR
Bluetooth: hci_qca: Convert timeout from jiffies to ms
drivers/bluetooth/hci_qca.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--
2.34.1
From: Chuck Lever <chuck.lever(a)oracle.com>
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:
> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.
Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:
- flags |= RWF_SYNC;
with:
+ kiocb.ki_flags |= IOCB_DSYNC;
which appears to be correct given:
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
in kiocb_set_rw_flags(). However the author of that commit did not
appreciate that the previous line in kiocb_set_rw_flags() results
in IOCB_SYNC also being set:
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
IOCB_SYNC. Reviewers at the time did not catch the omission.
Reported-by: Mike Snitzer <snitzer(a)kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Cc: stable(a)vger.kernel.org
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Reviewed-by: NeilBrown <neil(a)brown.name>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
---
fs/nfsd/vfs.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5333d49910d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
stable = NFS_UNSTABLE;
init_sync_kiocb(&kiocb, file);
kiocb.ki_pos = offset;
- if (stable && !fhp->fh_use_wgather)
- kiocb.ki_flags |= IOCB_DSYNC;
+ if (likely(!fhp->fh_use_wgather)) {
+ switch (stable) {
+ case NFS_FILE_SYNC:
+ /* persist data and timestamps */
+ kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+ break;
+ case NFS_DATA_SYNC:
+ /* persist data only */
+ kiocb.ki_flags |= IOCB_DSYNC;
+ break;
+ }
+ }
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
--
2.51.0
From: Viken Dadhaniya <viken.dadhaniya(a)oss.qualcomm.com>
[ Upstream commit fc6a5b540c02d1ec624e4599f45a17f2941a5c00 ]
GENI UART driver currently supports only non-DFS (Dynamic Frequency
Scaling) mode for source frequency selection. However, to operate correctly
in DFS mode, the GENI SCLK register must be programmed with the appropriate
DFS index. Failing to do so can result in incorrect frequency selection
Add support for Dynamic Frequency Scaling (DFS) mode in the GENI UART
driver by configuring the GENI_CLK_SEL register with the appropriate DFS
index. This ensures correct frequency selection when operating in DFS mode.
Replace the UART driver-specific logic for clock selection with the GENI
common driver function to obtain the desired frequency and corresponding
clock index. This improves maintainability and consistency across
GENI-based drivers.
Signed-off-by: Viken Dadhaniya <viken.dadhaniya(a)oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250903063136.3015237-1-viken.dadhaniya@oss.qual…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- Fixes a real bug in DFS mode: The UART driver previously never
programmed the GENI DFS clock selection register, so on platforms
where the GENI core clock runs in Dynamic Frequency Scaling (DFS)
mode, UART could pick the wrong source clock and thus the wrong baud.
This change explicitly programs the DFS index so the selected source
frequency matches the computed divider.
- New write of the DFS index to the hardware register:
drivers/tty/serial/qcom_geni_serial.c:1306
- DFS clock select register and mask exist in the common header:
include/linux/soc/qcom/geni-se.h:85, include/linux/soc/qcom/geni-
se.h:145
- Uses the common GENI clock-matching helper instead of ad‑hoc logic:
The patch replaces driver-local clock rounding/tolerance code with the
GENI core’s frequency matching routine, ensuring consistent clock
selection across GENI-based drivers and improving maintainability.
- New source frequency selection via common helper:
drivers/tty/serial/qcom_geni_serial.c:1270
- Common helper is present and exported in the GENI core:
drivers/soc/qcom/qcom-geni-se.c:720
- Maintains existing divisor programming and adds a safety check: The
driver still computes and programs the serial clock divider, now with
a guard to avoid overflow of the divider field.
- Divider computation and range check:
drivers/tty/serial/qcom_geni_serial.c:1277,
drivers/tty/serial/qcom_geni_serial.c:1279
- Divider write to both M/S clock cfg registers remains as before:
drivers/tty/serial/qcom_geni_serial.c:1303,
drivers/tty/serial/qcom_geni_serial.c:1304
- Consistency with other GENI drivers already using DFS index
programming: Other GENI protocol drivers (e.g., SPI) already program
`SE_GENI_CLK_SEL` with the index returned by the common helper, so
this change aligns UART with established practice and reduces risk.
- SPI uses the same pattern: drivers/spi/spi-geni-qcom.c:383,
drivers/spi/spi-geni-qcom.c:385–386
- Small, contained, and low-risk:
- Touches a single driver file with a localized change in clock setup.
- No ABI or architectural changes; relies on existing GENI core
helpers and headers.
- Additional register write is standard and used by other GENI
drivers; masks index with `CLK_SEL_MSK`
(include/linux/soc/qcom/geni-se.h:145) for safety.
- Includes defensive error handling if no matching clock level is
found and a divider overflow guard
(drivers/tty/serial/qcom_geni_serial.c:1271–1275,
drivers/tty/serial/qcom_geni_serial.c:1279–1281).
- User impact: Without this, UART on DFS-enabled platforms can run at an
incorrect baud, causing broken serial communication (including
console). The fix directly addresses that functional issue.
- Stable backport criteria:
- Fixes an important, user-visible bug (incorrect baud under DFS).
- Minimal and self-contained change, no new features or interfaces.
- Leverages existing, widely used GENI core APIs already present in
stable series.
Note: One minor nit in the debug print includes an extra newline before
`clk_idx`, but it’s harmless and does not affect functionality
(drivers/tty/serial/qcom_geni_serial.c:1284).
drivers/tty/serial/qcom_geni_serial.c | 92 ++++++---------------------
1 file changed, 21 insertions(+), 71 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 81f385d900d06..ff401e331f1bb 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -1,5 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
-// Copyright (c) 2017-2018, The Linux foundation. All rights reserved.
+/*
+ * Copyright (c) 2017-2018, The Linux foundation. All rights reserved.
+ * Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
+ */
/* Disable MMIO tracing to prevent excessive logging of unwanted MMIO traces */
#define __DISABLE_TRACE_MMIO__
@@ -1253,75 +1256,15 @@ static int qcom_geni_serial_startup(struct uart_port *uport)
return 0;
}
-static unsigned long find_clk_rate_in_tol(struct clk *clk, unsigned int desired_clk,
- unsigned int *clk_div, unsigned int percent_tol)
-{
- unsigned long freq;
- unsigned long div, maxdiv;
- u64 mult;
- unsigned long offset, abs_tol, achieved;
-
- abs_tol = div_u64((u64)desired_clk * percent_tol, 100);
- maxdiv = CLK_DIV_MSK >> CLK_DIV_SHFT;
- div = 1;
- while (div <= maxdiv) {
- mult = (u64)div * desired_clk;
- if (mult != (unsigned long)mult)
- break;
-
- offset = div * abs_tol;
- freq = clk_round_rate(clk, mult - offset);
-
- /* Can only get lower if we're done */
- if (freq < mult - offset)
- break;
-
- /*
- * Re-calculate div in case rounding skipped rates but we
- * ended up at a good one, then check for a match.
- */
- div = DIV_ROUND_CLOSEST(freq, desired_clk);
- achieved = DIV_ROUND_CLOSEST(freq, div);
- if (achieved <= desired_clk + abs_tol &&
- achieved >= desired_clk - abs_tol) {
- *clk_div = div;
- return freq;
- }
-
- div = DIV_ROUND_UP(freq, desired_clk);
- }
-
- return 0;
-}
-
-static unsigned long get_clk_div_rate(struct clk *clk, unsigned int baud,
- unsigned int sampling_rate, unsigned int *clk_div)
-{
- unsigned long ser_clk;
- unsigned long desired_clk;
-
- desired_clk = baud * sampling_rate;
- if (!desired_clk)
- return 0;
-
- /*
- * try to find a clock rate within 2% tolerance, then within 5%
- */
- ser_clk = find_clk_rate_in_tol(clk, desired_clk, clk_div, 2);
- if (!ser_clk)
- ser_clk = find_clk_rate_in_tol(clk, desired_clk, clk_div, 5);
-
- return ser_clk;
-}
-
static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
{
struct qcom_geni_serial_port *port = to_dev_port(uport);
unsigned long clk_rate;
- unsigned int avg_bw_core;
+ unsigned int avg_bw_core, clk_idx;
unsigned int clk_div;
u32 ver, sampling_rate;
u32 ser_clk_cfg;
+ int ret;
sampling_rate = UART_OVERSAMPLING;
/* Sampling rate is halved for IP versions >= 2.5 */
@@ -1329,17 +1272,22 @@ static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
if (ver >= QUP_SE_VERSION_2_5)
sampling_rate /= 2;
- clk_rate = get_clk_div_rate(port->se.clk, baud,
- sampling_rate, &clk_div);
- if (!clk_rate) {
- dev_err(port->se.dev,
- "Couldn't find suitable clock rate for %u\n",
- baud * sampling_rate);
+ ret = geni_se_clk_freq_match(&port->se, baud * sampling_rate, &clk_idx, &clk_rate, false);
+ if (ret) {
+ dev_err(port->se.dev, "Failed to find src clk for baud rate: %d ret: %d\n",
+ baud, ret);
+ return ret;
+ }
+
+ clk_div = DIV_ROUND_UP(clk_rate, baud * sampling_rate);
+ /* Check if calculated divider exceeds maximum allowed value */
+ if (clk_div > (CLK_DIV_MSK >> CLK_DIV_SHFT)) {
+ dev_err(port->se.dev, "Calculated clock divider %u exceeds maximum\n", clk_div);
return -EINVAL;
}
- dev_dbg(port->se.dev, "desired_rate = %u, clk_rate = %lu, clk_div = %u\n",
- baud * sampling_rate, clk_rate, clk_div);
+ dev_dbg(port->se.dev, "desired_rate = %u, clk_rate = %lu, clk_div = %u\n, clk_idx = %u\n",
+ baud * sampling_rate, clk_rate, clk_div, clk_idx);
uport->uartclk = clk_rate;
port->clk_rate = clk_rate;
@@ -1359,6 +1307,8 @@ static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
writel(ser_clk_cfg, uport->membase + GENI_SER_M_CLK_CFG);
writel(ser_clk_cfg, uport->membase + GENI_SER_S_CLK_CFG);
+ /* Configure clock selection register with the selected clock index */
+ writel(clk_idx & CLK_SEL_MSK, uport->membase + SE_GENI_CLK_SEL);
return 0;
}
--
2.51.0
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
The selective fetch code doesn't handle asycn flips correctly.
There is a nonsense check for async flips in
intel_psr2_sel_fetch_config_valid() but that only gets called
for modesets/fastsets and thus does nothing for async flips.
Currently intel_async_flip_check_hw() is very unhappy as the
selective fetch code pulls in planes that are not even async
flips capable.
Reject async flips when selective fetch is enabled, until
someone fixes this properly (ie. disable selective fetch while
async flips are being issued).
Cc: stable(a)vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
---
drivers/gpu/drm/i915/display/intel_display.c | 8 ++++++++
drivers/gpu/drm/i915/display/intel_psr.c | 6 ------
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index 42ec78798666..10583592fefe 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -6020,6 +6020,14 @@ static int intel_async_flip_check_uapi(struct intel_atomic_state *state,
return -EINVAL;
}
+ /* FIXME: selective fetch should be disabled for async flips */
+ if (new_crtc_state->enable_psr2_sel_fetch) {
+ drm_dbg_kms(display->drm,
+ "[CRTC:%d:%s] async flip disallowed with PSR2 selective fetch\n",
+ crtc->base.base.id, crtc->base.name);
+ return -EINVAL;
+ }
+
for_each_oldnew_intel_plane_in_state(state, plane, old_plane_state,
new_plane_state, i) {
if (plane->pipe != crtc->pipe)
diff --git a/drivers/gpu/drm/i915/display/intel_psr.c b/drivers/gpu/drm/i915/display/intel_psr.c
index 05014ffe3ce1..65d77aea9536 100644
--- a/drivers/gpu/drm/i915/display/intel_psr.c
+++ b/drivers/gpu/drm/i915/display/intel_psr.c
@@ -1296,12 +1296,6 @@ static bool intel_psr2_sel_fetch_config_valid(struct intel_dp *intel_dp,
return false;
}
- if (crtc_state->uapi.async_flip) {
- drm_dbg_kms(display->drm,
- "PSR2 sel fetch not enabled, async flip enabled\n");
- return false;
- }
-
return crtc_state->enable_psr2_sel_fetch = true;
}
--
2.49.1
From: Steven Rostedt <rostedt(a)goodmis.org>
The function ring_buffer_map_get_reader() is a bit more strict than the
other get reader functions, and except for certain situations the
rb_get_reader_page() should not return NULL. If it does, it triggers a
warning.
This warning was triggering but after looking at why, it was because
another acceptable situation was happening and it wasn't checked for.
If the reader catches up to the writer and there's still data to be read
on the reader page, then the rb_get_reader_page() will return NULL as
there's no new page to get.
In this situation, the reader page should not be updated and no warning
should trigger.
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Vincent Donnefort <vdonnefort(a)google.com>
Reported-by: syzbot+92a3745cea5ec6360309(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/
Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/ring_buffer.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1244d2c5c384..afcd3747264d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7344,6 +7344,10 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
goto out;
}
+ /* Did the reader catch up with the writer? */
+ if (cpu_buffer->reader_page == cpu_buffer->commit_page)
+ goto out;
+
reader = rb_get_reader_page(cpu_buffer);
if (WARN_ON(!reader))
goto out;
--
2.51.0
From: Vladimir Oltean <vladimir.oltean(a)nxp.com>
[ Upstream commit 6c24a03a61a245fe34d47582898331fa034b6ccd ]
Alexander Sverdlin presents 2 problems during shutdown with the
lan9303 driver. One is specific to lan9303 and the other just happens
to reproduce there.
The first problem is that lan9303 is unique among DSA drivers in that it
calls dev_get_drvdata() at "arbitrary runtime" (not probe, not shutdown,
not remove):
phy_state_machine()
-> ...
-> dsa_user_phy_read()
-> ds->ops->phy_read()
-> lan9303_phy_read()
-> chip->ops->phy_read()
-> lan9303_mdio_phy_read()
-> dev_get_drvdata()
But we never stop the phy_state_machine(), so it may continue to run
after dsa_switch_shutdown(). Our common pattern in all DSA drivers is
to set drvdata to NULL to suppress the remove() method that may come
afterwards. But in this case it will result in an NPD.
The second problem is that the way in which we set
dp->master->dsa_ptr = NULL; is concurrent with receive packet
processing. dsa_switch_rcv() checks once whether dev->dsa_ptr is NULL,
but afterwards, rather than continuing to use that non-NULL value,
dev->dsa_ptr is dereferenced again and again without NULL checks:
dsa_master_find_slave() and many other places. In between dereferences,
there is no locking to ensure that what was valid once continues to be
valid.
Both problems have the common aspect that closing the master interface
solves them.
In the first case, dev_close(master) triggers the NETDEV_GOING_DOWN
event in dsa_slave_netdevice_event() which closes slave ports as well.
dsa_port_disable_rt() calls phylink_stop(), which synchronously stops
the phylink state machine, and ds->ops->phy_read() will thus no longer
call into the driver after this point.
In the second case, dev_close(master) should do this, as per
Documentation/networking/driver.rst:
| Quiescence
| ----------
|
| After the ndo_stop routine has been called, the hardware must
| not receive or transmit any data. All in flight packets must
| be aborted. If necessary, poll or wait for completion of
| any reset commands.
So it should be sufficient to ensure that later, when we zeroize
master->dsa_ptr, there will be no concurrent dsa_switch_rcv() call
on this master.
The addition of the netif_device_detach() function is to ensure that
ioctls, rtnetlinks and ethtool requests on the slave ports no longer
propagate down to the driver - we're no longer prepared to handle them.
The race condition actually did not exist when commit 0650bf52b31f
("net: dsa: be compatible with masters which unregister on shutdown")
first introduced dsa_switch_shutdown(). It was created later, when we
stopped unregistering the slave interfaces from a bad spot, and we just
replaced that sequence with a racy zeroization of master->dsa_ptr
(one which doesn't ensure that the interfaces aren't up).
Reported-by: Alexander Sverdlin <alexander.sverdlin(a)siemens.com>
Closes: https://lore.kernel.org/netdev/2d2e3bba17203c14a5ffdabc174e3b6bbb9ad438.cam…
Closes: https://lore.kernel.org/netdev/c1bf4de54e829111e0e4a70e7bd1cf523c9550ff.cam…
Fixes: ee534378f005 ("net: dsa: fix panic when DSA master device unbinds on shutdown")
Reviewed-by: Alexander Sverdlin <alexander.sverdlin(a)siemens.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin(a)siemens.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Link: https://patch.msgid.link/20240913203549.3081071-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
[ Modification: Using dp->master and dp->slave instead of dp->conduit and dp->user ]
Signed-off-by: Rajani Kantha <681739313(a)139.com>
---
net/dsa/dsa.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 07736edc8b6a..c9bf1a9a6c99 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -1613,6 +1613,7 @@ EXPORT_SYMBOL_GPL(dsa_unregister_switch);
void dsa_switch_shutdown(struct dsa_switch *ds)
{
struct net_device *master, *slave_dev;
+ LIST_HEAD(close_list);
struct dsa_port *dp;
mutex_lock(&dsa2_mutex);
@@ -1622,10 +1623,16 @@ void dsa_switch_shutdown(struct dsa_switch *ds)
rtnl_lock();
+ dsa_switch_for_each_cpu_port(dp, ds)
+ list_add(&dp->master->close_list, &close_list);
+
+ dev_close_many(&close_list, true);
+
dsa_switch_for_each_user_port(dp, ds) {
master = dsa_port_to_master(dp);
slave_dev = dp->slave;
+ netif_device_detach(slave_dev);
netdev_upper_dev_unlink(master, slave_dev);
}
--
2.17.1
[ Upstream commit f04aad36a07cc17b7a5d5b9a2d386ce6fae63e93 ]
syzkaller discovered the following crash: (kernel BUG)
[ 44.607039] ------------[ cut here ]------------
[ 44.607422] kernel BUG at mm/userfaultfd.c:2067!
[ 44.608148] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 44.608814] CPU: 1 UID: 0 PID: 2475 Comm: reproducer Not tainted 6.16.0-rc6 #1 PREEMPT(none)
[ 44.609635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 44.610695] RIP: 0010:userfaultfd_release_all+0x3a8/0x460
<snip other registers, drop unreliable trace>
[ 44.617726] Call Trace:
[ 44.617926] <TASK>
[ 44.619284] userfaultfd_release+0xef/0x1b0
[ 44.620976] __fput+0x3f9/0xb60
[ 44.621240] fput_close_sync+0x110/0x210
[ 44.622222] __x64_sys_close+0x8f/0x120
[ 44.622530] do_syscall_64+0x5b/0x2f0
[ 44.622840] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 44.623244] RIP: 0033:0x7f365bb3f227
Kernel panics because it detects UFFD inconsistency during
userfaultfd_release_all(). Specifically, a VMA which has a valid pointer
to vma->vm_userfaultfd_ctx, but no UFFD flags in vma->vm_flags.
The inconsistency is caused in ksm_madvise(): when user calls madvise()
with MADV_UNMEARGEABLE on a VMA that is registered for UFFD in MINOR mode,
it accidentally clears all flags stored in the upper 32 bits of
vma->vm_flags.
Assuming x86_64 kernel build, unsigned long is 64-bit and unsigned int and
int are 32-bit wide. This setup causes the following mishap during the &=
~VM_MERGEABLE assignment.
VM_MERGEABLE is a 32-bit constant of type unsigned int, 0x8000'0000.
After ~ is applied, it becomes 0x7fff'ffff unsigned int, which is then
promoted to unsigned long before the & operation. This promotion fills
upper 32 bits with leading 0s, as we're doing unsigned conversion (and
even for a signed conversion, this wouldn't help as the leading bit is 0).
& operation thus ends up AND-ing vm_flags with 0x0000'0000'7fff'ffff
instead of intended 0xffff'ffff'7fff'ffff and hence accidentally clears
the upper 32-bits of its value.
Fix it by changing `VM_MERGEABLE` constant to unsigned long, using the
BIT() macro.
Note: other VM_* flags are not affected: This only happens to the
VM_MERGEABLE flag, as the other VM_* flags are all constants of type int
and after ~ operation, they end up with leading 1 and are thus converted
to unsigned long with leading 1s.
Note 2:
After commit 31defc3b01d9 ("userfaultfd: remove (VM_)BUG_ON()s"), this is
no longer a kernel BUG, but a WARNING at the same place:
[ 45.595973] WARNING: CPU: 1 PID: 2474 at mm/userfaultfd.c:2067
but the root-cause (flag-drop) remains the same.
[akpm(a)linux-foundation.org: rust bindgen wasn't able to handle BIT(), from Miguel]
Link: https://lore.kernel.org/oe-kbuild-all/202510030449.VfSaAjvd-lkp@intel.com/
Link: https://lkml.kernel.org/r/20251001090353.57523-2-acsjakub@amazon.de
Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode")
Signed-off-by: Jakub Acs <acsjakub(a)amazon.de>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: SeongJae Park <sj(a)kernel.org>
Tested-by: Alice Ryhl <aliceryhl(a)google.com>
Tested-by: Miguel Ojeda <miguel.ojeda.sandonis(a)gmail.com>
Cc: Xu Xin <xu.xin16(a)zte.com.cn>
Cc: Chengming Zhou <chengming.zhou(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
[ acsjakub: drop rust-compatibility change (no rust in 5.4) ]
Signed-off-by: Jakub Acs <acsjakub(a)amazon.de>
---
Why sending to stable version from before "fixes"?
In the original patch, I set fixes tag to the change that allows the
panic to manifest, not to the one that is real root-cause of the
problem.
The change that introduced the root-cause of the problem is:
f8af4da3b4c1 ("ksm: the mm interface to ksm"), as pointed out by
Vlastimil in [1].
Hence, as the older kernels can be affected by the flag-drop as well,
backport to older kernels.
[1]: https://lore.kernel.org/all/13c7242e-3a40-469b-9e99-8a65a21449bb@suse.cz/
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 57cba6e4fdcd..be8c793233d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -293,7 +293,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
-#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#define VM_MERGEABLE BIT(31) /* KSM may merge identical pages */
#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
[ Upstream commit f04aad36a07cc17b7a5d5b9a2d386ce6fae63e93 ]
syzkaller discovered the following crash: (kernel BUG)
[ 44.607039] ------------[ cut here ]------------
[ 44.607422] kernel BUG at mm/userfaultfd.c:2067!
[ 44.608148] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 44.608814] CPU: 1 UID: 0 PID: 2475 Comm: reproducer Not tainted 6.16.0-rc6 #1 PREEMPT(none)
[ 44.609635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 44.610695] RIP: 0010:userfaultfd_release_all+0x3a8/0x460
<snip other registers, drop unreliable trace>
[ 44.617726] Call Trace:
[ 44.617926] <TASK>
[ 44.619284] userfaultfd_release+0xef/0x1b0
[ 44.620976] __fput+0x3f9/0xb60
[ 44.621240] fput_close_sync+0x110/0x210
[ 44.622222] __x64_sys_close+0x8f/0x120
[ 44.622530] do_syscall_64+0x5b/0x2f0
[ 44.622840] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 44.623244] RIP: 0033:0x7f365bb3f227
Kernel panics because it detects UFFD inconsistency during
userfaultfd_release_all(). Specifically, a VMA which has a valid pointer
to vma->vm_userfaultfd_ctx, but no UFFD flags in vma->vm_flags.
The inconsistency is caused in ksm_madvise(): when user calls madvise()
with MADV_UNMEARGEABLE on a VMA that is registered for UFFD in MINOR mode,
it accidentally clears all flags stored in the upper 32 bits of
vma->vm_flags.
Assuming x86_64 kernel build, unsigned long is 64-bit and unsigned int and
int are 32-bit wide. This setup causes the following mishap during the &=
~VM_MERGEABLE assignment.
VM_MERGEABLE is a 32-bit constant of type unsigned int, 0x8000'0000.
After ~ is applied, it becomes 0x7fff'ffff unsigned int, which is then
promoted to unsigned long before the & operation. This promotion fills
upper 32 bits with leading 0s, as we're doing unsigned conversion (and
even for a signed conversion, this wouldn't help as the leading bit is 0).
& operation thus ends up AND-ing vm_flags with 0x0000'0000'7fff'ffff
instead of intended 0xffff'ffff'7fff'ffff and hence accidentally clears
the upper 32-bits of its value.
Fix it by changing `VM_MERGEABLE` constant to unsigned long, using the
BIT() macro.
Note: other VM_* flags are not affected: This only happens to the
VM_MERGEABLE flag, as the other VM_* flags are all constants of type int
and after ~ operation, they end up with leading 1 and are thus converted
to unsigned long with leading 1s.
Note 2:
After commit 31defc3b01d9 ("userfaultfd: remove (VM_)BUG_ON()s"), this is
no longer a kernel BUG, but a WARNING at the same place:
[ 45.595973] WARNING: CPU: 1 PID: 2474 at mm/userfaultfd.c:2067
but the root-cause (flag-drop) remains the same.
[akpm(a)linux-foundation.org: rust bindgen wasn't able to handle BIT(), from Miguel]
Link: https://lore.kernel.org/oe-kbuild-all/202510030449.VfSaAjvd-lkp@intel.com/
Link: https://lkml.kernel.org/r/20251001090353.57523-2-acsjakub@amazon.de
Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode")
Signed-off-by: Jakub Acs <acsjakub(a)amazon.de>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: SeongJae Park <sj(a)kernel.org>
Tested-by: Alice Ryhl <aliceryhl(a)google.com>
Tested-by: Miguel Ojeda <miguel.ojeda.sandonis(a)gmail.com>
Cc: Xu Xin <xu.xin16(a)zte.com.cn>
Cc: Chengming Zhou <chengming.zhou(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
[ acsjakub: drop rust-compatibility change (no rust in 5.10) ]
Signed-off-by: Jakub Acs <acsjakub(a)amazon.de>
---
Why sending to stable version from before "fixes"?
In the original patch, I set fixes tag to the change that allows the
panic to manifest, not to the one that is real root-cause of the
problem.
The change that introduced the root-cause of the problem is:
f8af4da3b4c1 ("ksm: the mm interface to ksm"), as pointed out by
Vlastimil in [1].
Hence, as the older kernels can be affected by the flag-drop as well,
backport to older kernels.
[1]: https://lore.kernel.org/all/13c7242e-3a40-469b-9e99-8a65a21449bb@suse.cz/
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e168d87d6f2e..4787d39bbad4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -296,7 +296,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
-#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#define VM_MERGEABLE BIT(31) /* KSM may merge identical pages */
#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597