The mode setting logic in ad7816_store_mode was reversed due to
incorrect handling of the strcmp return value. strcmp returns 0 on
match, so the `if (strcmp(buf, "full"))` block executed when the
input was not "full".
This resulted in "full" setting the mode to AD7816_PD (power-down) and
other inputs setting it to AD7816_FULL.
Fix this by checking it against 0 to correctly check for "full" and
"power-down", mapping them to AD7816_FULL and AD7816_PD respectively.
Fixes: 7924425db04a ("staging: iio: adc: new driver for AD7816 devices")
Cc: stable(a)vger.kernel.org
Signed-off-by: Gabriel Shahrouzi <gshahrouzi(a)gmail.com>
---
Changes since v3:
- Tag stable(a)vger.kernel.org instead of an email CC
- Use the correct version for patch
Changes since v2:
- Add fixes tag that references commit that introduced the bug.
- Replace sysfs_streq with strcmp.
---
drivers/staging/iio/adc/ad7816.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/iio/adc/ad7816.c b/drivers/staging/iio/adc/ad7816.c
index 6c14d7bcdd675..081b17f498638 100644
--- a/drivers/staging/iio/adc/ad7816.c
+++ b/drivers/staging/iio/adc/ad7816.c
@@ -136,7 +136,7 @@ static ssize_t ad7816_store_mode(struct device *dev,
struct iio_dev *indio_dev = dev_to_iio_dev(dev);
struct ad7816_chip_info *chip = iio_priv(indio_dev);
- if (strcmp(buf, "full")) {
+ if (strcmp(buf, "full") == 0) {
gpiod_set_value(chip->rdwr_pin, 1);
chip->mode = AD7816_FULL;
} else {
--
2.43.0
The mode setting logic in ad7816_store_mode was reversed due to
incorrect handling of the strcmp return value. strcmp returns 0 on
match, so the `if (strcmp(buf, "full"))` block executed when the
input was not "full".
This resulted in "full" setting the mode to AD7816_PD (power-down) and
other inputs setting it to AD7816_FULL.
Fix this by checking it against 0 to correctly check for "full" and
"power-down", mapping them to AD7816_FULL and AD7816_PD respectively.
Fixes: 7924425db04a ("staging: iio: adc: new driver for AD7816 devices")
Signed-off-by: Gabriel Shahrouzi <gshahrouzi(a)gmail.com>
---
Changes since v2:
- Add fixes tag that references commit that introduced the bug.
- Replace sysfs_streq with strcmp.
---
drivers/staging/iio/adc/ad7816.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/iio/adc/ad7816.c b/drivers/staging/iio/adc/ad7816.c
index 6c14d7bcdd675..081b17f498638 100644
--- a/drivers/staging/iio/adc/ad7816.c
+++ b/drivers/staging/iio/adc/ad7816.c
@@ -136,7 +136,7 @@ static ssize_t ad7816_store_mode(struct device *dev,
struct iio_dev *indio_dev = dev_to_iio_dev(dev);
struct ad7816_chip_info *chip = iio_priv(indio_dev);
- if (strcmp(buf, "full")) {
+ if (strcmp(buf, "full") == 0) {
gpiod_set_value(chip->rdwr_pin, 1);
chip->mode = AD7816_FULL;
} else {
--
2.43.0
If a new reset event appears before the previous one has been
processed, the device can get stuck into a reset loop. This happens
rarely, but blocks the device when it does, and floods the log with
messages like the following:
lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
The only bit that the driver pays attention to in the interrupt data
is "link was reset". If there's a flapping status bit in that endpoint
data (such as if PHY negotiation needs a few tries to get a stable
link), polling at a slower rate allows the state to settle.
This is a simplified version of a patch that's been in the Raspberry
Pi downstream kernel since their 4.14 branch, see also:
https://github.com/raspberrypi/linux/issues/2447
Signed-off-by: Fiona Klute <fiona.klute(a)gmx.de>
Cc: kernel-list(a)raspberrypi.com
Cc: stable(a)vger.kernel.org
---
For the stable crew: I've *tested* the patch with 6.12.7 and 6.13.5 on
a Revolution Pi Connect 4 (Raspberry Pi CM4 based device with built-in
LAN7800 as second ethernet port), according to the linked issue for
the RPi downstream kernel the problem should be present in all
maintained longterm kernel versions, too (based on how long they've
carried a patch).
drivers/net/usb/lan78xx.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index a91bf9c7e31d..7bf01a31a932 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -173,6 +173,12 @@
#define INT_EP_GPIO_1 (1)
#define INT_EP_GPIO_0 (0)
+/* highspeed device, so polling interval is in microframes (eight per
+ * millisecond)
+ */
+#define INT_URB_MICROFRAMES_PER_MS 8
+#define MIN_INT_URB_INTERVAL_MS 8
+
static const char lan78xx_gstrings[][ETH_GSTRING_LEN] = {
"RX FCS Errors",
"RX Alignment Errors",
@@ -4527,7 +4533,11 @@ static int lan78xx_probe(struct usb_interface *intf,
if (ret < 0)
goto out4;
- period = ep_intr->desc.bInterval;
+ period = max(ep_intr->desc.bInterval,
+ MIN_INT_URB_INTERVAL_MS * INT_URB_MICROFRAMES_PER_MS);
+ dev_info(&intf->dev,
+ "interrupt urb period set to %d, bInterval is %d\n",
+ period, ep_intr->desc.bInterval);
maxp = usb_maxpacket(dev->udev, dev->pipe_intr);
dev->urb_intr = usb_alloc_urb(0, GFP_KERNEL);
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
--
2.47.2
Starting with Rust 1.86.0 (see upstream commit b151b513ba2b ("Insert null
checks for pointer dereferences when debug assertions are enabled") [1]),
under some kernel configurations with `CONFIG_RUST_DEBUG_ASSERTIONS=y`,
one may trigger a new `objtool` warning:
rust/kernel.o: warning: objtool: _R..._6kernel9workqueue6system()
falls through to next function _R...9workqueue14system_highpri()
due to a call to the `noreturn` symbol:
core::panicking::panic_null_pointer_dereference
Thus add it to the list so that `objtool` knows it is actually `noreturn`.
See commit 56d680dd23c3 ("objtool/rust: list `noreturn` Rust functions")
for more details.
Cc: stable(a)vger.kernel.org # Needed in 6.12.y and later (Rust is pinned in older LTSs).
Fixes: 56d680dd23c3 ("objtool/rust: list `noreturn` Rust functions")
Link: https://github.com/rust-lang/rust/commit/b151b513ba2b65c7506ec1a80f2712bbd0… [1]
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
---
tools/objtool/check.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 4a1f6c3169b3..67006eeb30c8 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -225,6 +225,7 @@ static bool is_rust_noreturn(const struct symbol *func)
str_ends_with(func->name, "_4core9panicking14panic_nounwind") ||
str_ends_with(func->name, "_4core9panicking18panic_bounds_check") ||
str_ends_with(func->name, "_4core9panicking19assert_failed_inner") ||
+ str_ends_with(func->name, "_4core9panicking30panic_null_pointer_dereference") ||
str_ends_with(func->name, "_4core9panicking36panic_misaligned_pointer_dereference") ||
strstr(func->name, "_4core9panicking13assert_failed") ||
strstr(func->name, "_4core9panicking11panic_const24panic_const_") ||
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
--
2.49.0
If KASAN is enabled, and one runs in a clean repository e.g.:
make LLVM=1 prepare
make LLVM=1 prepare
Then the Rust code gets rebuilt, which should not happen.
The reason is some of the LLVM KASAN `rustc` flags are added in the
second run:
-Cllvm-args=-asan-instrumentation-with-call-threshold=10000
-Cllvm-args=-asan-stack=0
-Cllvm-args=-asan-globals=1
-Cllvm-args=-asan-kernel-mem-intrinsic-prefix=1
Further runs do not rebuild Rust because the flags do not change anymore.
Rebuilding like that in the second run is bad, even if this just happens
with KASAN enabled, but missing flags in the first one is even worse.
The root issue is that we pass, for some architectures and for the moment,
a generated `target.json` file. That file is not ready by the time `rustc`
gets called for the flag test, and thus the flag test fails just because
the file is not available, e.g.:
$ ... --target=./scripts/target.json ... -Cllvm-args=...
error: target file "./scripts/target.json" does not exist
There are a few approaches we could take here to solve this. For instance,
we could ensure that every time that the config is rebuilt, we regenerate
the file and recompute the flags. Or we could use the LLVM version to
check for these flags, instead of testing the flag (which may have other
advantages, such as allowing us to detect renames on the LLVM side).
However, it may be easier than that: `rustc` is aware of the `-Cllvm-args`
regardless of the `--target` (e.g. I checked that the list printed
is the same, plus that I can check for these flags even if I pass
a completely unrelated target), and thus we can just eliminate the
dependency completely.
Thus filter out the target.
This does mean that `rustc-option` cannot be used to test a flag that
requires the right target, but we don't have other users yet, it is a
minimal change and we want to get rid of custom targets in the future.
We could only filter in the case `target.json` is used, to make it work
in more cases, but then it would be harder to notice that it may not
work in a couple architectures.
Cc: Matthew Maurer <mmaurer(a)google.com>
Cc: Sami Tolvanen <samitolvanen(a)google.com>
Cc: stable(a)vger.kernel.org
Fixes: e3117404b411 ("kbuild: rust: Enable KASAN support")
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
---
By the way, I noticed that we are not getting `asan-instrument-allocas` enabled
in neither C nor Rust -- upstream LLVM renamed it in commit 8176ee9b5dda ("[asan]
Rename asan-instrument-allocas -> asan-instrument-dynamic-allocas")). But it
happened a very long time ago (9 years ago), and the addition in the kernel
is fairly old too, in 342061ee4ef3 ("kasan: support alloca() poisoning").
I assume it should either be renamed or removed? Happy to send a patch if so.
scripts/Makefile.compiler | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/scripts/Makefile.compiler b/scripts/Makefile.compiler
index 8956587b8547..7ed7f92a7daa 100644
--- a/scripts/Makefile.compiler
+++ b/scripts/Makefile.compiler
@@ -80,7 +80,7 @@ ld-option = $(call try-run, $(LD) $(KBUILD_LDFLAGS) $(1) -v,$(1),$(2),$(3))
# TODO: remove RUSTC_BOOTSTRAP=1 when we raise the minimum GNU Make version to 4.4
__rustc-option = $(call try-run,\
echo '#![allow(missing_docs)]#![feature(no_core)]#![no_core]' | RUSTC_BOOTSTRAP=1\
- $(1) --sysroot=/dev/null $(filter-out --sysroot=/dev/null,$(2)) $(3)\
+ $(1) --sysroot=/dev/null $(filter-out --sysroot=/dev/null --target=%,$(2)) $(3)\
--crate-type=rlib --out-dir=$(TMPOUT) --emit=obj=- - >/dev/null,$(3),$(4))
# rustc-option
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
--
2.49.0
Typically HDMI to MIPI CSI-2 bridges have a pin to signal image data is
being received. On the host side this is wired to a GPIO for polling or
interrupts. This includes the Lontium HDMI to MIPI CSI-2 bridges
lt6911uxe and lt6911uxc.
The GPIO "hpd" is used already by other HDMI to CSI-2 bridges, use it
here as well.
Signed-off-by: Dongcheng Yan <dongcheng.yan(a)intel.com>
---
drivers/platform/x86/intel/int3472/common.h | 1 +
drivers/platform/x86/intel/int3472/discrete.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/drivers/platform/x86/intel/int3472/common.h b/drivers/platform/x86/intel/int3472/common.h
index 145dec66df64..db4cd3720e24 100644
--- a/drivers/platform/x86/intel/int3472/common.h
+++ b/drivers/platform/x86/intel/int3472/common.h
@@ -22,6 +22,7 @@
#define INT3472_GPIO_TYPE_POWER_ENABLE 0x0b
#define INT3472_GPIO_TYPE_CLK_ENABLE 0x0c
#define INT3472_GPIO_TYPE_PRIVACY_LED 0x0d
+#define INT3472_GPIO_TYPE_HOTPLUG_DETECT 0x13
#define INT3472_PDEV_MAX_NAME_LEN 23
#define INT3472_MAX_SENSOR_GPIOS 3
diff --git a/drivers/platform/x86/intel/int3472/discrete.c b/drivers/platform/x86/intel/int3472/discrete.c
index 30ff8f3ea1f5..28d41b1b3809 100644
--- a/drivers/platform/x86/intel/int3472/discrete.c
+++ b/drivers/platform/x86/intel/int3472/discrete.c
@@ -186,6 +186,10 @@ static void int3472_get_con_id_and_polarity(struct acpi_device *adev, u8 *type,
*con_id = "privacy-led";
*gpio_flags = GPIO_ACTIVE_HIGH;
break;
+ case INT3472_GPIO_TYPE_HOTPLUG_DETECT:
+ *con_id = "hpd";
+ *gpio_flags = GPIO_LOOKUP_FLAGS_DEFAULT;
+ break;
case INT3472_GPIO_TYPE_POWER_ENABLE:
*con_id = "power-enable";
*gpio_flags = GPIO_ACTIVE_HIGH;
@@ -212,6 +216,7 @@ static void int3472_get_con_id_and_polarity(struct acpi_device *adev, u8 *type,
* 0x0b Power enable
* 0x0c Clock enable
* 0x0d Privacy LED
+ * 0x13 Hotplug detect
*
* There are some known platform specific quirks where that does not quite
* hold up; for example where a pin with type 0x01 (Power down) is mapped to
@@ -281,6 +286,7 @@ static int skl_int3472_handle_gpio_resources(struct acpi_resource *ares,
switch (type) {
case INT3472_GPIO_TYPE_RESET:
case INT3472_GPIO_TYPE_POWERDOWN:
+ case INT3472_GPIO_TYPE_HOTPLUG_DETECT:
ret = skl_int3472_map_gpio_to_sensor(int3472, agpio, con_id, gpio_flags);
if (ret)
err_msg = "Failed to map GPIO pin to sensor\n";
base-commit: 01c6df60d5d4ae00cd5c1648818744838bba7763
--
2.34.1
This series adds fine grained trap control in EL2 required for FEAT_PMUv3p9
registers like PMICNTR_EL0, PMICFILTR_EL0, and PMUACR_EL1 which are already
being used in the kernel. This is required to prevent their EL1 access trap
into EL2.
The following commits that enabled access into FEAT_PMUv3p9 registers have
already been merged upstream from 6.12 onwards.
d8226d8cfbaf ("perf: arm_pmuv3: Add support for Armv9.4 PMU instruction counter")
0bbff9ed8165 ("perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control")
The sysreg patches in this series are required for the final patch which
fixes the actual problem.
Changes in V2:
- Replaced [] with () for upstream commit reference across patches in response
to the following warning from Sasha
https://lore.kernel.org/stable/f1153021-846b-4fb1-8c4d-9fa813f982d3@arm.com/
Anshuman Khandual (7):
arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1
arm64/sysreg: Add register fields for HDFGRTR2_EL2
arm64/sysreg: Add register fields for HDFGWTR2_EL2
arm64/sysreg: Add register fields for HFGITR2_EL2
arm64/sysreg: Add register fields for HFGRTR2_EL2
arm64/sysreg: Add register fields for HFGWTR2_EL2
arm64/boot: Enable EL2 requirements for FEAT_PMUv3p9
Documentation/arch/arm64/booting.rst | 22 ++++++
arch/arm64/include/asm/el2_setup.h | 25 +++++++
arch/arm64/tools/sysreg | 103 +++++++++++++++++++++++++++
3 files changed, 150 insertions(+)
--
2.30.2
From: Abhishek Sahu <abhsahu(a)nvidia.com>
[ Upstream commit eadf88ecf6ac7d6a9f47a76c6055d9a1987a8991 ]
If 'vfio_pci_core_device::needs_pm_restore' is set (PCI device does
not have No_Soft_Reset bit set in its PMCSR config register), then
the current PCI state will be saved locally in
'vfio_pci_core_device::pm_save' during D0->D3hot transition and same
will be restored back during D3hot->D0 transition.
For saving the PCI state locally, pci_store_saved_state() is being
used and the pci_load_and_free_saved_state() will free the allocated
memory.
But for reset related IOCTLs, vfio driver calls PCI reset-related
API's which will internally change the PCI power state back to D0. So,
when the guest resumes, then it will get the current state as D0 and it
will skip the call to vfio_pci_set_power_state() for changing the
power state to D0 explicitly. In this case, the memory pointed by
'pm_save' will never be freed. In a malicious sequence, the state changing
to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be
run in a loop and it can cause an OOM situation.
This patch frees the earlier allocated memory first before overwriting
'pm_save' to prevent the mentioned memory leak.
Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
Signed-off-by: Abhishek Sahu <abhsahu(a)nvidia.com>
Link: https://lore.kernel.org/r/20220217122107.22434-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson(a)redhat.com>
[Minor context change fixed]
Signed-off-by: Feng Liu <Feng.Liu3(a)windriver.com>
Signed-off-by: He Zhe <Zhe.He(a)windriver.com>
---
Verified the build test.
---
drivers/vfio/pci/vfio_pci.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 57ae8b46b836..7835712f730d 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -299,6 +299,19 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
if (!ret) {
/* D3 might be unsupported via quirk, skip unless in D3 */
if (needs_save && pdev->current_state >= PCI_D3hot) {
+ /*
+ * The current PCI state will be saved locally in
+ * 'pm_save' during the D3hot transition. When the
+ * device state is changed to D0 again with the current
+ * function, then pci_store_saved_state() will restore
+ * the state and will free the memory pointed by
+ * 'pm_save'. There are few cases where the PCI power
+ * state can be changed to D0 without the involvement
+ * of the driver. For these cases, free the earlier
+ * allocated memory first before overwriting 'pm_save'
+ * to prevent the memory leak.
+ */
+ kfree(vdev->pm_save);
vdev->pm_save = pci_store_saved_state(pdev);
} else if (needs_restore) {
pci_load_and_free_saved_state(pdev, &vdev->pm_save);
--
2.34.1
The function brcmf_usb_dl_writeimage() calls the function
brcmf_usb_dl_cmd() but dose not check its return value. The
'state.state' and the 'state.bytes' are uninitialized if the
function brcmf_usb_dl_cmd() fails. It is dangerous to use
uninitialized variables in the conditions.
Add error handling for brcmf_usb_dl_cmd() to jump to error
handling path if the brcmf_usb_dl_cmd() fails and the
'state.state' and the 'state.bytes' are uninitialized.
Fixes: 71bb244ba2fd ("brcm80211: fmac: add USB support for bcm43235/6/8 chipsets")
Cc: stable(a)vger.kernel.org # v3.4+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
index 50dddac8a2ab..1c97cd777225 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
@@ -901,7 +901,9 @@ brcmf_usb_dl_writeimage(struct brcmf_usbdev_info *devinfo, u8 *fw, int fwlen)
}
/* 1) Prepare USB boot loader for runtime image */
- brcmf_usb_dl_cmd(devinfo, DL_START, &state, sizeof(state));
+ err = brcmf_usb_dl_cmd(devinfo, DL_START, &state, sizeof(state));
+ if (err)
+ goto fail;
rdlstate = le32_to_cpu(state.state);
rdlbytes = le32_to_cpu(state.bytes);
--
2.42.0.windows.2
The user can set any value to the variable ‘bo_number’, via the ioctl
command DRM_IOCTL_AMDGPU_BO_LIST. This will affect the arithmetic
expression ‘in->bo_number * in->bo_info_size’, which is prone to
overflow. Add a valid value check.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 964d0fbf6301 ("drm/amdgpu: Allow to create BO lists in CS ioctl v3")
Cc: stable(a)vger.kernel.org
Signed-off-by: Denis Arefev <arefev(a)swemel.ru>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c
index 702f6610d024..dd30d2426ff7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c
@@ -189,6 +189,9 @@ int amdgpu_bo_create_list_entry_array(struct drm_amdgpu_bo_list_in *in,
struct drm_amdgpu_bo_list_entry *info;
int r;
+ if (!in->bo_number || in->bo_number > UINT_MAX / info_size)
+ return -EINVAL;
+
info = kvmalloc_array(in->bo_number, info_size, GFP_KERNEL);
if (!info)
return -ENOMEM;
--
2.43.0
Once cdev_device_add() failed, we should use put_device() to decrement
reference count for cleanup. Or it could cause memory leak. Although
operations in err_free_ida are similar to the operations in callback
function fsi_slave_release(), put_device() is a correct handling
operation as comments require when cdev_device_add() fails.
As comment of device_add() says, 'if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 371975b0b075 ("fsi/core: Fix error paths on CFAM init")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
drivers/fsi/fsi-core.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/fsi/fsi-core.c b/drivers/fsi/fsi-core.c
index 50e8736039fe..c494fc0bd747 100644
--- a/drivers/fsi/fsi-core.c
+++ b/drivers/fsi/fsi-core.c
@@ -1084,7 +1084,8 @@ static int fsi_slave_init(struct fsi_master *master, int link, uint8_t id)
rc = cdev_device_add(&slave->cdev, &slave->dev);
if (rc) {
dev_err(&slave->dev, "Error %d creating slave device\n", rc);
- goto err_free_ida;
+ put_device(&slave->dev);
+ return rc;
}
/* Now that we have the cdev registered with the core, any fatal
@@ -1110,8 +1111,6 @@ static int fsi_slave_init(struct fsi_master *master, int link, uint8_t id)
return 0;
-err_free_ida:
- fsi_free_minor(slave->dev.devt);
err_free:
of_node_put(slave->dev.of_node);
kfree(slave);
--
2.25.1
There is a nasty regression wrt mt7921e in the last LTS series (6.12).
If your computer crashes or fails to get out of hibernation, then at the
next boot the mt7921e wifi does not work, with dmesg reporting that it
is unable to change power state from d3cold to d0.
The issue is nasty, because rebooting won't help.
The only solution that I have found to the issue is booting a 6.6
kernel. With that the wife gets alive again. If, at this point, you boot
into 6.12, everything seems to be fine again, until a boot fails and
from that moment on you are without wifi.
Working around the issue is not very discoverable. On my machine not
even a hardware reset (40s of power off button pressed) helped alone,
without going through the 6.6 kernel boot.
I think the regression was introduced with 6.12 and I remember having no
issues with previous kernels, but cannot be 100% sure.
Similarly, I do not know if the bug is with the wifi card driver itself
or with something related to PM or PCIe.
The machine on which I am experiencing the issue is an Asus ROG 14
laptop (2022 edition), with AMD CPU and GPU.
Thanks for the attention,
Sergio
Dear stable
To whom it may Concern.
We PRIVATE COMPANY SGG ENERGY TECHNOLOGY LTD, With BIN Code:
211140900115 and Office address located at Astana, Esil District,
Syganak Street, Building 54a, N.P. 142A. With authorized legalized
mandate certificate from trusted and reliable supplier refinery of
petroleum commodities. Attached kindly find our official offer for
your review and if any of our give procedure is acceptable to any
ready buyer’s management, do kindly proceed to issue an official ICPO
to commence transaction.
And further your management is in need of assistance and further
clarification do not hesitate to contact our management.
Thank You.
---
PRIVATE COMPANY SGG ENERGY TECHNOLOGY LTD.
Address: Astana, Esil District, Syganak Street, Building 54a, N.P.
142A
BIN: 211140900115
Email:kazpetro@sggenergytech.com
Tel/WhatsApp: +7 775 540 3537
Director: Dou Shengjun
The function dst_ci_command() calls the function dst_comm_init()
but does not handle the error if the init fails. A proper implementation
can be found in dst_command() in /source/drivers/media/pci/bt8xx/dst.c.
Add error handling to the dst_comm_init(). Print an error message via
dprintk(), and jump to the 'error' label if the function fails.
Fixes: 50b215a05878 ("[PATCH] dvb: DST: reorganize Twinhan DST driver to support CI")
Cc: stable(a)vger.kernel.org # v2.6+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
drivers/media/pci/bt8xx/dst_ca.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/media/pci/bt8xx/dst_ca.c b/drivers/media/pci/bt8xx/dst_ca.c
index a9cc6e7a57f9..a743f7653fdd 100644
--- a/drivers/media/pci/bt8xx/dst_ca.c
+++ b/drivers/media/pci/bt8xx/dst_ca.c
@@ -66,7 +66,10 @@ static int dst_ci_command(struct dst_state* state, u8 * data, u8 *ca_string, u8
u8 reply;
mutex_lock(&state->dst_mutex);
- dst_comm_init(state);
+ if (dst_comm_init(state) < 0) {
+ dprintk(verbose, DST_CA_ERROR, 1, "DST initialization failed.");
+ goto error;
+ }
msleep(65);
if (write_dst(state, data, len)) {
--
2.42.0.windows.2
This series adds fine grained trap control in EL2 required for FEAT_PMUv3p9
registers like PMICNTR_EL0, PMICFILTR_EL0, and PMUACR_EL1 which are already
being used in the kernel. This is required to prevent their EL1 access trap
into EL2.
The following commits that enabled access into FEAT_PMUv3p9 registers have
already been merged upstream from 6.12 onwards.
d8226d8cfbaf ("perf: arm_pmuv3: Add support for Armv9.4 PMU instruction counter")
0bbff9ed8165 ("perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control")
The sysreg patches in this series are required for the final patch which
fixes the actual problem.
Anshuman Khandual (7):
arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1
arm64/sysreg: Add register fields for HDFGRTR2_EL2
arm64/sysreg: Add register fields for HDFGWTR2_EL2
arm64/sysreg: Add register fields for HFGITR2_EL2
arm64/sysreg: Add register fields for HFGRTR2_EL2
arm64/sysreg: Add register fields for HFGWTR2_EL2
arm64/boot: Enable EL2 requirements for FEAT_PMUv3p9
Rob Herring (Arm) (1):
perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
Documentation/arch/arm64/booting.rst | 22 ++++++
arch/arm/include/asm/arm_pmuv3.h | 6 ++
arch/arm64/include/asm/arm_pmuv3.h | 10 +++
arch/arm64/include/asm/el2_setup.h | 25 ++++++
arch/arm64/tools/sysreg | 111 +++++++++++++++++++++++++++
drivers/perf/arm_pmuv3.c | 29 ++++---
include/linux/perf/arm_pmuv3.h | 1 +
7 files changed, 194 insertions(+), 10 deletions(-)
--
2.30.2
The gpiod_direction_output_raw_commit() function checks if any setter
callback is present before doing anything. As the new GPIO setters with
return values were introduced, make this check also succeed if one is
present.
Fixes: 98ce1eb1fd87 ("gpiolib: introduce gpio_chip setters that return values")
Signed-off-by: Mathieu Dubois-Briand <mathieu.dubois-briand(a)bootlin.com>
---
Changes in v2:
- Add Fixes: tag and Cc: to stable
- Link to v1: https://lore.kernel.org/r/20250411-mdb-gpiolib-setters-fix-v1-1-dea302ab744…
---
drivers/gpio/gpiolib.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index b8197502a5ac..cd4fecbb41f2 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -2879,7 +2879,7 @@ static int gpiod_direction_output_raw_commit(struct gpio_desc *desc, int value)
* output-only, but if there is then not even a .set() operation it
* is pretty tricky to drive the output line.
*/
- if (!guard.gc->set && !guard.gc->direction_output) {
+ if (!guard.gc->set && !guard.gc->set_rv && !guard.gc->direction_output) {
gpiod_warn(desc,
"%s: missing set() and direction_output() operations\n",
__func__);
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250411-mdb-gpiolib-setters-fix-b87976992070
Best regards,
--
Mathieu Dubois-Briand <mathieu.dubois-briand(a)bootlin.com>
From: Abhinav Kumar <quic_abhinavk(a)quicinc.com>
[ Upstream commit 7e182cb4f5567f53417b762ec0d679f0b6f0039d ]
In certain use-cases, a CRTC could switch between two encoders
and because the mode being programmed on the CRTC remains
the same during this switch, the CRTC's mode_changed remains false.
In such cases, the encoder's mode_set also gets skipped.
Skipping mode_set on the encoder for such cases could cause an issue
because even though the same CRTC mode was being used, the encoder
type could have changed like the CRTC could have switched from a
real time encoder to a writeback encoder OR vice-versa.
Allow encoder's mode_set to happen even when connectors changed on a
CRTC and not just when the mode changed.
Signed-off-by: Abhinav Kumar <quic_abhinavk(a)quicinc.com>
Signed-off-by: Jessica Zhang <quic_jesszhan(a)quicinc.com>
Reviewed-by: Maxime Ripard <mripard(a)kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241211-abhinavk-modeset-fix…
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/drm_atomic_helper.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index d91d6c063a1d2..70d97a7fc6864 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1225,7 +1225,7 @@ crtc_set_mode(struct drm_device *dev, struct drm_atomic_state *old_state)
mode = &new_crtc_state->mode;
adjusted_mode = &new_crtc_state->adjusted_mode;
- if (!new_crtc_state->mode_changed)
+ if (!new_crtc_state->mode_changed && !new_crtc_state->connectors_changed)
continue;
DRM_DEBUG_ATOMIC("modeset on [ENCODER:%d:%s]\n",
--
2.39.5
Previously, commit ed129ec9057f ("KVM: x86: forcibly leave nested mode
on vCPU reset") addressed an issue where a triple fault occurring in
nested mode could lead to use-after-free scenarios. However, the commit
did not handle the analogous situation for System Management Mode (SMM).
This omission results in triggering a WARN when a vCPU reset occurs
while still in SMM mode, due to the check in kvm_vcpu_reset(). This
situation was reprodused using Syzkaller by:
1) Creating a KVM VM and vCPU
2) Sending a KVM_SMI ioctl to explicitly enter SMM
3) Executing invalid instructions causing consecutive exceptions and
eventually a triple fault
The issue manifests as follows:
WARNING: CPU: 0 PID: 25506 at arch/x86/kvm/x86.c:12112
kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
Modules linked in:
CPU: 0 PID: 25506 Comm: syz-executor.0 Not tainted
6.1.130-syzkaller-00157-g164fe5dde9b6 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
RIP: 0010:kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
Call Trace:
<TASK>
shutdown_interception+0x66/0xb0 arch/x86/kvm/svm/svm.c:2136
svm_invoke_exit_handler+0x110/0x530 arch/x86/kvm/svm/svm.c:3395
svm_handle_exit+0x424/0x920 arch/x86/kvm/svm/svm.c:3457
vcpu_enter_guest arch/x86/kvm/x86.c:10959 [inline]
vcpu_run+0x2c43/0x5a90 arch/x86/kvm/x86.c:11062
kvm_arch_vcpu_ioctl_run+0x50f/0x1cf0 arch/x86/kvm/x86.c:11283
kvm_vcpu_ioctl+0x570/0xf00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4122
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Architecturally, hardware CPUs exit SMM upon receiving a triple
fault as part of a hardware reset. To reflect this behavior and
avoid triggering the WARN, this patch explicitly calls
kvm_smm_changed(vcpu, false) in the SVM-specific shutdown_interception()
handler prior to resetting the vCPU.
The initial version of this patch attempted to address the issue by calling
kvm_smm_changed() inside kvm_vcpu_reset(). However, this approach was not
architecturally accurate, as INIT is blocked during SMM and SMM should not
be exited implicitly during a generic vCPU reset. This version moves the
fix into shutdown_interception() for SVM, where the triple fault is
actually handled, and where exiting SMM explicitly is appropriate.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: ed129ec9057f ("KVM: x86: forcibly leave nested mode on vCPU reset")
Cc: stable(a)vger.kernel.org
Suggested-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Mikhail Lobanov <m.lobanov(a)rosa.ru>
---
v2: Move SMM exit from kvm_vcpu_reset() to SVM's shutdown_interception(),
per suggestion from Sean Christopherson <seanjc(a)google.com>.
arch/x86/kvm/svm/svm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d5d0c5c3300b..34a002a87c28 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2231,6 +2231,8 @@ static int shutdown_interception(struct kvm_vcpu *vcpu)
*/
if (!sev_es_guest(vcpu->kvm)) {
clear_page(svm->vmcb);
+ if (is_smm(vcpu))
+ kvm_smm_changed(vcpu, false);
kvm_vcpu_reset(vcpu, true);
}
--
2.47.2
From: Jeff Layton <jlayton(a)kernel.org>
[ Upstream commit 930b64ca0c511521f0abdd1d57ce52b2a6e3476b ]
Currently, nfsd_proc_stat_init() ignores the return value of
svc_proc_register(). If the procfile creation fails, then the kernel
will WARN when it tries to remove the entry later.
Fix nfsd_proc_stat_init() to return the same type of pointer as
svc_proc_register(), and fix up nfsd_net_init() to check that and fail
the nfsd_net construction if it occurs.
svc_proc_register() can fail if the dentry can't be allocated, or if an
identical dentry already exists. The second case is pretty unlikely in
the nfsd_net construction codepath, so if this happens, return -ENOMEM.
Reported-by: syzbot+e34ad04f27991521104c(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-nfs/67a47501.050a0220.19061f.05f9.GAE@google.…
Cc: stable(a)vger.kernel.org # v6.9
Signed-off-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
---
fs/nfsd/nfsctl.c | 9 ++++++++-
fs/nfsd/stats.c | 4 ++--
fs/nfsd/stats.h | 2 +-
3 files changed, 11 insertions(+), 4 deletions(-)
I did not have any problem cherry-picking 930b64 onto v6.12.23. This
built and ran some simple NFSD tests in my lab.
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index e83629f39604..2e835e7c107e 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -2244,8 +2244,14 @@ static __net_init int nfsd_net_init(struct net *net)
NFSD_STATS_COUNTERS_NUM);
if (retval)
goto out_repcache_error;
+
memset(&nn->nfsd_svcstats, 0, sizeof(nn->nfsd_svcstats));
nn->nfsd_svcstats.program = &nfsd_programs[0];
+ if (!nfsd_proc_stat_init(net)) {
+ retval = -ENOMEM;
+ goto out_proc_error;
+ }
+
for (i = 0; i < sizeof(nn->nfsd_versions); i++)
nn->nfsd_versions[i] = nfsd_support_version(i);
for (i = 0; i < sizeof(nn->nfsd4_minorversions); i++)
@@ -2255,12 +2261,13 @@ static __net_init int nfsd_net_init(struct net *net)
nfsd4_init_leases_net(nn);
get_random_bytes(&nn->siphash_key, sizeof(nn->siphash_key));
seqlock_init(&nn->writeverf_lock);
- nfsd_proc_stat_init(net);
#if IS_ENABLED(CONFIG_NFS_LOCALIO)
INIT_LIST_HEAD(&nn->local_clients);
#endif
return 0;
+out_proc_error:
+ percpu_counter_destroy_many(nn->counter, NFSD_STATS_COUNTERS_NUM);
out_repcache_error:
nfsd_idmap_shutdown(net);
out_idmap_error:
diff --git a/fs/nfsd/stats.c b/fs/nfsd/stats.c
index bb22893f1157..f7eaf95e20fc 100644
--- a/fs/nfsd/stats.c
+++ b/fs/nfsd/stats.c
@@ -73,11 +73,11 @@ static int nfsd_show(struct seq_file *seq, void *v)
DEFINE_PROC_SHOW_ATTRIBUTE(nfsd);
-void nfsd_proc_stat_init(struct net *net)
+struct proc_dir_entry *nfsd_proc_stat_init(struct net *net)
{
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
- svc_proc_register(net, &nn->nfsd_svcstats, &nfsd_proc_ops);
+ return svc_proc_register(net, &nn->nfsd_svcstats, &nfsd_proc_ops);
}
void nfsd_proc_stat_shutdown(struct net *net)
diff --git a/fs/nfsd/stats.h b/fs/nfsd/stats.h
index 04aacb6c36e2..e4efb0e4e56d 100644
--- a/fs/nfsd/stats.h
+++ b/fs/nfsd/stats.h
@@ -10,7 +10,7 @@
#include <uapi/linux/nfsd/stats.h>
#include <linux/percpu_counter.h>
-void nfsd_proc_stat_init(struct net *net);
+struct proc_dir_entry *nfsd_proc_stat_init(struct net *net);
void nfsd_proc_stat_shutdown(struct net *net);
static inline void nfsd_stats_rc_hits_inc(struct nfsd_net *nn)
--
2.47.0
From: Eder Zulian <ezulian(a)redhat.com>
commit 7f4ec77f3fee41dd6a41f03a40703889e6e8f7b2 upstream.
Initialize 'new_off' and 'pad_bits' to 0 and 'pad_type' to NULL in
btf_dump_emit_bit_padding to prevent compiler warnings/errors which are
observed when compiling with 'EXTRA_CFLAGS=-g -Og' options, but do not
happen when compiling with current default options.
For example, when compiling libbpf with
$ make "EXTRA_CFLAGS=-g -Og" -C tools/lib/bpf/ clean all
Clang version 17.0.6 and GCC 13.3.1 fail to compile btf_dump.c due to
following errors:
btf_dump.c: In function ‘btf_dump_emit_bit_padding’:
btf_dump.c:903:42: error: ‘new_off’ may be used uninitialized [-Werror=maybe-uninitialized]
903 | if (new_off > cur_off && new_off <= next_off) {
| ~~~~~~~~^~~~~~~~~~~
btf_dump.c:870:13: note: ‘new_off’ was declared here
870 | int new_off, pad_bits, bits, i;
| ^~~~~~~
btf_dump.c:917:25: error: ‘pad_type’ may be used uninitialized [-Werror=maybe-uninitialized]
917 | btf_dump_printf(d, "\n%s%s: %d;", pfx(lvl), pad_type,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
918 | in_bitfield ? new_off - cur_off : 0);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
btf_dump.c:871:21: note: ‘pad_type’ was declared here
871 | const char *pad_type;
| ^~~~~~~~
btf_dump.c:930:20: error: ‘pad_bits’ may be used uninitialized [-Werror=maybe-uninitialized]
930 | if (bits == pad_bits) {
| ^
btf_dump.c:870:22: note: ‘pad_bits’ was declared here
870 | int new_off, pad_bits, bits, i;
| ^~~~~~~~
cc1: all warnings being treated as errors
Signed-off-by: Eder Zulian <ezulian(a)redhat.com>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Acked-by: Jiri Olsa <jolsa(a)kernel.org>
Link: https://lore.kernel.org/bpf/20241022172329.3871958-3-ezulian@redhat.com
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
Signed-off-by: Xiangyu Chen <xiangyu.chen(a)windriver.com>
---
Build test passed.
---
tools/lib/bpf/btf_dump.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c
index 0a7327541c17..46cce18c8308 100644
--- a/tools/lib/bpf/btf_dump.c
+++ b/tools/lib/bpf/btf_dump.c
@@ -867,8 +867,8 @@ static void btf_dump_emit_bit_padding(const struct btf_dump *d,
} pads[] = {
{"long", d->ptr_sz * 8}, {"int", 32}, {"short", 16}, {"char", 8}
};
- int new_off, pad_bits, bits, i;
- const char *pad_type;
+ int new_off = 0, pad_bits = 0, bits, i;
+ const char *pad_type = NULL;
if (cur_off >= next_off)
return; /* no gap */
--
2.34.1
From: Xiaxi Shen <shenxiaxi26(a)gmail.com>
commit 0ce160c5bdb67081a62293028dc85758a8efb22a upstream.
Syzbot has found an ODEBUG bug in ext4_fill_super
The del_timer_sync function cancels the s_err_report timer,
which reminds about filesystem errors daily. We should
guarantee the timer is no longer active before kfree(sbi).
When filesystem mounting fails, the flow goes to failed_mount3,
where an error occurs when ext4_stop_mmpd is called, causing
a read I/O failure. This triggers the ext4_handle_error function
that ultimately re-arms the timer,
leaving the s_err_report timer active before kfree(sbi) is called.
Fix the issue by canceling the s_err_report timer after calling ext4_stop_mmpd.
Signed-off-by: Xiaxi Shen <shenxiaxi26(a)gmail.com>
Reported-and-tested-by: syzbot+59e0101c430934bc9a36(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=59e0101c430934bc9a36
Link: https://patch.msgid.link/20240715043336.98097-1-shenxiaxi26@gmail.com
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
Cc: stable(a)kernel.org
[Minor context change fixed]
Signed-off-by: Xiangyu Chen <xiangyu.chen(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Verified the build test.
---
fs/ext4/super.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 126b582d85fc..8f33b6db8aae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5085,8 +5085,8 @@ failed_mount9: __maybe_unused
failed_mount3:
/* flush s_error_work before sbi destroy */
flush_work(&sbi->s_error_work);
- del_timer_sync(&sbi->s_err_report);
ext4_stop_mmpd(sbi);
+ del_timer_sync(&sbi->s_err_report);
failed_mount2:
rcu_read_lock();
group_desc = rcu_dereference(sbi->s_group_desc);
--
2.34.1
From: Andrii Nakryiko <andrii(a)kernel.org>
[ Upstream commit bc27c52eea189e8f7492d40739b7746d67b65beb ]
We use map->freeze_mutex to prevent races between map_freeze() and
memory mapping BPF map contents with writable permissions. The way we
naively do this means we'll hold freeze_mutex for entire duration of all
the mm and VMA manipulations, which is completely unnecessary. This can
potentially also lead to deadlocks, as reported by syzbot in [0].
So, instead, hold freeze_mutex only during writeability checks, bump
(proactively) "write active" count for the map, unlock the mutex and
proceed with mmap logic. And only if something went wrong during mmap
logic, then undo that "write active" counter increment.
[0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: syzbot+4dc041c686b7c816a71e(a)syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Link: https://lore.kernel.org/r/20250129012246.1515826-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: David Sauerwein <dssauerw(a)amazon.de>
---
kernel/bpf/syscall.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7fce461eae10..6f309248f13f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -654,7 +654,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct bpf_map *map = filp->private_data;
- int err;
+ int err = 0;
if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
map_value_has_timer(map))
@@ -679,7 +679,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
err = -EACCES;
goto out;
}
+ bpf_map_write_active_inc(map);
}
+out:
+ mutex_unlock(&map->freeze_mutex);
+ if (err)
+ return err;
/* set default open/close callbacks */
vma->vm_ops = &bpf_map_default_vmops;
@@ -690,13 +695,11 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
vma->vm_flags &= ~VM_MAYWRITE;
err = map->ops->map_mmap(map, vma);
- if (err)
- goto out;
+ if (err) {
+ if (vma->vm_flags & VM_WRITE)
+ bpf_map_write_active_dec(map);
+ }
- if (vma->vm_flags & VM_MAYWRITE)
- bpf_map_write_active_inc(map);
-out:
- mutex_unlock(&map->freeze_mutex);
return err;
}
--
2.47.1
For stable-6.6.
("x86/xen: fix memblock_reserve() usage on PVH") is the fix, but it
depends on ("x86/xen: move xen_reserve_extra_memory()") as a
prerequisite. Both need fixups as they predate the removal of the Xen
hypercall_page.
Roger Pau Monne (2):
x86/xen: move xen_reserve_extra_memory()
x86/xen: fix memblock_reserve() usage on PVH
arch/x86/include/asm/xen/hypervisor.h | 5 --
arch/x86/platform/pvh/enlighten.c | 3 -
arch/x86/xen/enlighten_pvh.c | 93 +++++++++++++++------------
3 files changed, 51 insertions(+), 50 deletions(-)
--
2.49.0
From: Xu Kuohai <xukuohai(a)huawei.com>
[ Upstream commit 28ead3eaabc16ecc907cfb71876da028080f6356 ]
bpf progs can be attached to kernel functions, and the attached functions
can take different parameters or return different return values. If
prog attached to one kernel function tail calls prog attached to another
kernel function, the ctx access or return value verification could be
bypassed.
For example, if prog1 is attached to func1 which takes only 1 parameter
and prog2 is attached to func2 which takes two parameters. Since verifier
assumes the bpf ctx passed to prog2 is constructed based on func2's
prototype, verifier allows prog2 to access the second parameter from
the bpf ctx passed to it. The problem is that verifier does not prevent
prog1 from passing its bpf ctx to prog2 via tail call. In this case,
the bpf ctx passed to prog2 is constructed from func1 instead of func2,
that is, the assumption for ctx access verification is bypassed.
Another example, if BPF LSM prog1 is attached to hook file_alloc_security,
and BPF LSM prog2 is attached to hook bpf_lsm_audit_rule_known. Verifier
knows the return value rules for these two hooks, e.g. it is legal for
bpf_lsm_audit_rule_known to return positive number 1, and it is illegal
for file_alloc_security to return positive number. So verifier allows
prog2 to return positive number 1, but does not allow prog1 to return
positive number. The problem is that verifier does not prevent prog1
from calling prog2 via tail call. In this case, prog2's return value 1
will be used as the return value for prog1's hook file_alloc_security.
That is, the return value rule is bypassed.
This patch adds restriction for tail call to prevent such bypasses.
Signed-off-by: Xu Kuohai <xukuohai(a)huawei.com>
Link: https://lore.kernel.org/r/20240719110059.797546-4-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
[Minor conflict resolved due to code context change.]
Signed-off-by: Jianqi Ren <jianqi.ren.cn(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Verified the build test
---
include/linux/bpf.h | 1 +
kernel/bpf/core.c | 19 +++++++++++++++++--
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2189c0d18fa7..e9c1338851e3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -250,6 +250,7 @@ struct bpf_map {
* same prog type, JITed flag and xdp_has_frags flag.
*/
struct {
+ const struct btf_type *attach_func_proto;
spinlock_t lock;
enum bpf_prog_type type;
bool jited;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 83b416af4da1..c281f5b8705e 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2121,6 +2121,7 @@ bool bpf_prog_map_compatible(struct bpf_map *map,
{
enum bpf_prog_type prog_type = resolve_prog_type(fp);
bool ret;
+ struct bpf_prog_aux *aux = fp->aux;
if (fp->kprobe_override)
return false;
@@ -2132,12 +2133,26 @@ bool bpf_prog_map_compatible(struct bpf_map *map,
*/
map->owner.type = prog_type;
map->owner.jited = fp->jited;
- map->owner.xdp_has_frags = fp->aux->xdp_has_frags;
+ map->owner.xdp_has_frags = aux->xdp_has_frags;
+ map->owner.attach_func_proto = aux->attach_func_proto;
ret = true;
} else {
ret = map->owner.type == prog_type &&
map->owner.jited == fp->jited &&
- map->owner.xdp_has_frags == fp->aux->xdp_has_frags;
+ map->owner.xdp_has_frags == aux->xdp_has_frags;
+ if (ret &&
+ map->owner.attach_func_proto != aux->attach_func_proto) {
+ switch (prog_type) {
+ case BPF_PROG_TYPE_TRACING:
+ case BPF_PROG_TYPE_LSM:
+ case BPF_PROG_TYPE_EXT:
+ case BPF_PROG_TYPE_STRUCT_OPS:
+ ret = false;
+ break;
+ default:
+ break;
+ }
+ }
}
spin_unlock(&map->owner.lock);
--
2.34.1
From: Andrii Nakryiko <andrii(a)kernel.org>
[ Upstream commit bc27c52eea189e8f7492d40739b7746d67b65beb ]
We use map->freeze_mutex to prevent races between map_freeze() and
memory mapping BPF map contents with writable permissions. The way we
naively do this means we'll hold freeze_mutex for entire duration of all
the mm and VMA manipulations, which is completely unnecessary. This can
potentially also lead to deadlocks, as reported by syzbot in [0].
So, instead, hold freeze_mutex only during writeability checks, bump
(proactively) "write active" count for the map, unlock the mutex and
proceed with mmap logic. And only if something went wrong during mmap
logic, then undo that "write active" counter increment.
[0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: syzbot+4dc041c686b7c816a71e(a)syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Link: https://lore.kernel.org/r/20250129012246.1515826-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: David Sauerwein <dssauerw(a)amazon.de>
---
kernel/bpf/syscall.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7a4004f09bae..27fdf1b2fc46 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -813,7 +813,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct bpf_map *map = filp->private_data;
- int err;
+ int err = 0;
if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
map_value_has_timer(map) || map_value_has_kptrs(map))
@@ -838,7 +838,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
err = -EACCES;
goto out;
}
+ bpf_map_write_active_inc(map);
}
+out:
+ mutex_unlock(&map->freeze_mutex);
+ if (err)
+ return err;
/* set default open/close callbacks */
vma->vm_ops = &bpf_map_default_vmops;
@@ -849,13 +854,11 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
vma->vm_flags &= ~VM_MAYWRITE;
err = map->ops->map_mmap(map, vma);
- if (err)
- goto out;
+ if (err) {
+ if (vma->vm_flags & VM_WRITE)
+ bpf_map_write_active_dec(map);
+ }
- if (vma->vm_flags & VM_MAYWRITE)
- bpf_map_write_active_inc(map);
-out:
- mutex_unlock(&map->freeze_mutex);
return err;
}
--
2.47.1
From: Jeff Layton <jlayton(a)kernel.org>
[ Upstream commit 930b64ca0c511521f0abdd1d57ce52b2a6e3476b ]
Currently, nfsd_proc_stat_init() ignores the return value of
svc_proc_register(). If the procfile creation fails, then the kernel
will WARN when it tries to remove the entry later.
Fix nfsd_proc_stat_init() to return the same type of pointer as
svc_proc_register(), and fix up nfsd_net_init() to check that and fail
the nfsd_net construction if it occurs.
svc_proc_register() can fail if the dentry can't be allocated, or if an
identical dentry already exists. The second case is pretty unlikely in
the nfsd_net construction codepath, so if this happens, return -ENOMEM.
Reported-by: syzbot+e34ad04f27991521104c(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-nfs/67a47501.050a0220.19061f.05f9.GAE@google.…
Cc: stable(a)vger.kernel.org # v6.9
Signed-off-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
---
fs/nfsd/nfsctl.c | 9 ++++++++-
fs/nfsd/stats.c | 4 ++--
fs/nfsd/stats.h | 2 +-
3 files changed, 11 insertions(+), 4 deletions(-)
I did not have any problem cherry-picking 930b64 onto v6.13.11. This
built and ran some simple NFSD tests in my lab.
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index e83629f39604..2e835e7c107e 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -2244,8 +2244,14 @@ static __net_init int nfsd_net_init(struct net *net)
NFSD_STATS_COUNTERS_NUM);
if (retval)
goto out_repcache_error;
+
memset(&nn->nfsd_svcstats, 0, sizeof(nn->nfsd_svcstats));
nn->nfsd_svcstats.program = &nfsd_programs[0];
+ if (!nfsd_proc_stat_init(net)) {
+ retval = -ENOMEM;
+ goto out_proc_error;
+ }
+
for (i = 0; i < sizeof(nn->nfsd_versions); i++)
nn->nfsd_versions[i] = nfsd_support_version(i);
for (i = 0; i < sizeof(nn->nfsd4_minorversions); i++)
@@ -2255,12 +2261,13 @@ static __net_init int nfsd_net_init(struct net *net)
nfsd4_init_leases_net(nn);
get_random_bytes(&nn->siphash_key, sizeof(nn->siphash_key));
seqlock_init(&nn->writeverf_lock);
- nfsd_proc_stat_init(net);
#if IS_ENABLED(CONFIG_NFS_LOCALIO)
INIT_LIST_HEAD(&nn->local_clients);
#endif
return 0;
+out_proc_error:
+ percpu_counter_destroy_many(nn->counter, NFSD_STATS_COUNTERS_NUM);
out_repcache_error:
nfsd_idmap_shutdown(net);
out_idmap_error:
diff --git a/fs/nfsd/stats.c b/fs/nfsd/stats.c
index bb22893f1157..f7eaf95e20fc 100644
--- a/fs/nfsd/stats.c
+++ b/fs/nfsd/stats.c
@@ -73,11 +73,11 @@ static int nfsd_show(struct seq_file *seq, void *v)
DEFINE_PROC_SHOW_ATTRIBUTE(nfsd);
-void nfsd_proc_stat_init(struct net *net)
+struct proc_dir_entry *nfsd_proc_stat_init(struct net *net)
{
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
- svc_proc_register(net, &nn->nfsd_svcstats, &nfsd_proc_ops);
+ return svc_proc_register(net, &nn->nfsd_svcstats, &nfsd_proc_ops);
}
void nfsd_proc_stat_shutdown(struct net *net)
diff --git a/fs/nfsd/stats.h b/fs/nfsd/stats.h
index 04aacb6c36e2..e4efb0e4e56d 100644
--- a/fs/nfsd/stats.h
+++ b/fs/nfsd/stats.h
@@ -10,7 +10,7 @@
#include <uapi/linux/nfsd/stats.h>
#include <linux/percpu_counter.h>
-void nfsd_proc_stat_init(struct net *net);
+struct proc_dir_entry *nfsd_proc_stat_init(struct net *net);
void nfsd_proc_stat_shutdown(struct net *net);
static inline void nfsd_stats_rc_hits_inc(struct nfsd_net *nn)
--
2.47.0
From: Andrii Nakryiko <andrii(a)kernel.org>
[ Upstream commit bc27c52eea189e8f7492d40739b7746d67b65beb ]
We use map->freeze_mutex to prevent races between map_freeze() and
memory mapping BPF map contents with writable permissions. The way we
naively do this means we'll hold freeze_mutex for entire duration of all
the mm and VMA manipulations, which is completely unnecessary. This can
potentially also lead to deadlocks, as reported by syzbot in [0].
So, instead, hold freeze_mutex only during writeability checks, bump
(proactively) "write active" count for the map, unlock the mutex and
proceed with mmap logic. And only if something went wrong during mmap
logic, then undo that "write active" counter increment.
[0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: syzbot+4dc041c686b7c816a71e(a)syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Link: https://lore.kernel.org/r/20250129012246.1515826-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: David Sauerwein <dssauerw(a)amazon.de>
---
kernel/bpf/syscall.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 008bb4e5c4dd..dc3d9a111cf1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -647,7 +647,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct bpf_map *map = filp->private_data;
- int err;
+ int err = 0;
if (!map->ops->map_mmap || map_value_has_spin_lock(map))
return -ENOTSUPP;
@@ -671,7 +671,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
err = -EACCES;
goto out;
}
+ bpf_map_write_active_inc(map);
}
+out:
+ mutex_unlock(&map->freeze_mutex);
+ if (err)
+ return err;
/* set default open/close callbacks */
vma->vm_ops = &bpf_map_default_vmops;
@@ -682,13 +687,11 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
vma->vm_flags &= ~VM_MAYWRITE;
err = map->ops->map_mmap(map, vma);
- if (err)
- goto out;
+ if (err) {
+ if (vma->vm_flags & VM_WRITE)
+ bpf_map_write_active_dec(map);
+ }
- if (vma->vm_flags & VM_MAYWRITE)
- bpf_map_write_active_inc(map);
-out:
- mutex_unlock(&map->freeze_mutex);
return err;
}
--
2.47.1
From: Christoph Hellwig <hch(a)lst.de>
commit 3eb96946f0be6bf447cbdf219aba22bc42672f92 upstream.
This patch is a backport.
Since the dawn of time bio_check_eod has a check for a non-zero size of
the device. This doesn't really make any sense as we never want to send
I/O to a device that's been set to zero size, or never moved out of that.
I am a bit surprised we haven't caught this for a long time, but the
removal of the extra validation inside of zram caused syzbot to trip
over this issue recently. I've added a Fixes tag for that commit, but
the issue really goes back way before git history.
Fixes: 9fe95babc742 ("zram: remove valid_io_request")
Reported-by: syzbot+2aca91e1d3ae43aef10c(a)syzkaller.appspotmail.com
Bug: https://syzkaller.appspot.com/bug?extid=2aca91e1d3ae43aef10c
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Link: https://lore.kernel.org/r/20230524060538.1593686-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
(cherry picked from commit 3eb96946f0be6bf447cbdf219aba22bc42672f92)
Signed-off-by: Miguel García <miguelgarciaroman8(a)gmail.com>
---
block/blk-core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 94941e3ce219..6a66f4f6912f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -515,7 +515,7 @@ static inline int bio_check_eod(struct bio *bio)
sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
unsigned int nr_sectors = bio_sectors(bio);
- if (nr_sectors && maxsector &&
+ if (nr_sectors &&
(nr_sectors > maxsector ||
bio->bi_iter.bi_sector > maxsector - nr_sectors)) {
pr_info_ratelimited("%s: attempt to access beyond end of device\n"
--
2.34.1
We are writing to inform you that you have been identified as the sole
beneficiary of a substantial inheritance left by a deceased relative, who
was a client of our firm. The estate is valued at USD$6,500,000.00, and we
are handling the legal proceedings to transfer the funds to you. Please
reply for more details.
From: Emanuele Ghidoli <emanuele.ghidoli(a)toradex.com>
If an input changes state during wake-up and is used as an interrupt
source, the IRQ handler reads the volatile input register to clear the
interrupt mask and deassert the IRQ line. However, the IRQ handler is
triggered before access to the register is granted, causing the read
operation to fail.
As a result, the IRQ handler enters a loop, repeatedly printing the
"failed reading register" message, until `pca953x_resume` is eventually
called, which restores the driver context and enables access to
registers.
Fix by using DEFINE_NOIRQ_DEV_PM_OPS which ensures that `pca953x_resume`
is called before the IRQ handler is called.
Fixes: b76574300504 ("gpio: pca953x: Restore registers after suspend/resume cycle")
Cc: stable(a)vger.kernel.org
Signed-off-by: Emanuele Ghidoli <emanuele.ghidoli(a)toradex.com>
Signed-off-by: Francesco Dolcini <francesco.dolcini(a)toradex.com>
---
drivers/gpio/gpio-pca953x.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpio/gpio-pca953x.c b/drivers/gpio/gpio-pca953x.c
index d63c1030e6ac..d39bdc125cfc 100644
--- a/drivers/gpio/gpio-pca953x.c
+++ b/drivers/gpio/gpio-pca953x.c
@@ -1252,7 +1252,7 @@ static int pca953x_resume(struct device *dev)
return ret;
}
-static DEFINE_SIMPLE_DEV_PM_OPS(pca953x_pm_ops, pca953x_suspend, pca953x_resume);
+static DEFINE_NOIRQ_DEV_PM_OPS(pca953x_pm_ops, pca953x_suspend, pca953x_resume);
/* convenience to stop overlong match-table lines */
#define OF_653X(__nrgpio, __int) ((void *)(__nrgpio | PCAL653X_TYPE | __int))
--
2.39.5
From: Nam Cao <namcao(a)linutronix.de>
When rv_is_container_monitor() is called on the last monitor in
rv_monitors_list, KASAN yells:
BUG: KASAN: global-out-of-bounds in rv_is_container_monitor+0x101/0x110
Read of size 8 at addr ffffffff97c7c798 by task setup/221
The buggy address belongs to the variable:
rv_monitors_list+0x18/0x40
This is due to list_next_entry() is called on the last entry in the list.
It wraps around to the first list_head, and the first list_head is not
embedded in struct rv_monitor_def.
Fix it by checking if the monitor is last in the list.
Cc: stable(a)vger.kernel.org
Cc: Gabriele Monaco <gmonaco(a)redhat.com>
Fixes: cb85c660fcd4 ("rv: Add option for nested monitors and include sched")
Link: https://lore.kernel.org/e85b5eeb7228bfc23b8d7d4ab5411472c54ae91b.1744355018…
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/rv/rv.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
index 968c5c3b0246..e4077500a91d 100644
--- a/kernel/trace/rv/rv.c
+++ b/kernel/trace/rv/rv.c
@@ -225,7 +225,12 @@ bool rv_is_nested_monitor(struct rv_monitor_def *mdef)
*/
bool rv_is_container_monitor(struct rv_monitor_def *mdef)
{
- struct rv_monitor_def *next = list_next_entry(mdef, list);
+ struct rv_monitor_def *next;
+
+ if (list_is_last(&mdef->list, &rv_monitors_list))
+ return false;
+
+ next = list_next_entry(mdef, list);
return next->parent == mdef->monitor || !mdef->monitor->enable;
}
--
2.47.2
From: Steven Rostedt <rostedt(a)goodmis.org>
The following causes a vsnprintf fault:
# echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
# echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger
Because the synthetic event's "wakee" field is created as a dynamic string
(even though the string copied is not). The print format to print the
dynamic string changed from "%*s" to "%s" because another location
(__set_synth_event_print_fmt()) exported this to user space, and user
space did not need that. But it is still used in print_synth_event(), and
the output looks like:
<idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155
sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58
<idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91
bash-880 [002] d..5. 193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21
<idle>-0 [001] d..5. 193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129
sshd-session-879 [001] d..5. 193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50
The length isn't needed as the string is always nul terminated. Just print
the string and not add the length (which was hard coded to the max string
length anyway).
Cc: stable(a)vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Tom Zanussi <zanussi(a)kernel.org>
Cc: Douglas Raillard <douglas.raillard(a)arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home
Fixes: 4d38328eb442d ("tracing: Fix synth event printk format for str fields");
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events_synth.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 969f48742d72..33cfbd4ed76d 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -370,7 +370,6 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
union trace_synth_field *data = &entry->fields[n_u64];
trace_seq_printf(s, print_fmt, se->fields[i]->name,
- STR_VAR_LEN_MAX,
(char *)entry + data->as_dynamic.offset,
i == se->n_fields - 1 ? "" : " ");
n_u64++;
--
2.47.2
From: Jarkko Sakkinen <jarkko.sakkinen(a)opinsys.com>
Add an isolated list of unreferenced keys to be queued for deletion, and
try to pin the keys in the garbage collector before processing anything.
Skip unpinnable keys.
Use this list for blocking the reaping process during the teardown:
1. First off, the keys added to `keys_graveyard` are snapshotted, and the
list is flushed. This the very last step in `key_put()`.
2. `key_put()` reaches zero. This will mark key as busy for the garbage
collector.
3. `key_garbage_collector()` will try to increase refcount, which won't go
above zero. Whenever this happens, the key will be skipped.
Cc: stable(a)vger.kernel.org # v6.1+
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)opinsys.com>
---
v8:
- One more rebasing error (2x list_splice_init, reported by Marek Szyprowski)
v7:
- Fixed multiple definitions (from rebasing).
v6:
- Rebase went wrong in v5.
v5:
- Rebased on top of v6.15-rc
- Updated commit message to explain how spin lock and refcount
isolate the time window in key_put().
v4:
- Pin the key while processing key type teardown. Skip dead keys.
- Revert key_gc_graveyard back key_gc_unused_keys.
- Rewrote the commit message.
- "unsigned long flags" declaration somehow did make to the previous
patch (sorry).
v3:
- Using spin_lock() fails since key_put() is executed inside IRQs.
Using spin_lock_irqsave() would neither work given the lock is
acquired for /proc/keys. Therefore, separate the lock for
graveyard and key_graveyard before reaping key_serial_tree.
v2:
- Rename key_gc_unused_keys as key_gc_graveyard, and re-document the
function.
---
include/linux/key.h | 7 ++-----
security/keys/gc.c | 36 ++++++++++++++++++++----------------
security/keys/internal.h | 5 +++++
security/keys/key.c | 7 +++++--
4 files changed, 32 insertions(+), 23 deletions(-)
diff --git a/include/linux/key.h b/include/linux/key.h
index ba05de8579ec..c50659184bdf 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -195,10 +195,8 @@ enum key_state {
struct key {
refcount_t usage; /* number of references */
key_serial_t serial; /* key serial number */
- union {
- struct list_head graveyard_link;
- struct rb_node serial_node;
- };
+ struct list_head graveyard_link; /* key->usage == 0 */
+ struct rb_node serial_node;
#ifdef CONFIG_KEY_NOTIFICATIONS
struct watch_list *watchers; /* Entities watching this key for changes */
#endif
@@ -236,7 +234,6 @@ struct key {
#define KEY_FLAG_ROOT_CAN_INVAL 7 /* set if key can be invalidated by root without permission */
#define KEY_FLAG_KEEP 8 /* set if key should not be removed */
#define KEY_FLAG_UID_KEYRING 9 /* set if key is a user or user session keyring */
-#define KEY_FLAG_FINAL_PUT 10 /* set if final put has happened on key */
/* the key type and key description string
* - the desc is used to match a key against search criteria
diff --git a/security/keys/gc.c b/security/keys/gc.c
index f27223ea4578..9ccd8ee6fcdb 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -189,6 +189,7 @@ static void key_garbage_collector(struct work_struct *work)
struct rb_node *cursor;
struct key *key;
time64_t new_timer, limit, expiry;
+ unsigned long flags;
kenter("[%lx,%x]", key_gc_flags, gc_state);
@@ -206,21 +207,35 @@ static void key_garbage_collector(struct work_struct *work)
new_timer = TIME64_MAX;
+ spin_lock_irqsave(&key_graveyard_lock, flags);
+ list_splice_init(&key_graveyard, &graveyard);
+ spin_unlock_irqrestore(&key_graveyard_lock, flags);
+
+ list_for_each_entry(key, &graveyard, graveyard_link) {
+ spin_lock(&key_serial_lock);
+ kdebug("unrefd key %d", key->serial);
+ rb_erase(&key->serial_node, &key_serial_tree);
+ spin_unlock(&key_serial_lock);
+ }
+
/* As only this function is permitted to remove things from the key
* serial tree, if cursor is non-NULL then it will always point to a
* valid node in the tree - even if lock got dropped.
*/
spin_lock(&key_serial_lock);
+ key = NULL;
cursor = rb_first(&key_serial_tree);
continue_scanning:
+ key_put(key);
while (cursor) {
key = rb_entry(cursor, struct key, serial_node);
cursor = rb_next(cursor);
-
- if (test_bit(KEY_FLAG_FINAL_PUT, &key->flags)) {
- smp_mb(); /* Clobber key->user after FINAL_PUT seen. */
- goto found_unreferenced_key;
+ /* key_get(), unless zero: */
+ if (!refcount_inc_not_zero(&key->usage)) {
+ key = NULL;
+ gc_state |= KEY_GC_REAP_AGAIN;
+ goto skip_dead_key;
}
if (unlikely(gc_state & KEY_GC_REAPING_DEAD_1)) {
@@ -274,6 +289,7 @@ static void key_garbage_collector(struct work_struct *work)
spin_lock(&key_serial_lock);
goto continue_scanning;
}
+ key_put(key);
/* We've completed the pass. Set the timer if we need to and queue a
* new cycle if necessary. We keep executing cycles until we find one
@@ -328,18 +344,6 @@ static void key_garbage_collector(struct work_struct *work)
kleave(" [end %x]", gc_state);
return;
- /* We found an unreferenced key - once we've removed it from the tree,
- * we can safely drop the lock.
- */
-found_unreferenced_key:
- kdebug("unrefd key %d", key->serial);
- rb_erase(&key->serial_node, &key_serial_tree);
- spin_unlock(&key_serial_lock);
-
- list_add_tail(&key->graveyard_link, &graveyard);
- gc_state |= KEY_GC_REAP_AGAIN;
- goto maybe_resched;
-
/* We found a restricted keyring and need to update the restriction if
* it is associated with the dead key type.
*/
diff --git a/security/keys/internal.h b/security/keys/internal.h
index 2cffa6dc8255..4e3d9b322390 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -63,9 +63,14 @@ struct key_user {
int qnbytes; /* number of bytes allocated to this user */
};
+extern struct list_head key_graveyard;
+extern spinlock_t key_graveyard_lock;
+
extern struct rb_root key_user_tree;
extern spinlock_t key_user_lock;
extern struct key_user root_key_user;
+extern struct list_head key_graveyard;
+extern spinlock_t key_graveyard_lock;
extern struct key_user *key_user_lookup(kuid_t uid);
extern void key_user_put(struct key_user *user);
diff --git a/security/keys/key.c b/security/keys/key.c
index 7198cd2ac3a3..7511f2017b6b 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -22,6 +22,8 @@ DEFINE_SPINLOCK(key_serial_lock);
struct rb_root key_user_tree; /* tree of quota records indexed by UID */
DEFINE_SPINLOCK(key_user_lock);
+LIST_HEAD(key_graveyard);
+DEFINE_SPINLOCK(key_graveyard_lock);
unsigned int key_quota_root_maxkeys = 1000000; /* root's key count quota */
unsigned int key_quota_root_maxbytes = 25000000; /* root's key space quota */
@@ -658,8 +660,9 @@ void key_put(struct key *key)
key->user->qnbytes -= key->quotalen;
spin_unlock_irqrestore(&key->user->lock, flags);
}
- smp_mb(); /* key->user before FINAL_PUT set. */
- set_bit(KEY_FLAG_FINAL_PUT, &key->flags);
+ spin_lock_irqsave(&key_graveyard_lock, flags);
+ list_add_tail(&key->graveyard_link, &key_graveyard);
+ spin_unlock_irqrestore(&key_graveyard_lock, flags);
schedule_work(&key_gc_work);
}
}
--
2.39.5
Hi,
Can I report an issue with 6.12 LTS?
This backport in 6.12.23
[805e3ce5e0e32b31dcecc0774c57c17a1f13cef6][1]
also needs this upstream commit as well
[22cc5ca5de52bbfc36a7d4a55323f91fb4492264][2]
If it is missing and you don't have XEN enabled the build fails:
```
arch/x86/coco/tdx/tdx.c:1080:13: error: no member named 'safe_halt' in
'struct pv_irq_ops'
1080 | pv_ops.irq.safe_halt = tdx_safe_halt;
| ~~~~~~~~~~ ^
arch/x86/coco/tdx/tdx.c:1081:13: error: no member named 'halt' in
'struct pv_irq_ops'
1081 | pv_ops.irq.halt = tdx_halt;
| ~~~~~~~~~~ ^
2 errors generated.
make[5]: *** [scripts/Makefile.build:229: arch/x86/coco/tdx/tdx.o] Error 1
make[4]: *** [scripts/Makefile.build:478: arch/x86/coco/tdx] Error 2
make[3]: *** [scripts/Makefile.build:478: arch/x86/coco] Error 2
make[2]: *** [scripts/Makefile.build:478: arch/x86] Error 2
```
To make it work I have added the backport of
[805e3ce5e0e32b31dcecc0774c57c17a1f13cef6][1] as patch in my local build[3].
Best regards,
Ike
[1]:
https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit…
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
[3]:
https://gitlab.com/herecura/packages/linux-bede-lts/-/blob/0d7e313f13fdae9b…
Hello,
This series disables the "serdes_wiz0" and "serdes_wiz1" device-tree
nodes in the J722S SoC file and enables them in the board files where
they are required along with "serdes0" and "serdes1". There are two
reasons behind this change:
1. To follow the existing convention of disabling nodes in the SoC file
and enabling them in the board file as required.
2. To address situations where a board file hasn't explicitly disabled
"serdes_wiz0" and "serdes_wiz1" (example: am67a-beagley-ai.dts)
as a result of which booting the board displays the following errors:
wiz bus@f0000:phy@f000000: probe with driver wiz failed with error -12
...
wiz bus@f0000:phy@f010000: probe with driver wiz failed with error -12
Series is based on linux-next tagged next-20250408.
v1 of this series is at:
https://lore.kernel.org/r/20250408060636.3413856-1-s-vadapalli@ti.com/
Changes since v1:
- Added "Fixes" tag and updated commit message accordingly.
Regards,
Siddharth.
Siddharth Vadapalli (2):
arm64: dts: ti: k3-j722s-evm: Enable "serdes_wiz0" and "serdes_wiz1"
arm64: dts: ti: k3-j722s-main: Disable "serdes_wiz0" and "serdes_wiz1"
arch/arm64/boot/dts/ti/k3-j722s-evm.dts | 8 ++++++++
arch/arm64/boot/dts/ti/k3-j722s-main.dtsi | 4 ++++
2 files changed, 12 insertions(+)
--
2.34.1
The quilt patch titled
Subject: mm: fix apply_to_existing_page_range()
has been removed from the -mm tree. Its filename was
mm-fix-apply_to_existing_page_range.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm: fix apply_to_existing_page_range()
Date: Wed, 9 Apr 2025 12:40:43 +0300
In the case of apply_to_existing_page_range(), apply_to_pte_range() is
reached with 'create' set to false. When !create, the loop over the PTE
page table is broken.
apply_to_pte_range() will only move to the next PTE entry if 'create' is
true or if the current entry is not pte_none().
This means that the user of apply_to_existing_page_range() will not have
'fn' called for any entries after the first pte_none() in the PTE page
table.
Fix the loop logic in apply_to_pte_range().
There are no known runtime issues from this, but the fix is trivial enough
for stable@ even without a known buggy user.
Link: https://lkml.kernel.org/r/20250409094043.1629234-1-kirill.shutemov@linux.in…
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Fixes: be1db4753ee6 ("mm/memory.c: add apply_to_existing_page_range() helper")
Cc: Daniel Axtens <dja(a)axtens.net>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/memory.c~mm-fix-apply_to_existing_page_range
+++ a/mm/memory.c
@@ -2938,11 +2938,11 @@ static int apply_to_pte_range(struct mm_
if (fn) {
do {
if (create || !pte_none(ptep_get(pte))) {
- err = fn(pte++, addr, data);
+ err = fn(pte, addr, data);
if (err)
break;
}
- } while (addr += PAGE_SIZE, addr != end);
+ } while (pte++, addr += PAGE_SIZE, addr != end);
}
*mask |= PGTBL_PTE_MODIFIED;
_
Patches currently in -mm which might be from kirill.shutemov(a)linux.intel.com are
mm-page_alloc-fix-deadlock-on-cpu_hotplug_lock-in-__accept_page.patch
The quilt patch titled
Subject: alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate
has been removed from the -mm tree. Its filename was
alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "T.J. Mercier" <tjmercier(a)google.com>
Subject: alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate
Date: Wed, 9 Apr 2025 22:51:11 +0000
alloc_pages_bulk_node() may partially succeed and allocate fewer than the
requested nr_pages. There are several conditions under which this can
occur, but we have encountered the case where CONFIG_PAGE_OWNER is enabled
causing all bulk allocations to always fallback to single page allocations
due to commit 187ad460b841 ("mm/page_alloc: avoid page allocator recursion
with pagesets.lock held").
Currently vm_module_tags_populate() immediately fails when
alloc_pages_bulk_node() returns fewer than the requested number of pages.
When this happens memory allocation profiling gets disabled, for example
[ 14.297583] [9: modprobe: 465] Failed to allocate memory for allocation tags in the module scsc_wlan. Memory allocation profiling is disabled!
[ 14.299339] [9: modprobe: 465] modprobe: Failed to insmod '/vendor/lib/modules/scsc_wlan.ko' with args '': Out of memory
This patch causes vm_module_tags_populate() to retry bulk allocations for
the remaining memory instead of failing immediately which will avoid the
disablement of memory allocation profiling.
Link: https://lkml.kernel.org/r/20250409225111.3770347-1-tjmercier@google.com
Fixes: 0f9b685626da ("alloc_tag: populate memory for module tags as needed")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
Reported-by: Janghyuck Kim <janghyuck.kim(a)samsung.com>
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/alloc_tag.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
--- a/lib/alloc_tag.c~alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate
+++ a/lib/alloc_tag.c
@@ -422,11 +422,20 @@ static int vm_module_tags_populate(void)
unsigned long old_shadow_end = ALIGN(phys_end, MODULE_ALIGN);
unsigned long new_shadow_end = ALIGN(new_end, MODULE_ALIGN);
unsigned long more_pages;
- unsigned long nr;
+ unsigned long nr = 0;
more_pages = ALIGN(new_end - phys_end, PAGE_SIZE) >> PAGE_SHIFT;
- nr = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN,
- NUMA_NO_NODE, more_pages, next_page);
+ while (nr < more_pages) {
+ unsigned long allocated;
+
+ allocated = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN,
+ NUMA_NO_NODE, more_pages - nr, next_page + nr);
+
+ if (!allocated)
+ break;
+ nr += allocated;
+ }
+
if (nr < more_pages ||
vmap_pages_range(phys_end, phys_end + (nr << PAGE_SHIFT), PAGE_KERNEL,
next_page, PAGE_SHIFT) < 0) {
_
Patches currently in -mm which might be from tjmercier(a)google.com are
The quilt patch titled
Subject: mm: fix filemap_get_folios_contig returning batches of identical folios
has been removed from the -mm tree. Its filename was
mm-fix-filemap_get_folios_contig-returning-batches-of-identical-folios.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Vishal Moola (Oracle)" <vishal.moola(a)gmail.com>
Subject: mm: fix filemap_get_folios_contig returning batches of identical folios
Date: Thu, 3 Apr 2025 16:54:17 -0700
filemap_get_folios_contig() is supposed to return distinct folios found
within [start, end]. Large folios in the Xarray become multi-index
entries. xas_next() can iterate through the sub-indexes before finding a
sibling entry and breaking out of the loop.
This can result in a returned folio_batch containing an indeterminate
number of duplicate folios, which forces the callers to skeptically handle
the returned batch. This is inefficient and incurs a large maintenance
overhead.
We can fix this by calling xas_advance() after we have successfully adding
a folio to the batch to ensure our Xarray is positioned such that it will
correctly find the next folio - similar to filemap_get_read_batch().
Link: https://lkml.kernel.org/r/Z-8s1-kiIDkzgRbc@fedora
Fixes: 35b471467f88 ("filemap: add filemap_get_folios_contig()")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Reported-by: Qu Wenruo <quwenruo.btrfs(a)gmx.com>
Closes: https://lkml.kernel.org/r/b714e4de-2583-4035-b829-72cfb5eb6fc6@gmx.com
Tested-by: Qu Wenruo <quwenruo.btrfs(a)gmx.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/filemap.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/filemap.c~mm-fix-filemap_get_folios_contig-returning-batches-of-identical-folios
+++ a/mm/filemap.c
@@ -2244,6 +2244,7 @@ unsigned filemap_get_folios_contig(struc
*start = folio->index + nr;
goto out;
}
+ xas_advance(&xas, folio_next_index(folio) - 1);
continue;
put_folio:
folio_put(folio);
_
Patches currently in -mm which might be from vishal.moola(a)gmail.com are
mm-compaction-use-folio-in-hugetlb-pathway.patch
The quilt patch titled
Subject: mm: page_alloc: speed up fallbacks in rmqueue_bulk()
has been removed from the -mm tree. Its filename was
mm-page_alloc-speed-up-fallbacks-in-rmqueue_bulk.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: mm: page_alloc: speed up fallbacks in rmqueue_bulk()
Date: Mon, 7 Apr 2025 14:01:53 -0400
The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
single pages from biggest buddy") as the root cause of a 56.4% regression
in vm-scalability::lru-file-mmap-read.
Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
freelist movement during block conversion"), as the root cause for a
regression in worst-case zone->lock+irqoff hold times.
Both of these patches modify the page allocator's fallback path to be less
greedy in an effort to stave off fragmentation. The flip side of this is
that fallbacks are also less productive each time around, which means the
fallback search can run much more frequently.
Carlos' traces point to rmqueue_bulk() specifically, which tries to refill
the percpu cache by allocating a large batch of pages in a loop. It
highlights how once the native freelists are exhausted, the fallback code
first scans orders top-down for whole blocks to claim, then falls back to
a bottom-up search for the smallest buddy to steal. For the next batch
page, it goes through the same thing again.
This can be made more efficient. Since rmqueue_bulk() holds the
zone->lock over the entire batch, the freelists are not subject to outside
changes; when the search for a block to claim has already failed, there is
no point in trying again for the next page.
Modify __rmqueue() to remember the last successful fallback mode, and
restart directly from there on the next rmqueue_bulk() iteration.
Oliver confirms that this improves beyond the regression that the test
robot reported against c2f6ea38fc1b:
commit:
f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch
f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
25525364 �� 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput
Carlos confirms that worst-case times are almost fully recovered
compared to before the earlier culprit patch:
2dd482ba627d (before freelist hygiene): 1ms
c0cd6f557b90 (after freelist hygiene): 90ms
next-20250319 (steal smallest buddy): 280ms
this patch : 8ms
[jackmanb(a)google.com: comment updates]
Link: https://lkml.kernel.org/r/D92AC0P9594X.3BML64MUKTF8Z@google.com
[hannes(a)cmpxchg.org: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin]
Link: https://lkml.kernel.org/r/20250409140023.GA2313@cmpxchg.org
Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org
Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Reported-by: Carlos Song <carlos.song(a)nxp.com>
Tested-by: Carlos Song <carlos.song(a)nxp.com>
Tested-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
Reviewed-by: Brendan Jackman <jackmanb(a)google.com>
Tested-by: Shivank Garg <shivankg(a)amd.com>
Acked-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org> [6.10+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 113 ++++++++++++++++++++++++++++++++--------------
1 file changed, 79 insertions(+), 34 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-speed-up-fallbacks-in-rmqueue_bulk
+++ a/mm/page_alloc.c
@@ -2183,23 +2183,15 @@ try_to_claim_block(struct zone *zone, st
}
/*
- * Try finding a free buddy page on the fallback list.
- *
- * This will attempt to claim a whole pageblock for the requested type
- * to ensure grouping of such requests in the future.
- *
- * If a whole block cannot be claimed, steal an individual page, regressing to
- * __rmqueue_smallest() logic to at least break up as little contiguity as
- * possible.
+ * Try to allocate from some fallback migratetype by claiming the entire block,
+ * i.e. converting it to the allocation's start migratetype.
*
* The use of signed ints for order and current_order is a deliberate
* deviation from the rest of this file, to make the for loop
* condition simpler.
- *
- * Return the stolen page, or NULL if none can be found.
*/
static __always_inline struct page *
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
unsigned int alloc_flags)
{
struct free_area *area;
@@ -2237,14 +2229,29 @@ __rmqueue_fallback(struct zone *zone, in
page = try_to_claim_block(zone, page, current_order, order,
start_migratetype, fallback_mt,
alloc_flags);
- if (page)
- goto got_one;
+ if (page) {
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, fallback_mt);
+ return page;
+ }
}
- if (alloc_flags & ALLOC_NOFRAGMENT)
- return NULL;
+ return NULL;
+}
+
+/*
+ * Try to steal a single page from some fallback migratetype. Leave the rest of
+ * the block as its current migratetype, potentially causing fragmentation.
+ */
+static __always_inline struct page *
+__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+{
+ struct free_area *area;
+ int current_order;
+ struct page *page;
+ int fallback_mt;
+ bool claim_block;
- /* No luck claiming pageblock. Find the smallest fallback page */
for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
area = &(zone->free_area[current_order]);
fallback_mt = find_suitable_fallback(area, current_order,
@@ -2254,25 +2261,28 @@ __rmqueue_fallback(struct zone *zone, in
page = get_page_from_free_area(area, fallback_mt);
page_del_and_expand(zone, page, order, current_order, fallback_mt);
- goto got_one;
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, fallback_mt);
+ return page;
}
return NULL;
-
-got_one:
- trace_mm_page_alloc_extfrag(page, order, current_order,
- start_migratetype, fallback_mt);
-
- return page;
}
+enum rmqueue_mode {
+ RMQUEUE_NORMAL,
+ RMQUEUE_CMA,
+ RMQUEUE_CLAIM,
+ RMQUEUE_STEAL,
+};
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
- unsigned int alloc_flags)
+ unsigned int alloc_flags, enum rmqueue_mode *mode)
{
struct page *page;
@@ -2291,16 +2301,48 @@ __rmqueue(struct zone *zone, unsigned in
}
}
- page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (alloc_flags & ALLOC_CMA)
+ /*
+ * First try the freelists of the requested migratetype, then try
+ * fallbacks modes with increasing levels of fragmentation risk.
+ *
+ * The fallback logic is expensive and rmqueue_bulk() calls in
+ * a loop with the zone->lock held, meaning the freelists are
+ * not subject to any outside changes. Remember in *mode where
+ * we found pay dirt, to save us the search on the next call.
+ */
+ switch (*mode) {
+ case RMQUEUE_NORMAL:
+ page = __rmqueue_smallest(zone, order, migratetype);
+ if (page)
+ return page;
+ fallthrough;
+ case RMQUEUE_CMA:
+ if (alloc_flags & ALLOC_CMA) {
page = __rmqueue_cma_fallback(zone, order);
-
- if (!page)
- page = __rmqueue_fallback(zone, order, migratetype,
- alloc_flags);
+ if (page) {
+ *mode = RMQUEUE_CMA;
+ return page;
+ }
+ }
+ fallthrough;
+ case RMQUEUE_CLAIM:
+ page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
+ if (page) {
+ /* Replenished preferred freelist, back to normal mode. */
+ *mode = RMQUEUE_NORMAL;
+ return page;
+ }
+ fallthrough;
+ case RMQUEUE_STEAL:
+ if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
+ page = __rmqueue_steal(zone, order, migratetype);
+ if (page) {
+ *mode = RMQUEUE_STEAL;
+ return page;
+ }
+ }
}
- return page;
+ return NULL;
}
/*
@@ -2312,6 +2354,7 @@ static int rmqueue_bulk(struct zone *zon
unsigned long count, struct list_head *list,
int migratetype, unsigned int alloc_flags)
{
+ enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
unsigned long flags;
int i;
@@ -2323,7 +2366,7 @@ static int rmqueue_bulk(struct zone *zon
}
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
- alloc_flags);
+ alloc_flags, &rmqm);
if (unlikely(page == NULL))
break;
@@ -2948,7 +2991,9 @@ struct page *rmqueue_buddy(struct zone *
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
- page = __rmqueue(zone, order, migratetype, alloc_flags);
+ enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
+
+ page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
/*
* If the allocation fails, allow OOM handling and
_
Patches currently in -mm which might be from hannes(a)cmpxchg.org are
mm-page_alloc-tighten-up-find_suitable_fallback.patch
The quilt patch titled
Subject: mm/vma: add give_up_on_oom option on modify/merge, use in uffd release
has been removed from the -mm tree. Its filename was
mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm/vma: add give_up_on_oom option on modify/merge, use in uffd release
Date: Fri, 21 Mar 2025 10:09:37 +0000
Currently, if a VMA merge fails due to an OOM condition arising on commit
merge or a failure to duplicate anon_vma's, we report this so the caller
can handle it.
However there are cases where the caller is only ostensibly trying a
merge, and doesn't mind if it fails due to this condition.
Since we do not want to introduce an implicit assumption that we only
actually modify VMAs after OOM conditions might arise, add a 'give up on
oom' option and make an explicit contract that, should this flag be set, we
absolutely will not modify any VMAs should OOM arise and just bail out.
Since it'd be very unusual for a user to try to vma_modify() with this flag
set but be specifying a range within a VMA which ends up being split (which
can fail due to rlimit issues, not only OOM), we add a debug warning for
this condition.
The motivating reason for this is uffd release - syzkaller (and Pedro
Falcato's VERY astute analysis) found a way in which an injected fault on
allocation, triggering an OOM condition on commit merge, would result in
uffd code becoming confused and treating an error value as if it were a VMA
pointer.
To avoid this, we make use of this new VMG flag to ensure that this never
occurs, utilising the fact that, should we be clearing entire VMAs, we do
not wish an OOM event to be reported to us.
Many thanks to Pedro Falcato for his excellent analysis and Jann Horn for
his insightful and intelligent analysis of the situation, both of whom were
instrumental in this fix.
Link: https://lkml.kernel.org/r/20250321100937.46634-1-lorenzo.stoakes@oracle.com
Reported-by: syzbot+20ed41006cf9d842c2b5(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001e.GAE@google.com/
Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Suggested-by: Pedro Falcato <pfalcato(a)suse.de>
Suggested-by: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/userfaultfd.c | 13 +++++++++--
mm/vma.c | 51 +++++++++++++++++++++++++++++++++++++++++----
mm/vma.h | 9 +++++++
3 files changed, 66 insertions(+), 7 deletions(-)
--- a/mm/userfaultfd.c~mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release
+++ a/mm/userfaultfd.c
@@ -1902,6 +1902,14 @@ struct vm_area_struct *userfaultfd_clear
unsigned long end)
{
struct vm_area_struct *ret;
+ bool give_up_on_oom = false;
+
+ /*
+ * If we are modifying only and not splitting, just give up on the merge
+ * if OOM prevents us from merging successfully.
+ */
+ if (start == vma->vm_start && end == vma->vm_end)
+ give_up_on_oom = true;
/* Reset ptes for the whole vma range if wr-protected */
if (userfaultfd_wp(vma))
@@ -1909,7 +1917,7 @@ struct vm_area_struct *userfaultfd_clear
ret = vma_modify_flags_uffd(vmi, prev, vma, start, end,
vma->vm_flags & ~__VM_UFFD_FLAGS,
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX, give_up_on_oom);
/*
* In the vma_merge() successful mprotect-like case 8:
@@ -1960,7 +1968,8 @@ int userfaultfd_register_range(struct us
new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end,
new_flags,
- (struct vm_userfaultfd_ctx){ctx});
+ (struct vm_userfaultfd_ctx){ctx},
+ /* give_up_on_oom = */false);
if (IS_ERR(vma))
return PTR_ERR(vma);
--- a/mm/vma.c~mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release
+++ a/mm/vma.c
@@ -666,6 +666,9 @@ static void vmg_adjust_set_range(struct
/*
* Actually perform the VMA merge operation.
*
+ * IMPORTANT: We guarantee that, should vmg->give_up_on_oom is set, to not
+ * modify any VMAs or cause inconsistent state should an OOM condition arise.
+ *
* Returns 0 on success, or an error value on failure.
*/
static int commit_merge(struct vma_merge_struct *vmg)
@@ -685,6 +688,12 @@ static int commit_merge(struct vma_merge
init_multi_vma_prep(&vp, vma, vmg);
+ /*
+ * If vmg->give_up_on_oom is set, we're safe, because we don't actually
+ * manipulate any VMAs until we succeed at preallocation.
+ *
+ * Past this point, we will not return an error.
+ */
if (vma_iter_prealloc(vmg->vmi, vma))
return -ENOMEM;
@@ -915,7 +924,13 @@ static __must_check struct vm_area_struc
if (anon_dup)
unlink_anon_vmas(anon_dup);
- vmg->state = VMA_MERGE_ERROR_NOMEM;
+ /*
+ * We've cleaned up any cloned anon_vma's, no VMAs have been
+ * modified, no harm no foul if the user requests that we not
+ * report this and just give up, leaving the VMAs unmerged.
+ */
+ if (!vmg->give_up_on_oom)
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return NULL;
}
@@ -926,7 +941,15 @@ static __must_check struct vm_area_struc
abort:
vma_iter_set(vmg->vmi, start);
vma_iter_load(vmg->vmi);
- vmg->state = VMA_MERGE_ERROR_NOMEM;
+
+ /*
+ * This means we have failed to clone anon_vma's correctly, but no
+ * actual changes to VMAs have occurred, so no harm no foul - if the
+ * user doesn't want this reported and instead just wants to give up on
+ * the merge, allow it.
+ */
+ if (!vmg->give_up_on_oom)
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return NULL;
}
@@ -1068,6 +1091,10 @@ int vma_expand(struct vma_merge_struct *
/* This should already have been checked by this point. */
VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
vma_start_write(next);
+ /*
+ * In this case we don't report OOM, so vmg->give_up_on_mm is
+ * safe.
+ */
ret = dup_anon_vma(middle, next, &anon_dup);
if (ret)
return ret;
@@ -1090,9 +1117,15 @@ int vma_expand(struct vma_merge_struct *
return 0;
nomem:
- vmg->state = VMA_MERGE_ERROR_NOMEM;
if (anon_dup)
unlink_anon_vmas(anon_dup);
+ /*
+ * If the user requests that we just give upon OOM, we are safe to do so
+ * here, as commit merge provides this contract to us. Nothing has been
+ * changed - no harm no foul, just don't report it.
+ */
+ if (!vmg->give_up_on_oom)
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return -ENOMEM;
}
@@ -1534,6 +1567,13 @@ static struct vm_area_struct *vma_modify
if (vmg_nomem(vmg))
return ERR_PTR(-ENOMEM);
+ /*
+ * Split can fail for reasons other than OOM, so if the user requests
+ * this it's probably a mistake.
+ */
+ VM_WARN_ON(vmg->give_up_on_oom &&
+ (vma->vm_start != start || vma->vm_end != end));
+
/* Split any preceding portion of the VMA. */
if (vma->vm_start < start) {
int err = split_vma(vmg->vmi, vma, start, 1);
@@ -1602,12 +1642,15 @@ struct vm_area_struct
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long new_flags,
- struct vm_userfaultfd_ctx new_ctx)
+ struct vm_userfaultfd_ctx new_ctx,
+ bool give_up_on_oom)
{
VMG_VMA_STATE(vmg, vmi, prev, vma, start, end);
vmg.flags = new_flags;
vmg.uffd_ctx = new_ctx;
+ if (give_up_on_oom)
+ vmg.give_up_on_oom = true;
return vma_modify(&vmg);
}
--- a/mm/vma.h~mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release
+++ a/mm/vma.h
@@ -114,6 +114,12 @@ struct vma_merge_struct {
*/
bool just_expand :1;
+ /*
+ * If a merge is possible, but an OOM error occurs, give up and don't
+ * execute the merge, returning NULL.
+ */
+ bool give_up_on_oom :1;
+
/* Internal flags set during merge process: */
/*
@@ -255,7 +261,8 @@ __must_check struct vm_area_struct
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long new_flags,
- struct vm_userfaultfd_ctx new_ctx);
+ struct vm_userfaultfd_ctx new_ctx,
+ bool give_up_on_oom);
__must_check struct vm_area_struct
*vma_merge_new_range(struct vma_merge_struct *vmg);
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
maintainers-add-memory-advice-section.patch
mm-vma-fix-incorrectly-disallowed-anonymous-vma-merges.patch
tools-testing-add-procmap_query-helper-functions-in-mm-self-tests.patch
tools-testing-selftests-assert-that-anon-merge-cases-behave-as-expected.patch
The quilt patch titled
Subject: selftests/mm: generate a temporary mountpoint for cgroup filesystem
has been removed from the -mm tree. Its filename was
selftests-mm-generate-a-temporary-mountpoint-for-cgroup-filesystem.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Mark Brown <broonie(a)kernel.org>
Subject: selftests/mm: generate a temporary mountpoint for cgroup filesystem
Date: Fri, 04 Apr 2025 17:42:32 +0100
Currently if the filesystem for the cgroups version it wants to use is not
mounted charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh tests
will attempt to mount it on the hard coded path /dev/cgroup/memory,
deleting that directory when the test finishes. This will fail if there
is not a preexisting directory at that path, and since the directory is
deleted subsequent runs of the test will fail. Instead of relying on this
hard coded directory name use mktemp to generate a temporary directory to
use as a mountpoint, fixing both the assumption and the disruption caused
by deleting a preexisting directory.
This means that if the relevant cgroup filesystem is not already mounted
then we rely on having coreutils (which provides mktemp) installed. I
suspect that many current users are relying on having things automounted
by default, and given that the script relies on bash it's probably not an
unreasonable requirement.
Link: https://lkml.kernel.org/r/20250404-kselftest-mm-cgroup2-detection-v1-1-3dba…
Fixes: 209376ed2a84 ("selftests/vm: make charge_reserved_hugetlb.sh work with existing cgroup setting")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: Aishwarya TCV <aishwarya.tcv(a)arm.com>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Mina Almasry <almasrymina(a)google.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Waiman Long <longman(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/charge_reserved_hugetlb.sh | 4 ++--
tools/testing/selftests/mm/hugetlb_reparenting_test.sh | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
--- a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh~selftests-mm-generate-a-temporary-mountpoint-for-cgroup-filesystem
+++ a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
@@ -29,7 +29,7 @@ fi
if [[ $cgroup2 ]]; then
cgroup_path=$(mount -t cgroup2 | head -1 | awk '{print $3}')
if [[ -z "$cgroup_path" ]]; then
- cgroup_path=/dev/cgroup/memory
+ cgroup_path=$(mktemp -d)
mount -t cgroup2 none $cgroup_path
do_umount=1
fi
@@ -37,7 +37,7 @@ if [[ $cgroup2 ]]; then
else
cgroup_path=$(mount -t cgroup | grep ",hugetlb" | awk '{print $3}')
if [[ -z "$cgroup_path" ]]; then
- cgroup_path=/dev/cgroup/memory
+ cgroup_path=$(mktemp -d)
mount -t cgroup memory,hugetlb $cgroup_path
do_umount=1
fi
--- a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh~selftests-mm-generate-a-temporary-mountpoint-for-cgroup-filesystem
+++ a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh
@@ -23,7 +23,7 @@ fi
if [[ $cgroup2 ]]; then
CGROUP_ROOT=$(mount -t cgroup2 | head -1 | awk '{print $3}')
if [[ -z "$CGROUP_ROOT" ]]; then
- CGROUP_ROOT=/dev/cgroup/memory
+ CGROUP_ROOT=$(mktemp -d)
mount -t cgroup2 none $CGROUP_ROOT
do_umount=1
fi
_
Patches currently in -mm which might be from broonie(a)kernel.org are
The quilt patch titled
Subject: mm/compaction: fix bug in hugetlb handling pathway
has been removed from the -mm tree. Its filename was
mm-compaction-fix-bug-in-hugetlb-handling-pathway.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Vishal Moola (Oracle)" <vishal.moola(a)gmail.com>
Subject: mm/compaction: fix bug in hugetlb handling pathway
Date: Mon, 31 Mar 2025 19:10:24 -0700
The compaction code doesn't take references on pages until we're certain
we should attempt to handle it.
In the hugetlb case, isolate_or_dissolve_huge_page() may return -EBUSY
without taking a reference to the folio associated with our pfn. If our
folio's refcount drops to 0, compound_nr() becomes unpredictable, making
low_pfn and nr_scanned unreliable. The user-visible effect is minimal -
this should rarely happen (if ever).
Fix this by storing the folio statistics earlier on the stack (just like
the THP and Buddy cases).
Also revert commit 66fe1cf7f581 ("mm: compaction: use helper compound_nr
in isolate_migratepages_block") to make backporting easier.
Link: https://lkml.kernel.org/r/20250401021025.637333-1-vishal.moola@gmail.com
Fixes: 369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Acked-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/compaction.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/compaction.c~mm-compaction-fix-bug-in-hugetlb-handling-pathway
+++ a/mm/compaction.c
@@ -981,13 +981,13 @@ isolate_migratepages_block(struct compac
}
if (PageHuge(page)) {
+ const unsigned int order = compound_order(page);
/*
* skip hugetlbfs if we are not compacting for pages
* bigger than its order. THPs and other compound pages
* are handled below.
*/
if (!cc->alloc_contig) {
- const unsigned int order = compound_order(page);
if (order <= MAX_PAGE_ORDER) {
low_pfn += (1UL << order) - 1;
@@ -1011,8 +1011,8 @@ isolate_migratepages_block(struct compac
/* Do not report -EBUSY down the chain */
if (ret == -EBUSY)
ret = 0;
- low_pfn += compound_nr(page) - 1;
- nr_scanned += compound_nr(page) - 1;
+ low_pfn += (1UL << order) - 1;
+ nr_scanned += (1UL << order) - 1;
goto isolate_fail;
}
_
Patches currently in -mm which might be from vishal.moola(a)gmail.com are
mm-compaction-use-folio-in-hugetlb-pathway.patch
The quilt patch titled
Subject: lib/iov_iter: fix to increase non slab folio refcount
has been removed from the -mm tree. Its filename was
lib-iov_iter-fix-to-increase-non-slab-folio-refcount.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Sheng Yong <shengyong1(a)xiaomi.com>
Subject: lib/iov_iter: fix to increase non slab folio refcount
Date: Tue, 1 Apr 2025 22:47:12 +0800
When testing EROFS file-backed mount over v9fs on qemu, I encountered a
folio UAF issue. The page sanity check reports the following call trace.
The root cause is that pages in bvec are coalesced across a folio bounary.
The refcount of all non-slab folios should be increased to ensure
p9_releas_pages can put them correctly.
BUG: Bad page state in process md5sum pfn:18300
page: refcount:0 mapcount:0 mapping:00000000d5ad8e4e index:0x60 pfn:0x18300
head: order:0 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
aops:z_erofs_aops ino:30b0f dentry name(?):"GoogleExtServicesCn.apk"
flags: 0x100000000000041(locked|head|node=0|zone=1)
raw: 0100000000000041 dead000000000100 dead000000000122 ffff888014b13bd0
raw: 0000000000000060 0000000000000020 00000000ffffffff 0000000000000000
head: 0100000000000041 dead000000000100 dead000000000122 ffff888014b13bd0
head: 0000000000000060 0000000000000020 00000000ffffffff 0000000000000000
head: 0100000000000000 0000000000000000 ffffffffffffffff 0000000000000000
head: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Call Trace:
dump_stack_lvl+0x53/0x70
bad_page+0xd4/0x220
__free_pages_ok+0x76d/0xf30
__folio_put+0x230/0x320
p9_release_pages+0x179/0x1f0
p9_virtio_zc_request+0xa2a/0x1230
p9_client_zc_rpc.constprop.0+0x247/0x700
p9_client_read_once+0x34d/0x810
p9_client_read+0xf3/0x150
v9fs_issue_read+0x111/0x360
netfs_unbuffered_read_iter_locked+0x927/0x1390
netfs_unbuffered_read_iter+0xa2/0xe0
vfs_iocb_iter_read+0x2c7/0x460
erofs_fileio_rq_submit+0x46b/0x5b0
z_erofs_runqueue+0x1203/0x21e0
z_erofs_readahead+0x579/0x8b0
read_pages+0x19f/0xa70
page_cache_ra_order+0x4ad/0xb80
filemap_readahead.isra.0+0xe7/0x150
filemap_get_pages+0x7aa/0x1890
filemap_read+0x320/0xc80
vfs_read+0x6c6/0xa30
ksys_read+0xf9/0x1c0
do_syscall_64+0x9e/0x1a0
entry_SYSCALL_64_after_hwframe+0x71/0x79
Link: https://lkml.kernel.org/r/20250401144712.1377719-1-shengyong1@xiaomi.com
Fixes: b9c0e49abfca ("mm: decline to manipulate the refcount on a slab page")
Signed-off-by: Sheng Yong <shengyong1(a)xiaomi.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/iov_iter.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/lib/iov_iter.c~lib-iov_iter-fix-to-increase-non-slab-folio-refcount
+++ a/lib/iov_iter.c
@@ -1191,7 +1191,7 @@ static ssize_t __iov_iter_get_pages_allo
return -ENOMEM;
p = *pages;
for (int k = 0; k < n; k++) {
- struct folio *folio = page_folio(page);
+ struct folio *folio = page_folio(page + k);
p[k] = page + k;
if (!folio_test_slab(folio))
folio_get(folio);
_
Patches currently in -mm which might be from shengyong1(a)xiaomi.com are
The patch titled
Subject: mm: hugetlb: fix incorrect fallback for subpool
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-hugetlb-fix-incorrect-fallback-for-subpool.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Wupeng Ma <mawupeng1(a)huawei.com>
Subject: mm: hugetlb: fix incorrect fallback for subpool
Date: Thu, 10 Apr 2025 14:26:33 +0800
During our testing with hugetlb subpool enabled, we observe that
hstate->resv_huge_pages may underflow into negative values. Root cause
analysis reveals a race condition in subpool reservation fallback handling
as follow:
hugetlb_reserve_pages()
/* Attempt subpool reservation */
gbl_reserve = hugepage_subpool_get_pages(spool, chg);
/* Global reservation may fail after subpool allocation */
if (hugetlb_acct_memory(h, gbl_reserve) < 0)
goto out_put_pages;
out_put_pages:
/* This incorrectly restores reservation to subpool */
hugepage_subpool_put_pages(spool, chg);
When hugetlb_acct_memory() fails after subpool allocation, the current
implementation over-commits subpool reservations by returning the full
'chg' value instead of the actual allocated 'gbl_reserve' amount. This
discrepancy propagates to global reservations during subsequent releases,
eventually causing resv_huge_pages underflow.
This problem can be trigger easily with the following steps:
1. reverse hugepage for hugeltb allocation
2. mount hugetlbfs with min_size to enable hugetlb subpool
3. alloc hugepages with two task(make sure the second will fail due to
insufficient amount of hugepages)
4. with for a few seconds and repeat step 3 which will make
hstate->resv_huge_pages to go below zero.
To fix this problem, return corrent amount of pages to subpool during the
fallback after hugepage_subpool_get_pages is called.
Link: https://lkml.kernel.org/r/20250410062633.3102457-1-mawupeng1@huawei.com
Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
Signed-off-by: Wupeng Ma <mawupeng1(a)huawei.com>
Tested-by: Joshua Hahn <joshua.hahnjy(a)gmail.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Ma Wupeng <mawupeng1(a)huawei.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
--- a/mm/hugetlb.c~mm-hugetlb-fix-incorrect-fallback-for-subpool
+++ a/mm/hugetlb.c
@@ -3010,7 +3010,7 @@ struct folio *alloc_hugetlb_folio(struct
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct folio *folio;
- long retval, gbl_chg;
+ long retval, gbl_chg, gbl_reserve;
map_chg_state map_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
@@ -3163,8 +3163,16 @@ out_uncharge_cgroup_reservation:
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- if (map_chg)
- hugepage_subpool_put_pages(spool, 1);
+ /*
+ * put page to subpool iff the quota of subpool's rsv_hpages is used
+ * during hugepage_subpool_get_pages.
+ */
+ if (map_chg && !gbl_chg) {
+ gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+ hugetlb_acct_memory(h, -gbl_reserve);
+ }
+
+
out_end_reservation:
if (map_chg != MAP_CHG_ENFORCED)
vma_end_reservation(h, vma, addr);
@@ -7233,7 +7241,7 @@ bool hugetlb_reserve_pages(struct inode
struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
- long chg = -1, add = -1;
+ long chg = -1, add = -1, spool_resv, gbl_resv;
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
@@ -7368,8 +7376,16 @@ bool hugetlb_reserve_pages(struct inode
return true;
out_put_pages:
- /* put back original number of pages, chg */
- (void)hugepage_subpool_put_pages(spool, chg);
+ spool_resv = chg - gbl_reserve;
+ if (spool_resv) {
+ /* put sub pool's reservation back, chg - gbl_reserve */
+ gbl_resv = hugepage_subpool_put_pages(spool, spool_resv);
+ /*
+ * subpool's reserved pages can not be put back due to race,
+ * return to hstate.
+ */
+ hugetlb_acct_memory(h, -gbl_resv);
+ }
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
chg * pages_per_huge_page(h), h_cg);
_
Patches currently in -mm which might be from mawupeng1(a)huawei.com are
mm-hugetlb-fix-incorrect-fallback-for-subpool.patch
From: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
With HostVM enabled, DCN31 fails to pass validation for 3x4k60. Some Linux
userspace does not downgrade one of the monitors to 4k30, and the result
is that the monitor does not light up. Disable it until the bandwidth
calculation failure is resolved.
Reviewed-by: Sun peng Li <sunpeng.li(a)amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Signed-off-by: Zaeem Mohamed <zaeem.mohamed(a)amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit ba93dddfc92084a1e28ea447ec4f8315f3d8d3fd)
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/amd/display/dc/resource/dcn31/dcn31_resource.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/dc/resource/dcn31/dcn31_resource.c b/drivers/gpu/drm/amd/display/dc/resource/dcn31/dcn31_resource.c
index 911bd60d4fbc..3c42ba8566cf 100644
--- a/drivers/gpu/drm/amd/display/dc/resource/dcn31/dcn31_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/resource/dcn31/dcn31_resource.c
@@ -890,7 +890,7 @@ static const struct dc_debug_options debug_defaults_drv = {
.disable_z10 = true,
.enable_legacy_fast_update = true,
.enable_z9_disable_interface = true, /* Allow support for the PMFW interface for disable Z9*/
- .dml_hostvm_override = DML_HOSTVM_NO_OVERRIDE,
+ .dml_hostvm_override = DML_HOSTVM_OVERRIDE_FALSE,
.using_dml2 = false,
};
--
2.49.0
Dear,
Send your Ref: FSG2025 / Name / Phone Number / Country to Mr. Andrej
Mahecic on un.grant(a)socialworker.net, +1 888 673 0430 for your £100,000.00.
Sincerely
Mr. C. Gunness
On behalf of the UN.
If we finds a vq without a name in our input array in
virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
Consequently, we create only a queue if it actually exists (name != NULL)
and assign an incremental queue index to each such existing queue.
However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
will not ignore these "non-existing queues", but instead assign an airq
indicator to them.
Besides never releasing them in virtio_ccw_drop_indicators() (because
there is no virtqueue), the bigger issue seems to be that there will be a
disagreement between the device and the Linux guest about the airq
indicator to be used for notifying a queue, because the indicator bit
for adapter I/O interrupt is derived from the queue index.
The virtio spec states under "Setting Up Two-Stage Queue Indicators":
... indicator contains the guest address of an area wherein the
indicators for the devices are contained, starting at bit_nr, one
bit per virtqueue of the device.
And further in "Notification via Adapter I/O Interrupts":
For notifying the driver of virtqueue buffers, the device sets the
bit in the guest-provided indicator area at the corresponding
offset.
For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
"vector") to select the relevant indicator bit. If a queue does not exist,
it does not have a corresponding indicator bit assigned, because it
effectively doesn't have a queue index.
Using a virtio-balloon-ccw device under QEMU with free-page-hinting
disabled ("free-page-hint=off") but free-page-reporting enabled
("free-page-reporting=on") will result in free page reporting
not working as expected: in the virtio_balloon driver, we'll be stuck
forever in virtballoon_free_page_report()->wait_event(), because the
waitqueue will not be woken up as the notification from the device is
lost: it would use the wrong indicator bit.
Free page reporting stops working and we get splats (when configured to
detect hung wqs) like:
INFO: task kworker/1:3:463 blocked for more than 61 seconds.
Not tainted 6.14.0 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/1:3 [...]
Workqueue: events page_reporting_process
Call Trace:
[<000002f404e6dfb2>] __schedule+0x402/0x1640
[<000002f404e6f22e>] schedule+0x3e/0xe0
[<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
[<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
[<000002f403fd3ee2>] process_one_work+0x1c2/0x400
[<000002f403fd4b96>] worker_thread+0x296/0x420
[<000002f403fe10b4>] kthread+0x124/0x290
[<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
[<000002f404e77272>] ret_from_fork+0xa/0x38
There was recently a discussion [1] whether the "holes" should be
treated differently again, effectively assigning also non-existing
queues a queue index: that should also fix the issue, but requires other
workarounds to not break existing setups.
Let's fix it without affecting existing setups for now by properly ignoring
the non-existing queues, so the indicator bits will match the queue
indexes.
[1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/
Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL")
Reported-by: Chandra Merla <cmerla(a)redhat.com>
Cc: <Stable(a)vger.kernel.org>
Cc: Cornelia Huck <cohuck(a)redhat.com>
Cc: Thomas Huth <thuth(a)redhat.com>
Cc: Halil Pasic <pasic(a)linux.ibm.com>
Cc: Eric Farman <farman(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Wei Wang <wei.w.wang(a)intel.com>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 21fa7ac849e5c..4904b831c0a75 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -302,11 +302,17 @@ static struct airq_info *new_airq_info(int index)
static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
u64 *first, void **airq_info)
{
- int i, j;
+ int i, j, queue_idx, highest_queue_idx = -1;
struct airq_info *info;
unsigned long *indicator_addr = NULL;
unsigned long bit, flags;
+ /* Array entries without an actual queue pointer must be ignored. */
+ for (i = 0; i < nvqs; i++) {
+ if (vqs[i])
+ highest_queue_idx++;
+ }
+
for (i = 0; i < MAX_AIRQ_AREAS && !indicator_addr; i++) {
mutex_lock(&airq_areas_lock);
if (!airq_areas[i])
@@ -316,7 +322,7 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
if (!info)
return NULL;
write_lock_irqsave(&info->lock, flags);
- bit = airq_iv_alloc(info->aiv, nvqs);
+ bit = airq_iv_alloc(info->aiv, highest_queue_idx + 1);
if (bit == -1UL) {
/* Not enough vacancies. */
write_unlock_irqrestore(&info->lock, flags);
@@ -325,8 +331,10 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
*first = bit;
*airq_info = info;
indicator_addr = info->aiv->vector;
- for (j = 0; j < nvqs; j++) {
- airq_iv_set_ptr(info->aiv, bit + j,
+ for (j = 0, queue_idx = 0; j < nvqs; j++) {
+ if (!vqs[j])
+ continue;
+ airq_iv_set_ptr(info->aiv, bit + queue_idx++,
(unsigned long)vqs[j]);
}
write_unlock_irqrestore(&info->lock, flags);
--
2.48.1
From: Frode Isaksen <frode(a)meta.com>
Invalidate io_data by setting context to NULL when USB request is
dequeued or interrupted, and check for NULL io_data in epfile_io_complete().
The invalidation of io_data in req->context is done when exiting
epfile_io(), since then io_data will become invalid as it is allocated
on the stack.
The epfile_io_complete() may be called after ffs_epfile_io() returns
in case the wait_for_completion_interruptible() is interrupted.
This fixes a use-after-free error with the following call stack:
Unable to handle kernel paging request at virtual address ffffffc02f7bbcc0
pc : ffs_epfile_io_complete+0x30/0x48
lr : usb_gadget_giveback_request+0x30/0xf8
Call trace:
ffs_epfile_io_complete+0x30/0x48
usb_gadget_giveback_request+0x30/0xf8
dwc3_remove_requests+0x264/0x2e8
dwc3_gadget_pullup+0x1d0/0x250
kretprobe_trampoline+0x0/0xc4
usb_gadget_remove_driver+0x40/0xf4
usb_gadget_unregister_driver+0xdc/0x178
unregister_gadget_item+0x40/0x6c
ffs_closed+0xd4/0x10c
ffs_data_clear+0x2c/0xf0
ffs_data_closed+0x178/0x1ec
ffs_ep0_release+0x24/0x38
__fput+0xe8/0x27c
Signed-off-by: Frode Isaksen <frode(a)meta.com>
Cc: stable(a)vger.kernel.org
---
v1 -> v2:
Removed WARN_ON() in ffs_epfile_io_complete().
Clarified commit message.
Added stable Cc tag.
drivers/usb/gadget/function/f_fs.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 2dea9e42a0f8..e35d32e7be58 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -738,6 +738,9 @@ static void ffs_epfile_io_complete(struct usb_ep *_ep, struct usb_request *req)
{
struct ffs_io_data *io_data = req->context;
+ if (io_data == NULL)
+ return;
+
if (req->status)
io_data->status = req->status;
else
@@ -1126,6 +1129,7 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
spin_lock_irq(&epfile->ffs->eps_lock);
if (epfile->ep != ep) {
ret = -ESHUTDOWN;
+ req->context = NULL;
goto error_lock;
}
/*
@@ -1140,6 +1144,7 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
interrupted = io_data->status < 0;
}
+ req->context = NULL;
if (interrupted)
ret = -EINTR;
else if (io_data->read && io_data->status > 0)
--
2.49.0
The current implementation uses bias_pad_enable as a reference count to
manage the shared bias pad for all UTMI PHYs. However, during system
suspension with connected USB devices, multiple power-down requests for
the UTMI pad result in a mismatch in the reference count, which in turn
produces warnings such as:
[ 237.762967] WARNING: CPU: 10 PID: 1618 at tegra186_utmi_pad_power_down+0x160/0x170
[ 237.763103] Call trace:
[ 237.763104] tegra186_utmi_pad_power_down+0x160/0x170
[ 237.763107] tegra186_utmi_phy_power_off+0x10/0x30
[ 237.763110] phy_power_off+0x48/0x100
[ 237.763113] tegra_xusb_enter_elpg+0x204/0x500
[ 237.763119] tegra_xusb_suspend+0x48/0x140
[ 237.763122] platform_pm_suspend+0x2c/0xb0
[ 237.763125] dpm_run_callback.isra.0+0x20/0xa0
[ 237.763127] __device_suspend+0x118/0x330
[ 237.763129] dpm_suspend+0x10c/0x1f0
[ 237.763130] dpm_suspend_start+0x88/0xb0
[ 237.763132] suspend_devices_and_enter+0x120/0x500
[ 237.763135] pm_suspend+0x1ec/0x270
The root cause was traced back to the dynamic power-down changes
introduced in commit a30951d31b25 ("xhci: tegra: USB2 pad power controls"),
where the UTMI pad was being powered down without verifying its current
state. This unbalanced behavior led to discrepancies in the reference
count.
To rectify this issue, this patch replaces the single reference counter
with a bitmask, renamed to utmi_pad_enabled. Each bit in the mask
corresponds to one of the four USB2 PHYs, allowing us to track each pad's
enablement status individually.
With this change:
- The bias pad is powered on only when the mask is clear.
- Each UTMI pad is powered on or down based on its corresponding bit
in the mask, preventing redundant operations.
- The overall power state of the shared bias pad is maintained
correctly during suspend/resume cycles.
The mutex used to prevent race conditions during UTMI pad enable/disable
operations has been moved from the tegra186_utmi_bias_pad_power_on/off
functions to the parent functions tegra186_utmi_pad_power_on/down. This
change ensures that there are no race conditions when updating the bitmask.
Cc: stable(a)vger.kernel.org
Fixes: a30951d31b25 ("xhci: tegra: USB2 pad power controls")
Signed-off-by: Wayne Chang <waynec(a)nvidia.com>
---
V1 -> V2: holding the padctl->lock to protect shared bitmask
V2 -> V3: updating the commit message with the mutex changes
drivers/phy/tegra/xusb-tegra186.c | 44 +++++++++++++++++++------------
1 file changed, 27 insertions(+), 17 deletions(-)
diff --git a/drivers/phy/tegra/xusb-tegra186.c b/drivers/phy/tegra/xusb-tegra186.c
index fae6242aa730..cc7b8a6a999f 100644
--- a/drivers/phy/tegra/xusb-tegra186.c
+++ b/drivers/phy/tegra/xusb-tegra186.c
@@ -237,6 +237,8 @@
#define DATA0_VAL_PD BIT(1)
#define USE_XUSB_AO BIT(4)
+#define TEGRA_UTMI_PAD_MAX 4
+
#define TEGRA186_LANE(_name, _offset, _shift, _mask, _type) \
{ \
.name = _name, \
@@ -269,7 +271,7 @@ struct tegra186_xusb_padctl {
/* UTMI bias and tracking */
struct clk *usb2_trk_clk;
- unsigned int bias_pad_enable;
+ DECLARE_BITMAP(utmi_pad_enabled, TEGRA_UTMI_PAD_MAX);
/* padctl context */
struct tegra186_xusb_padctl_context context;
@@ -603,12 +605,8 @@ static void tegra186_utmi_bias_pad_power_on(struct tegra_xusb_padctl *padctl)
u32 value;
int err;
- mutex_lock(&padctl->lock);
-
- if (priv->bias_pad_enable++ > 0) {
- mutex_unlock(&padctl->lock);
+ if (!bitmap_empty(priv->utmi_pad_enabled, TEGRA_UTMI_PAD_MAX))
return;
- }
err = clk_prepare_enable(priv->usb2_trk_clk);
if (err < 0)
@@ -667,17 +665,8 @@ static void tegra186_utmi_bias_pad_power_off(struct tegra_xusb_padctl *padctl)
struct tegra186_xusb_padctl *priv = to_tegra186_xusb_padctl(padctl);
u32 value;
- mutex_lock(&padctl->lock);
-
- if (WARN_ON(priv->bias_pad_enable == 0)) {
- mutex_unlock(&padctl->lock);
- return;
- }
-
- if (--priv->bias_pad_enable > 0) {
- mutex_unlock(&padctl->lock);
+ if (!bitmap_empty(priv->utmi_pad_enabled, TEGRA_UTMI_PAD_MAX))
return;
- }
value = padctl_readl(padctl, XUSB_PADCTL_USB2_BIAS_PAD_CTL1);
value |= USB2_PD_TRK;
@@ -690,13 +679,13 @@ static void tegra186_utmi_bias_pad_power_off(struct tegra_xusb_padctl *padctl)
clk_disable_unprepare(priv->usb2_trk_clk);
}
- mutex_unlock(&padctl->lock);
}
static void tegra186_utmi_pad_power_on(struct phy *phy)
{
struct tegra_xusb_lane *lane = phy_get_drvdata(phy);
struct tegra_xusb_padctl *padctl = lane->pad->padctl;
+ struct tegra186_xusb_padctl *priv = to_tegra186_xusb_padctl(padctl);
struct tegra_xusb_usb2_port *port;
struct device *dev = padctl->dev;
unsigned int index = lane->index;
@@ -705,9 +694,16 @@ static void tegra186_utmi_pad_power_on(struct phy *phy)
if (!phy)
return;
+ mutex_lock(&padctl->lock);
+ if (test_bit(index, priv->utmi_pad_enabled)) {
+ mutex_unlock(&padctl->lock);
+ return;
+ }
+
port = tegra_xusb_find_usb2_port(padctl, index);
if (!port) {
dev_err(dev, "no port found for USB2 lane %u\n", index);
+ mutex_unlock(&padctl->lock);
return;
}
@@ -724,18 +720,28 @@ static void tegra186_utmi_pad_power_on(struct phy *phy)
value = padctl_readl(padctl, XUSB_PADCTL_USB2_OTG_PADX_CTL1(index));
value &= ~USB2_OTG_PD_DR;
padctl_writel(padctl, value, XUSB_PADCTL_USB2_OTG_PADX_CTL1(index));
+
+ set_bit(index, priv->utmi_pad_enabled);
+ mutex_unlock(&padctl->lock);
}
static void tegra186_utmi_pad_power_down(struct phy *phy)
{
struct tegra_xusb_lane *lane = phy_get_drvdata(phy);
struct tegra_xusb_padctl *padctl = lane->pad->padctl;
+ struct tegra186_xusb_padctl *priv = to_tegra186_xusb_padctl(padctl);
unsigned int index = lane->index;
u32 value;
if (!phy)
return;
+ mutex_lock(&padctl->lock);
+ if (!test_bit(index, priv->utmi_pad_enabled)) {
+ mutex_unlock(&padctl->lock);
+ return;
+ }
+
dev_dbg(padctl->dev, "power down UTMI pad %u\n", index);
value = padctl_readl(padctl, XUSB_PADCTL_USB2_OTG_PADX_CTL0(index));
@@ -748,7 +754,11 @@ static void tegra186_utmi_pad_power_down(struct phy *phy)
udelay(2);
+ clear_bit(index, priv->utmi_pad_enabled);
+
tegra186_utmi_bias_pad_power_off(padctl);
+
+ mutex_unlock(&padctl->lock);
}
static int tegra186_xusb_padctl_vbus_override(struct tegra_xusb_padctl *padctl,
--
2.25.1
The value returned by acpi_evaluate_integer() is not checked,
but the result is not always successful, so it is necessary to
add a check of the returned value.
If the result remains negative during three iterations of the loop,
then the uninitialized variable 'val' will be used in the clamp_val()
macro, so it must be initialized with the current value of the 'curr'
variable.
In this case, the algorithm should be less noisy.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: b23910c2194e ("asus-laptop: Pegatron Lucid accelerometer")
Cc: stable(a)vger.kernel.org
Signed-off-by: Denis Arefev <arefev(a)swemel.ru>
---
V1 -> V2:
Added check of the return value it as Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com> suggested.
Changed initialization of 'val' variable it as Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com> suggested.
drivers/platform/x86/asus-laptop.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/platform/x86/asus-laptop.c b/drivers/platform/x86/asus-laptop.c
index d460dd194f19..ff674c6d0bbb 100644
--- a/drivers/platform/x86/asus-laptop.c
+++ b/drivers/platform/x86/asus-laptop.c
@@ -427,10 +427,12 @@ static int asus_pega_lucid_set(struct asus_laptop *asus, int unit, bool enable)
static int pega_acc_axis(struct asus_laptop *asus, int curr, char *method)
{
int i, delta;
- unsigned long long val;
+ acpi_status status;
+ unsigned long long val = (unsigned long long)curr;
for (i = 0; i < PEGA_ACC_RETRIES; i++) {
- acpi_evaluate_integer(asus->handle, method, NULL, &val);
-
+ status = acpi_evaluate_integer(asus->handle, method, NULL, &val);
+ if (ACPI_FAILURE(status))
+ continue;
/* The output is noisy. From reading the ASL
* dissassembly, timeout errors are returned with 1's
* in the high word, and the lack of locking around
--
2.43.0
Nouveau currently relies on the assumption that dma_fences will only
ever get signaled through nouveau_fence_signal(), which takes care of
removing a signaled fence from the list nouveau_fence_chan.pending.
This self-imposed rule is violated in nouveau_fence_done(), where
dma_fence_is_signaled() (somewhat surprisingly, considering its name)
can signal the fence without removing it from the list. This enables
accesses to already signaled fences through the list, which is a bug.
In particular, it can race with nouveau_fence_context_kill(), which
would then attempt to set an error code on an already signaled fence,
which is illegal.
In nouveau_fence_done(), the call to nouveau_fence_update() already
ensures to signal all ready fences. Thus, the signaling potentially
performed by dma_fence_is_signaled() is actually not necessary.
Replace the call to dma_fence_is_signaled() with
nouveau_fence_base_is_signaled().
Cc: <stable(a)vger.kernel.org> # 4.10+, precise commit not to be determined
Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 7cc84472cece..33535987d8ed 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
nvif_event_block(&fctx->event);
spin_unlock_irqrestore(&fctx->lock, flags);
}
- return dma_fence_is_signaled(&fence->base);
+ return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
}
static long
--
2.48.1
This requirement was overeagerly loosened in commit 2f83e38a095f
("tty: Permit some TIOCL_SETSEL modes without CAP_SYS_ADMIN"), but as
it turns out,
(1) the logic I implemented there was inconsistent (apologies!),
(2) TIOCL_SELMOUSEREPORT might actually be a small security risk
after all, and
(3) TIOCL_SELMOUSEREPORT is only meant to be used by the mouse
daemon (GPM or Consolation), which runs as CAP_SYS_ADMIN
already.
In more detail:
1. The previous patch has inconsistent logic:
In commit 2f83e38a095f ("tty: Permit some TIOCL_SETSEL modes
without CAP_SYS_ADMIN"), we checked for sel_mode ==
TIOCL_SELMOUSEREPORT, but overlooked that the lower four bits of
this "mode" parameter were actually used as an additional way to
pass an argument. So the patch did actually still require
CAP_SYS_ADMIN, if any of the mouse button bits are set, but did not
require it if none of the mouse buttons bits are set.
This logic is inconsistent and was not intentional. We should have
the same policies for using TIOCL_SELMOUSEREPORT independent of the
value of the "hidden" mouse button argument.
I sent a separate documentation patch to the man page list with
more details on TIOCL_SELMOUSEREPORT:
https://lore.kernel.org/all/20250223091342.35523-2-gnoack3000@gmail.com/
2. TIOCL_SELMOUSEREPORT is indeed a potential security risk which can
let an attacker simulate "keyboard" input to command line
applications on the same terminal, like TIOCSTI and some other
TIOCLINUX "selection mode" IOCTLs.
By enabling mouse reporting on a terminal and then injecting mouse
reports through TIOCL_SELMOUSEREPORT, an attacker can simulate
mouse movements on the same terminal, similar to the TIOCSTI
keystroke injection attacks that were previously possible with
TIOCSTI and other TIOCL_SETSEL selection modes.
Many programs (including libreadline/bash) are then prone to
misinterpret these mouse reports as normal keyboard input because
they do not expect input in the X11 mouse protocol form. The
attacker does not have complete control over the escape sequence,
but they can at least control the values of two consecutive bytes
in the binary mouse reporting escape sequence.
I went into more detail on that in the discussion at
https://lore.kernel.org/all/20250221.0a947528d8f3@gnoack.org/
It is not equally trivial to simulate arbitrary keystrokes as it
was with TIOCSTI (commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be
disabled")), but the general mechanism is there, and together with
the small number of existing legit use cases (see below), it would
be better to revert back to requiring CAP_SYS_ADMIN for
TIOCL_SELMOUSEREPORT, as it was already the case before
commit 2f83e38a095f ("tty: Permit some TIOCL_SETSEL modes without
CAP_SYS_ADMIN").
3. TIOCL_SELMOUSEREPORT is only used by the mouse daemons (GPM or
Consolation), and they are the only legit use case:
To quote console_codes(4):
The mouse tracking facility is intended to return
xterm(1)-compatible mouse status reports. Because the console
driver has no way to know the device or type of the mouse, these
reports are returned in the console input stream only when the
virtual terminal driver receives a mouse update ioctl. These
ioctls must be generated by a mouse-aware user-mode application
such as the gpm(8) daemon.
Jared Finder has also confirmed in
https://lore.kernel.org/all/491f3df9de6593df8e70dbe77614b026@finder.org/
that Emacs does not call TIOCL_SELMOUSEREPORT directly, and it
would be difficult to find good reasons for doing that, given that
it would interfere with the reports that GPM is sending.
More information on the interaction between GPM, terminals and the
kernel with additional pointers is also available in this patch:
https://lore.kernel.org/all/a773e48920aa104a65073671effbdee665c105fc.160396…
For background on who else uses TIOCL_SELMOUSEREPORT: Debian Code
search finds one page of results, the only two known callers are
the two mouse daemons GPM and Consolation. (GPM does not show up
in the search results because it uses literal numbers to refer to
TIOCLINUX-related enums. I looked through GPM by hand instead.
TIOCL_SELMOUSEREPORT is also not used from libgpm.)
https://codesearch.debian.net/search?q=TIOCL_SELMOUSEREPORT
Cc: Jared Finder <jared(a)finder.org>
Cc: Jann Horn <jannh(a)google.com>
Cc: Hanno Böck <hanno(a)hboeck.de>
Cc: Jiri Slaby <jirislaby(a)kernel.org>
Cc: Kees Cook <kees(a)kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: 2f83e38a095f ("tty: Permit some TIOCL_SETSEL modes without CAP_SYS_ADMIN")
Signed-off-by: Günther Noack <gnoack3000(a)gmail.com>
---
drivers/tty/vt/selection.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 0bd6544e30a6b..791e2f1f7c0b6 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -193,13 +193,12 @@ int set_selection_user(const struct tiocl_selection __user *sel,
return -EFAULT;
/*
- * TIOCL_SELCLEAR, TIOCL_SELPOINTER and TIOCL_SELMOUSEREPORT are OK to
- * use without CAP_SYS_ADMIN as they do not modify the selection.
+ * TIOCL_SELCLEAR and TIOCL_SELPOINTER are OK to use without
+ * CAP_SYS_ADMIN as they do not modify the selection.
*/
switch (v.sel_mode) {
case TIOCL_SELCLEAR:
case TIOCL_SELPOINTER:
- case TIOCL_SELMOUSEREPORT:
break;
default:
if (!capable(CAP_SYS_ADMIN))
base-commit: 27102b38b8ca7ffb1622f27bcb41475d121fb67f
--
2.48.1
When rv_is_container_monitor() is called on the last monitor in
rv_monitors_list, KASAN yells:
BUG: KASAN: global-out-of-bounds in rv_is_container_monitor+0x101/0x110
Read of size 8 at addr ffffffff97c7c798 by task setup/221
The buggy address belongs to the variable:
rv_monitors_list+0x18/0x40
This is due to list_next_entry() is called on the last entry in the list.
It wraps around to the first list_head, and the first list_head is not
embedded in struct rv_monitor_def.
Fix it by checking if the monitor is last in the list.
Fixes: cb85c660fcd4 ("rv: Add option for nested monitors and include sched")
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
Cc: stable(a)vger.kernel.org
---
kernel/trace/rv/rv.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
index 50344aa9f7f9..544acb1f6a33 100644
--- a/kernel/trace/rv/rv.c
+++ b/kernel/trace/rv/rv.c
@@ -225,7 +225,12 @@ bool rv_is_nested_monitor(struct rv_monitor_def *mdef)
*/
bool rv_is_container_monitor(struct rv_monitor_def *mdef)
{
- struct rv_monitor_def *next = list_next_entry(mdef, list);
+ struct rv_monitor_def *next;
+
+ if (list_is_last(&mdef->list, &rv_monitors_list))
+ return false;
+
+ next = list_next_entry(mdef, list);
return next->parent == mdef->monitor || !mdef->monitor->enable;
}
--
2.39.5
This series adds fine grained trap control in EL2 required for FEAT_PMUv3p9
registers like PMICNTR_EL0, PMICFILTR_EL0, and PMUACR_EL1 which are already
being used in the kernel. This is required to prevent their EL1 access trap
into EL2.
The following commits that enabled access into FEAT_PMUv3p9 registers have
already been merged upstream from 6.13 onwards.
d8226d8cfbaf ("perf: arm_pmuv3: Add support for Armv9.4 PMU instruction counter")
0bbff9ed8165 ("perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control")
The sysreg patches in this series are required for the final patch which
fixes the actual problem.
Anshuman Khandual (7):
arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1
arm64/sysreg: Add register fields for HDFGRTR2_EL2
arm64/sysreg: Add register fields for HDFGWTR2_EL2
arm64/sysreg: Add register fields for HFGITR2_EL2
arm64/sysreg: Add register fields for HFGRTR2_EL2
arm64/sysreg: Add register fields for HFGWTR2_EL2
arm64/boot: Enable EL2 requirements for FEAT_PMUv3p9
Documentation/arch/arm64/booting.rst | 22 ++++++
arch/arm64/include/asm/el2_setup.h | 25 +++++++
arch/arm64/tools/sysreg | 103 +++++++++++++++++++++++++++
3 files changed, 150 insertions(+)
--
2.30.2
The late init call just writes to omap4 registers as soon as
CONFIG_MFD_CPCAP is enabled without checking whether the
cpcap driver is actually there or the SoC is indeed an
OMAP4.
Rather do these things only with the right device combination.
Fixes booting the BT200 with said configuration enabled and non-factory
X-Loader and probably also some surprising behavior on other devices.
Fixes: c145649bf262 ("ARM: OMAP2+: Configure voltage controller for cpcap to low-speed")
CC: <stable(a)vger.kernel.org>
Signed-off-by: Andreas Kemnade <andreas(a)kemnade.info>
---
arch/arm/mach-omap2/pmic-cpcap.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm/mach-omap2/pmic-cpcap.c b/arch/arm/mach-omap2/pmic-cpcap.c
index 4f31e61c0c90..9f9a20274db8 100644
--- a/arch/arm/mach-omap2/pmic-cpcap.c
+++ b/arch/arm/mach-omap2/pmic-cpcap.c
@@ -264,7 +264,11 @@ int __init omap4_cpcap_init(void)
static int __init cpcap_late_init(void)
{
- omap4_vc_set_pmic_signaling(PWRDM_POWER_RET);
+ if (!of_find_compatible_node(NULL, NULL, "motorola,cpcap"))
+ return 0;
+
+ if (soc_is_omap443x() || soc_is_omap446x() || soc_is_omap447x())
+ omap4_vc_set_pmic_signaling(PWRDM_POWER_RET);
return 0;
}
--
2.39.5
Previously, commit ed129ec9057f ("KVM: x86: forcibly leave nested mode
on vCPU reset") addressed an issue where a triple fault occurring in
nested mode could lead to use-after-free scenarios. However, the commit
did not handle the analogous situation for System Management Mode (SMM).
This omission results in triggering a WARN when a vCPU reset occurs
while still in SMM mode, due to the check in kvm_vcpu_reset(). This
situation was reprodused using Syzkaller by:
1) Creating a KVM VM and vCPU
2) Sending a KVM_SMI ioctl to explicitly enter SMM
3) Executing invalid instructions causing consecutive exceptions and
eventually a triple fault
The issue manifests as follows:
WARNING: CPU: 0 PID: 25506 at arch/x86/kvm/x86.c:12112
kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
Modules linked in:
CPU: 0 PID: 25506 Comm: syz-executor.0 Not tainted
6.1.130-syzkaller-00157-g164fe5dde9b6 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
RIP: 0010:kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
Call Trace:
<TASK>
shutdown_interception+0x66/0xb0 arch/x86/kvm/svm/svm.c:2136
svm_invoke_exit_handler+0x110/0x530 arch/x86/kvm/svm/svm.c:3395
svm_handle_exit+0x424/0x920 arch/x86/kvm/svm/svm.c:3457
vcpu_enter_guest arch/x86/kvm/x86.c:10959 [inline]
vcpu_run+0x2c43/0x5a90 arch/x86/kvm/x86.c:11062
kvm_arch_vcpu_ioctl_run+0x50f/0x1cf0 arch/x86/kvm/x86.c:11283
kvm_vcpu_ioctl+0x570/0xf00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4122
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Considering that hardware CPUs exit SMM mode completely upon receiving
a triple fault by triggering a hardware reset (which inherently leads
to exiting SMM), explicitly perform SMM exit prior to the WARN check.
Although subsequent code clears vCPU hflags, including the SMM flag,
calling kvm_smm_changed ensures the exit from SMM is handled correctly
and explicitly, aligning precisely with hardware behavior.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: ed129ec9057f ("KVM: x86: forcibly leave nested mode on vCPU reset")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mikhail Lobanov <m.lobanov(a)rosa.ru>
---
arch/x86/kvm/x86.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab350bcd..f1c95c21703a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12409,6 +12409,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (is_guest_mode(vcpu))
kvm_leave_nested(vcpu);
+ if (is_smm(vcpu))
+ kvm_smm_changed(vcpu, false);
+
kvm_lapic_reset(vcpu, init_event);
WARN_ON_ONCE(is_guest_mode(vcpu) || is_smm(vcpu));
--
2.47.2
Hi All,
Chages since v1:
- left fixes only, improvements will be posted separately;
- Fixes: and -stable tags added to patch descriptions;
This series is an attempt to fix the violation of lazy MMU mode context
requirement as described for arch_enter_lazy_mmu_mode():
This mode can only be entered and left under the protection of
the page table locks for all page tables which may be modified.
On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:
[ 553.332108] preempt_count: 1, expected: 0
[ 553.332117] no locks held by multipathd/2116.
[ 553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
[ 553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
[ 553.332146] Call Trace:
[ 553.332152] [<00000000158de23a>] dump_stack_lvl+0xfa/0x150
[ 553.332167] [<0000000013e10d12>] __might_resched+0x57a/0x5e8
[ 553.332178] [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0
[ 553.332189] [<00000000144d5cdc>] __get_free_pages+0x2c/0x88
[ 553.332198] [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110
[ 553.332207] [<000000001447625c>] apply_to_pte_range+0x164/0x3c8
[ 553.332218] [<000000001448125a>] apply_to_pmd_range+0xda/0x318
[ 553.332226] [<000000001448181c>] __apply_to_page_range+0x384/0x768
[ 553.332233] [<0000000014481c28>] apply_to_page_range+0x28/0x38
[ 553.332241] [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98
[ 553.332249] [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90
[ 553.332257] [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260
[ 553.332265] [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360
[ 553.332274] [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378
[ 553.332284] [<0000000013d62726>] dup_task_struct+0x66/0x430
[ 553.332293] [<0000000013d63962>] copy_process+0x432/0x4b80
[ 553.332302] [<0000000013d68300>] kernel_clone+0xf0/0x7d0
[ 553.332311] [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8
[ 553.332400] [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118
[ 553.332410] [<0000000013c9d34c>] do_syscall+0x22c/0x328
[ 553.332419] [<00000000158e7366>] __do_syscall+0xce/0xf0
[ 553.332428] [<0000000015913260>] system_call+0x70/0x98
This exposes a KASAN issue fixed with patch 1 and apply_to_pte_range()
issue fixed with patch 3, while patch 2 is a prerequisite.
Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu
mode") looks like powerpc-only fix, yet not entirely conforming to the
above provided requirement (page tables itself are still not protected).
If I am not mistaken, xen and sparc are alike.
Thanks!
Alexander Gordeev (3):
kasan: Avoid sleepable page allocation from atomic context
mm: Cleanup apply_to_pte_range() routine
mm: Protect kernel pgtables in apply_to_pte_range()
mm/kasan/shadow.c | 9 +++------
mm/memory.c | 33 +++++++++++++++++++++------------
2 files changed, 24 insertions(+), 18 deletions(-)
--
2.45.2
The quilt patch titled
Subject: mm: protect kernel pgtables in apply_to_pte_range()
has been removed from the -mm tree. Its filename was
mm-protect-kernel-pgtables-in-apply_to_pte_range.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Alexander Gordeev <agordeev(a)linux.ibm.com>
Subject: mm: protect kernel pgtables in apply_to_pte_range()
Date: Tue, 8 Apr 2025 18:07:32 +0200
The lazy MMU mode can only be entered and left under the protection of the
page table locks for all page tables which may be modified. Yet, when it
comes to kernel mappings apply_to_pte_range() does not take any locks.
That does not conform arch_enter|leave_lazy_mmu_mode() semantics and could
potentially lead to re-schedulling a process while in lazy MMU mode or
racing on a kernel page table updates.
Link: https://lkml.kernel.org/r/ef8f6538b83b7fc3372602f90375348f9b4f3596.17441281…
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Guenetr Roeck <linux(a)roeck-us.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jeremy Fitzhardinge <jeremy(a)goop.org>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/shadow.c | 7 ++-----
mm/memory.c | 5 ++++-
2 files changed, 6 insertions(+), 6 deletions(-)
--- a/mm/kasan/shadow.c~mm-protect-kernel-pgtables-in-apply_to_pte_range
+++ a/mm/kasan/shadow.c
@@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pt
__memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
- spin_lock(&init_mm.page_table_lock);
if (likely(pte_none(ptep_get(ptep)))) {
set_pte_at(&init_mm, addr, ptep, pte);
page = 0;
}
- spin_unlock(&init_mm.page_table_lock);
+
if (page)
free_page(page);
+
return 0;
}
@@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(
page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
- spin_lock(&init_mm.page_table_lock);
-
if (likely(!pte_none(ptep_get(ptep)))) {
pte_clear(&init_mm, addr, ptep);
free_page(page);
}
- spin_unlock(&init_mm.page_table_lock);
return 0;
}
--- a/mm/memory.c~mm-protect-kernel-pgtables-in-apply_to_pte_range
+++ a/mm/memory.c
@@ -2926,6 +2926,7 @@ static int apply_to_pte_range(struct mm_
pte = pte_offset_kernel(pmd, addr);
if (!pte)
return err;
+ spin_lock(&init_mm.page_table_lock);
} else {
if (create)
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -2951,7 +2952,9 @@ static int apply_to_pte_range(struct mm_
arch_leave_lazy_mmu_mode();
- if (mm != &init_mm)
+ if (mm == &init_mm)
+ spin_unlock(&init_mm.page_table_lock);
+ else
pte_unmap_unlock(mapped_pte, ptl);
*mask |= PGTBL_PTE_MODIFIED;
_
Patches currently in -mm which might be from agordeev(a)linux.ibm.com are
The quilt patch titled
Subject: kasan: avoid sleepable page allocation from atomic context
has been removed from the -mm tree. Its filename was
kasan-avoid-sleepable-page-allocation-from-atomic-context.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Alexander Gordeev <agordeev(a)linux.ibm.com>
Subject: kasan: avoid sleepable page allocation from atomic context
Date: Tue, 8 Apr 2025 18:07:30 +0200
Patch series "mm: Fix apply_to_pte_range() vs lazy MMU mode", v2.
This series is an attempt to fix the violation of lazy MMU mode context
requirement as described for arch_enter_lazy_mmu_mode():
This mode can only be entered and left under the protection of
the page table locks for all page tables which may be modified.
On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:
[ 553.332108] preempt_count: 1, expected: 0
[ 553.332117] no locks held by multipathd/2116.
[ 553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
[ 553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
[ 553.332146] Call Trace:
[ 553.332152] [<00000000158de23a>] dump_stack_lvl+0xfa/0x150
[ 553.332167] [<0000000013e10d12>] __might_resched+0x57a/0x5e8
[ 553.332178] [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0
[ 553.332189] [<00000000144d5cdc>] __get_free_pages+0x2c/0x88
[ 553.332198] [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110
[ 553.332207] [<000000001447625c>] apply_to_pte_range+0x164/0x3c8
[ 553.332218] [<000000001448125a>] apply_to_pmd_range+0xda/0x318
[ 553.332226] [<000000001448181c>] __apply_to_page_range+0x384/0x768
[ 553.332233] [<0000000014481c28>] apply_to_page_range+0x28/0x38
[ 553.332241] [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98
[ 553.332249] [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90
[ 553.332257] [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260
[ 553.332265] [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360
[ 553.332274] [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378
[ 553.332284] [<0000000013d62726>] dup_task_struct+0x66/0x430
[ 553.332293] [<0000000013d63962>] copy_process+0x432/0x4b80
[ 553.332302] [<0000000013d68300>] kernel_clone+0xf0/0x7d0
[ 553.332311] [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8
[ 553.332400] [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118
[ 553.332410] [<0000000013c9d34c>] do_syscall+0x22c/0x328
[ 553.332419] [<00000000158e7366>] __do_syscall+0xce/0xf0
[ 553.332428] [<0000000015913260>] system_call+0x70/0x98
This exposes a KASAN issue fixed with patch 1 and apply_to_pte_range()
issue fixed with patch 3, while patch 2 is a prerequisite.
Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu
mode") looks like powerpc-only fix, yet not entirely conforming to the
above provided requirement (page tables itself are still not protected).
If I am not mistaken, xen and sparc are alike.
This patch (of 3):
apply_to_page_range() enters lazy MMU mode and then invokes
kasan_populate_vmalloc_pte() callback on each page table walk iteration.
The lazy MMU mode may only be entered only under protection of the page
table lock. However, the callback can go into sleep when trying to
allocate a single page.
Change __get_free_page() allocation mode from GFP_KERNEL to GFP_ATOMIC to
avoid scheduling out while in atomic context.
Link: https://lkml.kernel.org/r/cover.1744128123.git.agordeev@linux.ibm.com
Link: https://lkml.kernel.org/r/2d9f4ac4528701b59d511a379a60107fa608ad30.17441281…
Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Guenetr Roeck <linux(a)roeck-us.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jeremy Fitzhardinge <jeremy(a)goop.org>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/shadow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/kasan/shadow.c~kasan-avoid-sleepable-page-allocation-from-atomic-context
+++ a/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pt
if (likely(!pte_none(ptep_get(ptep))))
return 0;
- page = __get_free_page(GFP_KERNEL);
+ page = __get_free_page(GFP_ATOMIC);
if (!page)
return -ENOMEM;
_
Patches currently in -mm which might be from agordeev(a)linux.ibm.com are
mm-cleanup-apply_to_pte_range-routine.patch
mm-protect-kernel-pgtables-in-apply_to_pte_range.patch
Hello,
I'm investigating if v5.15 and early versions are vulnerable to the following CVEs. Could you please help confirm the following cases?
For CVE-2024-36912, the suggested fix is 211f514ebf1e ("Drivers: hv: vmbus: Track decrypted status in vmbus_gpadl") according to https://www.cve.org/CVERecord?id=CVE-2024-36912
It seems 211f514ebf1e is based on d4dccf353db8 ("Drivers: hv: vmbus: Mark vmbus ring buffer visible to host in Isolation VM") which was introduced since v5.16. For v5.15 and early versions, vmbus ring buffer hadn't been made visible to host, so there's no need to backport 211f514ebf1e to those versions, right?
For CVE-2024-36913, the suggested fix is 03f5a999adba ("Drivers: hv: vmbus: Leak pages if set_memory_encrypted() fails") according to https://www.cve.org/CVERecord?id=CVE-2024-36913
It seems 03f5a999adba is based on f2f136c05fb6 ("Drivers: hv: vmbus: Add SNP support for VMbus channel initiate message") which was introduced since v5.16. For v5.15 and early verions, monitor pages hadn't been made visible to host, so there's no need to backport 03f5a999adba to those versions, right?
Thanks,
Zhe
This series backports some recent fixes for SVE/KVM interactions from
Mark Rutland to v5.15.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v3:
- Explicitly include "KVM: arm64: Always start with clearing SVE flag on
load", it was included previously as part of a conflcit resolution.
- Link to v2: https://lore.kernel.org/r/20250403-stable-sve-5-15-v2-0-30a36a78a20a@kernel…
Changes in v2:
- Resend with Greg and the stable list added.
- Link to v1: https://lore.kernel.org/r/20250402-stable-sve-5-15-v1-0-84d0e5ff1102@kernel…
---
Fuad Tabba (1):
KVM: arm64: Calculate cptr_el2 traps on activating traps
Marc Zyngier (2):
KVM: arm64: Get rid of host SVE tracking/saving
KVM: arm64: Always start with clearing SVE flag on load
Mark Brown (4):
KVM: arm64: Discard any SVE state when entering KVM guests
arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE
arm64/fpsimd: Have KVM explicitly say which FP registers to save
arm64/fpsimd: Stop using TIF_SVE to manage register saving in KVM
Mark Rutland (4):
KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
KVM: arm64: Remove host FPSIMD saving for non-protected KVM
KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
KVM: arm64: Eagerly switch ZCR_EL{1,2}
arch/arm64/include/asm/fpsimd.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 17 +++--
arch/arm64/include/asm/kvm_hyp.h | 7 ++
arch/arm64/include/asm/processor.h | 7 ++
arch/arm64/kernel/fpsimd.c | 117 +++++++++++++++++++++++---------
arch/arm64/kernel/process.c | 3 +
arch/arm64/kernel/ptrace.c | 3 +
arch/arm64/kernel/signal.c | 3 +
arch/arm64/kvm/arm.c | 1 -
arch/arm64/kvm/fpsimd.c | 72 +++++++++-----------
arch/arm64/kvm/hyp/entry.S | 5 ++
arch/arm64/kvm/hyp/include/hyp/switch.h | 86 +++++++++++++++--------
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 9 ++-
arch/arm64/kvm/hyp/nvhe/switch.c | 52 +++++++++-----
arch/arm64/kvm/hyp/vhe/switch.c | 4 ++
arch/arm64/kvm/reset.c | 3 +
16 files changed, 266 insertions(+), 127 deletions(-)
---
base-commit: 0c935c049b5c196b83b968c72d348ae6fff83ea2
change-id: 20250326-stable-sve-5-15-bfd75482dcfa
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Chris Wilson <chris.p.wilson(a)intel.com>
commit 78a033433a5ae4fee85511ee075bc9a48312c79e upstream.
If we abort driver initialisation in the middle of gt/engine discovery,
some engines will be fully setup and some not. Those incompletely setup
engines only have 'engine->release == NULL' and so will leak any of the
common objects allocated.
v2:
- Drop the destroy_pinned_context() helper for now. It's not really
worth it with just a single callsite at the moment. (Janusz)
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Signed-off-by: Matt Roper <matthew.d.roper(a)intel.com>
Reviewed-by: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220915232654.3283095-2-matt…
Signed-off-by: Zhi Yang <Zhi.Yang(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Build test passed.
---
drivers/gpu/drm/i915/gt/intel_engine_cs.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index a19537706ed1..eb6f4d7f1e34 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -904,8 +904,13 @@ int intel_engines_init(struct intel_gt *gt)
return err;
err = setup(engine);
- if (err)
+ if (err) {
+ intel_engine_cleanup_common(engine);
return err;
+ }
+
+ /* The backend should now be responsible for cleanup */
+ GEM_BUG_ON(engine->release == NULL);
err = engine_init_common(engine);
if (err)
--
2.34.1
This series adds fine grained trap control in EL2 required for FEAT_PMUv3p9
registers like PMICNTR_EL0, PMICFILTR_EL0, and PMUACR_EL1 which are already
being used in the kernel. This is required to prevent their EL1 access trap
into EL2.
The following commits that enabled access into FEAT_PMUv3p9 registers have
already been merged upstream from 6.13 onwards.
d8226d8cfbaf ("perf: arm_pmuv3: Add support for Armv9.4 PMU instruction counter")
0bbff9ed8165 ("perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control")
The sysreg patches in this series are required for the final patch which
fixes the actual problem.
Anshuman Khandual (7):
arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1
arm64/sysreg: Add register fields for HDFGRTR2_EL2
arm64/sysreg: Add register fields for HDFGWTR2_EL2
arm64/sysreg: Add register fields for HFGITR2_EL2
arm64/sysreg: Add register fields for HFGRTR2_EL2
arm64/sysreg: Add register fields for HFGWTR2_EL2
arm64/boot: Enable EL2 requirements for FEAT_PMUv3p9
Documentation/arch/arm64/booting.rst | 22 ++++++
arch/arm64/include/asm/el2_setup.h | 25 +++++++
arch/arm64/tools/sysreg | 103 +++++++++++++++++++++++++++
3 files changed, 150 insertions(+)
--
2.30.2
From: Chris Wilson <chris.p.wilson(a)intel.com>
commit 78a033433a5ae4fee85511ee075bc9a48312c79e upstream.
If we abort driver initialisation in the middle of gt/engine discovery,
some engines will be fully setup and some not. Those incompletely setup
engines only have 'engine->release == NULL' and so will leak any of the
common objects allocated.
v2:
- Drop the destroy_pinned_context() helper for now. It's not really
worth it with just a single callsite at the moment. (Janusz)
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Signed-off-by: Matt Roper <matthew.d.roper(a)intel.com>
Reviewed-by: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220915232654.3283095-2-matt…
Signed-off-by: Zhi Yang <Zhi.Yang(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Build test passed.
---
drivers/gpu/drm/i915/gt/intel_engine_cs.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index eb99441e0ada..42cb3ad04d89 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -983,8 +983,13 @@ int intel_engines_init(struct intel_gt *gt)
return err;
err = setup(engine);
- if (err)
+ if (err) {
+ intel_engine_cleanup_common(engine);
return err;
+ }
+
+ /* The backend should now be responsible for cleanup */
+ GEM_BUG_ON(engine->release == NULL);
err = engine_init_common(engine);
if (err)
--
2.34.1
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 9f98a4f4e7216dbe366010b4cdcab6b220f229c4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025040844-busload-dumpling-45ff@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9f98a4f4e7216dbe366010b4cdcab6b220f229c4 Mon Sep 17 00:00:00 2001
From: Vishal Annapurve <vannapurve(a)google.com>
Date: Fri, 28 Feb 2025 01:44:15 +0000
Subject: [PATCH] x86/tdx: Fix arch_safe_halt() execution for TDX VMs
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE
handler will enable interrupts before TDCALL is routed to hypervisor
leading to missed wakeup events, as current TDX spec doesn't expose
interruptibility state information to allow #VE handler to selectively
enable interrupts.
Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from executing HLT instruction in STI-shadow.
But it missed the paravirt routine which can be reached via this path
as an example:
kvm_wait() =>
safe_halt() =>
raw_safe_halt() =>
arch_safe_halt() =>
irq.safe_halt() =>
pv_native_safe_halt()
To reliably handle arch_safe_halt() for TDX VMs, introduce explicit
dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt()
routines with TDX-safe versions that execute direct TDCALL and needed
interrupt flag updates. Executing direct TDCALL brings in additional
benefit of avoiding HLT related #VEs altogether.
As tested by Ryan Afranji:
"Tested with the specjbb2015 benchmark. It has heavy lock contention which leads
to many halt calls. TDX VMs suffered a poor score before this patchset.
Verified the major performance improvement with this patchset applied."
Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
Signed-off-by: Vishal Annapurve <vannapurve(a)google.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Tested-by: Ryan Afranji <afranji(a)google.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Brian Gerst <brgerst(a)gmail.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20250228014416.3925664-3-vannapurve@google.com
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 05b4eca156cf..f614c0522a0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -878,6 +878,7 @@ config INTEL_TDX_GUEST
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
depends on EFI_STUB
+ depends on PARAVIRT
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select X86_MCE
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 7772b01ab738..aa0eb4057226 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -14,6 +14,7 @@
#include <asm/ia32.h>
#include <asm/insn.h>
#include <asm/insn-eval.h>
+#include <asm/paravirt_types.h>
#include <asm/pgtable.h>
#include <asm/set_memory.h>
#include <asm/traps.h>
@@ -398,7 +399,7 @@ static int handle_halt(struct ve_info *ve)
return ve_instr_len(ve);
}
-void __cpuidle tdx_safe_halt(void)
+void __cpuidle tdx_halt(void)
{
const bool irq_disabled = false;
@@ -409,6 +410,16 @@ void __cpuidle tdx_safe_halt(void)
WARN_ONCE(1, "HLT instruction emulation failed\n");
}
+static void __cpuidle tdx_safe_halt(void)
+{
+ tdx_halt();
+ /*
+ * "__cpuidle" section doesn't support instrumentation, so stick
+ * with raw_* variant that avoids tracing hooks.
+ */
+ raw_local_irq_enable();
+}
+
static int read_msr(struct pt_regs *regs, struct ve_info *ve)
{
struct tdx_module_args args = {
@@ -1109,6 +1120,19 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_kexec_begin = tdx_kexec_begin;
x86_platform.guest.enc_kexec_finish = tdx_kexec_finish;
+ /*
+ * Avoid "sti;hlt" execution in TDX guests as HLT induces a #VE that
+ * will enable interrupts before HLT TDCALL invocation if executed
+ * in STI-shadow, possibly resulting in missed wakeup events.
+ *
+ * Modify all possible HLT execution paths to use TDX specific routines
+ * that directly execute TDCALL and toggle the interrupt state as
+ * needed after TDCALL completion. This also reduces HLT related #VEs
+ * in addition to having a reliable halt logic execution.
+ */
+ pv_ops.irq.safe_halt = tdx_safe_halt;
+ pv_ops.irq.halt = tdx_halt;
+
/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
* bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 65394aa9b49f..4a1922ec80cf 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -58,7 +58,7 @@ void tdx_get_ve_info(struct ve_info *ve);
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
-void tdx_safe_halt(void);
+void tdx_halt(void);
bool tdx_early_handle_ve(struct pt_regs *regs);
@@ -72,7 +72,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls);
#else
static inline void tdx_early_init(void) { };
-static inline void tdx_safe_halt(void) { };
+static inline void tdx_halt(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 91f6ff618852..962c3ce39323 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -939,7 +939,7 @@ void __init select_idle_routine(void)
static_call_update(x86_idle, mwait_idle);
} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
pr_info("using TDX aware idle routine\n");
- static_call_update(x86_idle, tdx_safe_halt);
+ static_call_update(x86_idle, tdx_halt);
} else {
static_call_update(x86_idle, default_idle);
}
From: Srinivasan Shanmugam <srinivasan.shanmugam(a)amd.com>
commit 15c2990e0f0108b9c3752d7072a97d45d4283aea upstream.
This commit adds null checks for the 'stream' and 'plane' variables in
the dcn30_apply_idle_power_optimizations function. These variables were
previously assumed to be null at line 922, but they were used later in
the code without checking if they were null. This could potentially lead
to a null pointer dereference, which would cause a crash.
The null checks ensure that 'stream' and 'plane' are not null before
they are used, preventing potential crashes.
Fixes the below static smatch checker:
drivers/gpu/drm/amd/amdgpu/../display/dc/hwss/dcn30/dcn30_hwseq.c:938 dcn30_apply_idle_power_optimizations() error: we previously assumed 'stream' could be null (see line 922)
drivers/gpu/drm/amd/amdgpu/../display/dc/hwss/dcn30/dcn30_hwseq.c:940 dcn30_apply_idle_power_optimizations() error: we previously assumed 'plane' could be null (see line 922)
Cc: Tom Chung <chiahsuan.chung(a)amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Cc: Bhawanpreet Lakha <Bhawanpreet.Lakha(a)amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Cc: Roman Li <roman.li(a)amd.com>
Cc: Hersen Wu <hersenxs.wu(a)amd.com>
Cc: Alex Hung <alex.hung(a)amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam(a)amd.com>
Reviewed-by: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Zhi Yang <Zhi.Yang(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Build test passed.
---
drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
index 81547178a934..30716b88136c 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
@@ -784,6 +784,9 @@ bool dcn30_apply_idle_power_optimizations(struct dc *dc, bool enable)
stream = dc->current_state->streams[0];
plane = (stream ? dc->current_state->stream_status[0].plane_states[0] : NULL);
+ if (!stream || !plane)
+ return false;
+
if (stream && plane) {
cursor_cache_enable = stream->cursor_position.enable &&
plane->address.grph.cursor_cache_addr.quad_part;
--
2.34.1
The patch below does not apply to the 6.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.13.y
git checkout FETCH_HEAD
git cherry-pick -x dd4f730b557ce701a2cd4f604bf1e57667bd8b6e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025040803-womb-decorated-10be@gregkh' --subject-prefix 'PATCH 6.13.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dd4f730b557ce701a2cd4f604bf1e57667bd8b6e Mon Sep 17 00:00:00 2001
From: Nathan Chancellor <nathan(a)kernel.org>
Date: Mon, 10 Feb 2025 21:28:25 -0500
Subject: [PATCH] ACPI: platform-profile: Fix CFI violation when accessing
sysfs files
When an attribute group is created with sysfs_create_group(), the
->sysfs_ops() callback is set to kobj_sysfs_ops, which sets the ->show()
and ->store() callbacks to kobj_attr_show() and kobj_attr_store()
respectively. These functions use container_of() to get the respective
callback from the passed attribute, meaning that these callbacks need to
be of the same type as the callbacks in 'struct kobj_attribute'.
However, ->show() and ->store() in the platform_profile driver are
defined for struct device_attribute with the help of DEVICE_ATTR_RO()
and DEVICE_ATTR_RW(), which results in a CFI violation when accessing
platform_profile or platform_profile_choices under /sys/firmware/acpi
because the types do not match:
CFI failure at kobj_attr_show+0x19/0x30 (target: platform_profile_choices_show+0x0/0x140; expected type: 0x7a69590c)
There is no functional issue from the type mismatch because the layout
of 'struct kobj_attribute' and 'struct device_attribute' are the same,
so the container_of() cast does not break anything aside from CFI.
Change the type of platform_profile_choices_show() and
platform_profile_{show,store}() to match the callbacks in
'struct kobj_attribute' and update the attribute variables to
match, which resolves the CFI violation.
Cc: All applicable <stable(a)vger.kernel.org>
Fixes: a2ff95e018f1 ("ACPI: platform: Add platform profile support")
Reported-by: John Rowley <lkml(a)johnrowley.me>
Closes: https://github.com/ClangBuiltLinux/linux/issues/2047
Tested-by: John Rowley <lkml(a)johnrowley.me>
Reviewed-by: Sami Tolvanen <samitolvanen(a)google.com>
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Reviewed-by: Mark Pearson <mpearson-lenovo(a)squebb.ca>
Tested-by: Mark Pearson <mpearson-lenovo(a)squebb.ca>
Link: https://patch.msgid.link/20250210-acpi-platform_profile-fix-cfi-violation-v…
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
diff --git a/drivers/acpi/platform_profile.c b/drivers/acpi/platform_profile.c
index fc92e43d0fe9..1b6317f759f9 100644
--- a/drivers/acpi/platform_profile.c
+++ b/drivers/acpi/platform_profile.c
@@ -260,14 +260,14 @@ static int _aggregate_choices(struct device *dev, void *data)
/**
* platform_profile_choices_show - Show the available profile choices for legacy sysfs interface
- * @dev: The device
+ * @kobj: The kobject
* @attr: The attribute
* @buf: The buffer to write to
*
* Return: The number of bytes written
*/
-static ssize_t platform_profile_choices_show(struct device *dev,
- struct device_attribute *attr,
+static ssize_t platform_profile_choices_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
char *buf)
{
unsigned long aggregate[BITS_TO_LONGS(PLATFORM_PROFILE_LAST)];
@@ -333,14 +333,14 @@ static int _store_and_notify(struct device *dev, void *data)
/**
* platform_profile_show - Show the current profile for legacy sysfs interface
- * @dev: The device
+ * @kobj: The kobject
* @attr: The attribute
* @buf: The buffer to write to
*
* Return: The number of bytes written
*/
-static ssize_t platform_profile_show(struct device *dev,
- struct device_attribute *attr,
+static ssize_t platform_profile_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
char *buf)
{
enum platform_profile_option profile = PLATFORM_PROFILE_LAST;
@@ -362,15 +362,15 @@ static ssize_t platform_profile_show(struct device *dev,
/**
* platform_profile_store - Set the profile for legacy sysfs interface
- * @dev: The device
+ * @kobj: The kobject
* @attr: The attribute
* @buf: The buffer to read from
* @count: The number of bytes to read
*
* Return: The number of bytes read
*/
-static ssize_t platform_profile_store(struct device *dev,
- struct device_attribute *attr,
+static ssize_t platform_profile_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
const char *buf, size_t count)
{
unsigned long choices[BITS_TO_LONGS(PLATFORM_PROFILE_LAST)];
@@ -401,12 +401,12 @@ static ssize_t platform_profile_store(struct device *dev,
return count;
}
-static DEVICE_ATTR_RO(platform_profile_choices);
-static DEVICE_ATTR_RW(platform_profile);
+static struct kobj_attribute attr_platform_profile_choices = __ATTR_RO(platform_profile_choices);
+static struct kobj_attribute attr_platform_profile = __ATTR_RW(platform_profile);
static struct attribute *platform_profile_attrs[] = {
- &dev_attr_platform_profile_choices.attr,
- &dev_attr_platform_profile.attr,
+ &attr_platform_profile_choices.attr,
+ &attr_platform_profile.attr,
NULL
};
From: Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
commit b25e11f978b63cb7857890edb3a698599cddb10e upstream.
This aligned BR/EDR JUST_WORKS method with LE which since 92516cd97fd4
("Bluetooth: Always request for user confirmation for Just Works")
always request user confirmation with confirm_hint set since the
likes of bluetoothd have dedicated policy around JUST_WORKS method
(e.g. main.conf:JustWorksRepairing).
CVE: CVE-2024-8805
Cc: stable(a)vger.kernel.org
Fixes: ba15a58b179e ("Bluetooth: Fix SSP acceptor just-works confirmation without MITM")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Tested-by: Kiran K <kiran.k(a)intel.com>
Signed-off-by: Xiangyu Chen <xiangyu.chen(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Verified the build test.
---
net/bluetooth/hci_event.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
index 50e21f67a73d..83af50c3838a 100644
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -4859,19 +4859,16 @@ static void hci_user_confirm_request_evt(struct hci_dev *hdev,
goto unlock;
}
- /* If no side requires MITM protection; auto-accept */
+ /* If no side requires MITM protection; use JUST_CFM method */
if ((!loc_mitm || conn->remote_cap == HCI_IO_NO_INPUT_OUTPUT) &&
(!rem_mitm || conn->io_capability == HCI_IO_NO_INPUT_OUTPUT)) {
- /* If we're not the initiators request authorization to
- * proceed from user space (mgmt_user_confirm with
- * confirm_hint set to 1). The exception is if neither
- * side had MITM or if the local IO capability is
- * NoInputNoOutput, in which case we do auto-accept
+ /* If we're not the initiator of request authorization and the
+ * local IO capability is not NoInputNoOutput, use JUST_WORKS
+ * method (mgmt_user_confirm with confirm_hint set to 1).
*/
if (!test_bit(HCI_CONN_AUTH_PEND, &conn->flags) &&
- conn->io_capability != HCI_IO_NO_INPUT_OUTPUT &&
- (loc_mitm || rem_mitm)) {
+ conn->io_capability != HCI_IO_NO_INPUT_OUTPUT) {
BT_DBG("Confirming auto-accept as acceptor");
confirm_hint = 1;
goto confirm;
--
2.34.1
The patch below does not apply to the 6.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.13.y
git checkout FETCH_HEAD
git cherry-pick -x e7607f7d6d81af71dcc5171278aadccc94d277cd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025040805-boaster-hazing-36c3@gregkh' --subject-prefix 'PATCH 6.13.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e7607f7d6d81af71dcc5171278aadccc94d277cd Mon Sep 17 00:00:00 2001
From: Nathan Chancellor <nathan(a)kernel.org>
Date: Thu, 20 Mar 2025 22:33:49 +0100
Subject: [PATCH] ARM: 9443/1: Require linker to support KEEP within OVERLAY
for DCE
ld.lld prior to 21.0.0 does not support using the KEEP keyword within an
overlay description, which may be needed to avoid discarding necessary
sections within an overlay with '--gc-sections', which can be enabled
for the kernel via CONFIG_LD_DEAD_CODE_DATA_ELIMINATION.
Disallow CONFIG_LD_DEAD_CODE_DATA_ELIMINATION without support for KEEP
within OVERLAY and introduce a macro, OVERLAY_KEEP, that can be used to
conditionally add KEEP when it is properly supported to avoid breaking
old versions of ld.lld.
Cc: stable(a)vger.kernel.org
Link: https://github.com/llvm/llvm-project/commit/381599f1fe973afad3094e55ec99b16…
Reviewed-by: Linus Walleij <linus.walleij(a)linaro.org>
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel(a)armlinux.org.uk>
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 202dbd17ad2f..25ed6f1a7c7a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -121,7 +121,7 @@ config ARM
select HAVE_KERNEL_XZ
select HAVE_KPROBES if !XIP_KERNEL && !CPU_ENDIAN_BE32 && !CPU_V7M
select HAVE_KRETPROBES if HAVE_KPROBES
- select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if (LD_VERSION >= 23600 || LD_IS_LLD)
+ select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if (LD_VERSION >= 23600 || LD_CAN_USE_KEEP_IN_OVERLAY)
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI
select HAVE_OPTPROBES if !THUMB2_KERNEL
diff --git a/arch/arm/include/asm/vmlinux.lds.h b/arch/arm/include/asm/vmlinux.lds.h
index 89697f204715..a54db342653a 100644
--- a/arch/arm/include/asm/vmlinux.lds.h
+++ b/arch/arm/include/asm/vmlinux.lds.h
@@ -34,6 +34,12 @@
#define NOCROSSREFS
#endif
+#ifdef CONFIG_LD_CAN_USE_KEEP_IN_OVERLAY
+#define OVERLAY_KEEP(x) KEEP(x)
+#else
+#define OVERLAY_KEEP(x) x
+#endif
+
/* Set start/end symbol names to the LMA for the section */
#define ARM_LMA(sym, section) \
sym##_start = LOADADDR(section); \
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b..fc994f5cd5db 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -129,6 +129,11 @@ config CC_HAS_COUNTED_BY
# https://github.com/llvm/llvm-project/pull/112636
depends on !(CC_IS_CLANG && CLANG_VERSION < 190103)
+config LD_CAN_USE_KEEP_IN_OVERLAY
+ # ld.lld prior to 21.0.0 did not support KEEP within an overlay description
+ # https://github.com/llvm/llvm-project/pull/130661
+ def_bool LD_IS_BFD || LLD_VERSION >= 210000
+
config RUSTC_HAS_COERCE_POINTEE
def_bool RUSTC_VERSION >= 108400
From: Chunguang Xu <chunguang.xu(a)shopee.com>
[ Upstream commit e5d574ab37f5f2e7937405613d9b1a724811e5ad ]
If a discard request needs to be retried, and that retry may fail before
a new special payload is added, a double free will result. Clear the
RQF_SPECIAL_LOAD when the request is cleaned.
Signed-off-by: Chunguang Xu <chunguang.xu(a)shopee.com>
Reviewed-by: Sagi Grimberg <sagi(a)grimberg.me>
Reviewed-by: Max Gurtovoy <mgurtovoy(a)nvidia.com>
Signed-off-by: Keith Busch <kbusch(a)kernel.org>
[Minor context change fixed]
Signed-off-by: Cliff Liu <donghua.liu(a)windriver.com>
Signed-off-by: He Zhe <Zhe.He(a)windriver.com>
---
Verified the build test.
---
drivers/nvme/host/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 019a6dbdcbc2..7d6aab68446e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -852,6 +852,7 @@ void nvme_cleanup_cmd(struct request *req)
clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
else
kfree(page_address(page) + req->special_vec.bv_offset);
+ req->rq_flags &= ~RQF_SPECIAL_PAYLOAD;
}
}
EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
--
2.34.1
Current xhci bus resume implementation prevents xHC host from generating
interrupts during high-speed USB 2 and super-speed USB 3 bus resume.
Only reason to disable interrupts during bus resume would be to prevent
the interrupt handler from interfering with the resume process of USB 2
ports.
Host initiated resume of USB 2 ports is done in two stages.
The xhci driver first transitions the port from 'U3' to 'Resume' state,
then wait in Resume for 20ms, and finally moves port to U0 state.
xhci driver can't prevent interrupts by keeping the xhci spinlock
due to this 20ms sleep.
Limit interrupt disabling to the USB 2 port resume case only.
resuming USB 2 ports in bus resume is only done in special cases where
USB 2 ports had to be forced to suspend during bus suspend.
The current way of preventing interrupts by clearing the 'Interrupt
Enable' (INTE) bit in USBCMD register won't prevent the Interrupter
registers 'Interrupt Pending' (IP), 'Event Handler Busy' (EHB) and
USBSTS register Event Interrupt (EINT) bits from being set.
New interrupts can't be issued before those bits are properly clered.
Disable interrupts by clearing the interrupter register 'Interrupt
Enable' (IE) bit instead. This way IP, EHB and INTE won't be set
before IE is enabled again and a new interrupt is triggered.
Reported-by: Devyn Liu <liudingyuan(a)huawei.com>
Closes: https://lore.kernel.org/linux-usb/b1a9e2d51b4d4ff7a304f77c5be8164e@huawei.c…
Cc: stable(a)vger.kernel.org
Tested-by: Devyn Liu <liudingyuan(a)huawei.com>
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-hub.c | 30 ++++++++++++++++--------------
drivers/usb/host/xhci.c | 4 ++--
drivers/usb/host/xhci.h | 2 ++
3 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index c0f226584a40..486347776cb2 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -1878,9 +1878,10 @@ int xhci_bus_resume(struct usb_hcd *hcd)
int max_ports, port_index;
int sret;
u32 next_state;
- u32 temp, portsc;
+ u32 portsc;
struct xhci_hub *rhub;
struct xhci_port **ports;
+ bool disabled_irq = false;
rhub = xhci_get_rhub(hcd);
ports = rhub->ports;
@@ -1896,17 +1897,20 @@ int xhci_bus_resume(struct usb_hcd *hcd)
return -ESHUTDOWN;
}
- /* delay the irqs */
- temp = readl(&xhci->op_regs->command);
- temp &= ~CMD_EIE;
- writel(temp, &xhci->op_regs->command);
-
/* bus specific resume for ports we suspended at bus_suspend */
- if (hcd->speed >= HCD_USB3)
+ if (hcd->speed >= HCD_USB3) {
next_state = XDEV_U0;
- else
+ } else {
next_state = XDEV_RESUME;
-
+ if (bus_state->bus_suspended) {
+ /*
+ * prevent port event interrupts from interfering
+ * with usb2 port resume process
+ */
+ xhci_disable_interrupter(xhci->interrupters[0]);
+ disabled_irq = true;
+ }
+ }
port_index = max_ports;
while (port_index--) {
portsc = readl(ports[port_index]->addr);
@@ -1974,11 +1978,9 @@ int xhci_bus_resume(struct usb_hcd *hcd)
(void) readl(&xhci->op_regs->command);
bus_state->next_statechange = jiffies + msecs_to_jiffies(5);
- /* re-enable irqs */
- temp = readl(&xhci->op_regs->command);
- temp |= CMD_EIE;
- writel(temp, &xhci->op_regs->command);
- temp = readl(&xhci->op_regs->command);
+ /* re-enable interrupter */
+ if (disabled_irq)
+ xhci_enable_interrupter(xhci->interrupters[0]);
spin_unlock_irqrestore(&xhci->lock, flags);
return 0;
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index ca390beda85b..90eb491267b5 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -322,7 +322,7 @@ static void xhci_zero_64b_regs(struct xhci_hcd *xhci)
xhci_info(xhci, "Fault detected\n");
}
-static int xhci_enable_interrupter(struct xhci_interrupter *ir)
+int xhci_enable_interrupter(struct xhci_interrupter *ir)
{
u32 iman;
@@ -335,7 +335,7 @@ static int xhci_enable_interrupter(struct xhci_interrupter *ir)
return 0;
}
-static int xhci_disable_interrupter(struct xhci_interrupter *ir)
+int xhci_disable_interrupter(struct xhci_interrupter *ir)
{
u32 iman;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 28b6264f8b87..242ab9fbc8ae 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1890,6 +1890,8 @@ int xhci_alloc_tt_info(struct xhci_hcd *xhci,
struct usb_tt *tt, gfp_t mem_flags);
int xhci_set_interrupter_moderation(struct xhci_interrupter *ir,
u32 imod_interval);
+int xhci_enable_interrupter(struct xhci_interrupter *ir);
+int xhci_disable_interrupter(struct xhci_interrupter *ir);
/* xHCI ring, segment, TRB, and TD functions */
dma_addr_t xhci_trb_virt_to_dma(struct xhci_segment *seg, union xhci_trb *trb);
--
2.43.0
From: Michal Pecio <michal.pecio(a)gmail.com>
This check is performed before prepare_transfer() and prepare_ring(), so
enqueue can already point at the final link TRB of a segment. And indeed
it will, some 0.4% of times this code is called.
Then enqueue + 1 is an invalid pointer. It will crash the kernel right
away or load some junk which may look like a link TRB and cause the real
link TRB to be replaced with a NOOP. This wouldn't end well.
Use a functionally equivalent test which doesn't dereference the pointer
and always gives correct result.
Something has crashed my machine twice in recent days while playing with
an Etron HC, and a control transfer stress test ran for confirmation has
just crashed it again. The same test passes with this patch applied.
Fixes: 5e1c67abc930 ("xhci: Fix control transfer error on Etron xHCI host")
Cc: stable(a)vger.kernel.org
Signed-off-by: Michal Pecio <michal.pecio(a)gmail.com>
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-ring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4e975caca235..b906bc2eea5f 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -3777,7 +3777,7 @@ int xhci_queue_ctrl_tx(struct xhci_hcd *xhci, gfp_t mem_flags,
* enqueue a No Op TRB, this can prevent the Setup and Data Stage
* TRB to be breaked by the Link TRB.
*/
- if (trb_is_link(ep_ring->enqueue + 1)) {
+ if (last_trb_on_seg(ep_ring->enq_seg, ep_ring->enqueue + 1)) {
field = TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state;
queue_trb(xhci, ep_ring, false, 0, 0,
TRB_INTR_TARGET(0), field);
--
2.43.0
commit c929d08df8bee855528b9d15b853c892c54e1eee upstream.
If the warning mode with disabled mitigation mode is used, then on each
CPU where the split lock occurred detection will be disabled in order to
make progress and delayed work will be scheduled, which then will enable
detection back.
Now it turns out that all CPUs use one global delayed work structure.
This leads to the fact that if a split lock occurs on several CPUs
at the same time (within 2 jiffies), only one CPU will schedule delayed
work, but the rest will not.
The return value of schedule_delayed_work_on() would have shown this,
but it is not checked in the code.
A diagram that can help to understand the bug reproduction:
- sld_update_msr() enables/disables SLD on both CPUs on the same core
- schedule_delayed_work_on() internally checks WORK_STRUCT_PENDING_BIT.
If a work has the 'pending' status, then schedule_delayed_work_on()
will return an error code and, most importantly, the work will not
be placed in the workqueue.
Let's say we have a multicore system on which split_lock_mitigate=0 and
a multithreaded application is running that calls splitlock in multiple
threads. Due to the fact that sld_update_msr() affects the entire core
(both CPUs), we will consider 2 CPUs from different cores. Let the 2
threads of this application schedule to CPU0 (core 0) and to CPU 2
(core 1), then:
| || |
| CPU 0 (core 0) || CPU 2 (core 1) |
|_________________________________||___________________________________|
| || |
| 1) SPLIT LOCK occured || |
| || |
| 2) split_lock_warn() || |
| || |
| 3) sysctl_sld_mitigate == 0 || |
| (work = &sl_reenable) || |
| || |
| 4) schedule_delayed_work_on() || |
| (reenable will be called || |
| after 2 jiffies on CPU 0) || |
| || |
| 5) disable SLD for core 0 || |
| || |
| ------------------------- || |
| || |
| || 6) SPLIT LOCK occured |
| || |
| || 7) split_lock_warn() |
| || |
| || 8) sysctl_sld_mitigate == 0 |
| || (work = &sl_reenable, |
| || the same address as in 3) ) |
| || |
| 2 jiffies || 9) schedule_delayed_work_on() |
| || fials because the work is in |
| || the pending state since 4). |
| || The work wasn't placed to the |
| || workqueue. reenable won't be |
| || called on CPU 2 |
| || |
| || 10) disable SLD for core 0 |
| || |
| || From now on SLD will |
| || never be reenabled on core 1 |
| || |
| ------------------------- || |
| || |
| 11) enable SLD for core 0 by || |
| __split_lock_reenable || |
| || |
If the application threads can be scheduled to all processor cores,
then over time there will be only one core left, on which SLD will be
enabled and split lock will be able to be detected; and on all other
cores SLD will be disabled all the time.
Most likely, this bug has not been noticed for so long because
sysctl_sld_mitigate default value is 1, and in this case a semaphore
is used that does not allow 2 different cores to have SLD disabled at
the same time, that is, strictly only one work is placed in the
workqueue.
In order to fix the warning mode with disabled mitigation mode,
delayed work has to be per-CPU. Implement it.
Fixes: 727209376f49 ("x86/split_lock: Add sysctl to control the misery mode")
Signed-off-by: Maksim Davydov <davydov-max(a)yandex-team.ru>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Tested-by: Guilherme G. Piccoli <gpiccoli(a)igalia.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ravi Bangoria <ravi.bangoria(a)amd.com>
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Link: https://lore.kernel.org/r/20250115131704.132609-1-davydov-max@yandex-team.ru
---
arch/x86/kernel/cpu/intel.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index b91f3d72bcdd..2c43a1423b09 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1204,7 +1204,13 @@ static void __split_lock_reenable(struct work_struct *work)
{
sld_update_msr(true);
}
-static DECLARE_DELAYED_WORK(sl_reenable, __split_lock_reenable);
+/*
+ * In order for each CPU to schedule its delayed work independently of the
+ * others, delayed work struct must be per-CPU. This is not required when
+ * sysctl_sld_mitigate is enabled because of the semaphore that limits
+ * the number of simultaneously scheduled delayed works to 1.
+ */
+static DEFINE_PER_CPU(struct delayed_work, sl_reenable);
/*
* If a CPU goes offline with pending delayed work to re-enable split lock
@@ -1225,7 +1231,7 @@ static int splitlock_cpu_offline(unsigned int cpu)
static void split_lock_warn(unsigned long ip)
{
- struct delayed_work *work;
+ struct delayed_work *work = NULL;
int cpu;
if (!current->reported_split_lock)
@@ -1247,11 +1253,17 @@ static void split_lock_warn(unsigned long ip)
if (down_interruptible(&buslock_sem) == -EINTR)
return;
work = &sl_reenable_unlock;
- } else {
- work = &sl_reenable;
}
cpu = get_cpu();
+
+ if (!work) {
+ work = this_cpu_ptr(&sl_reenable);
+ /* Deferred initialization of per-CPU struct */
+ if (!work->work.func)
+ INIT_DELAYED_WORK(work, __split_lock_reenable);
+ }
+
schedule_delayed_work_on(cpu, work, 2);
/* Disable split lock detection on this CPU to make progress */
--
2.34.1
commit c929d08df8bee855528b9d15b853c892c54e1eee upstream.
If the warning mode with disabled mitigation mode is used, then on each
CPU where the split lock occurred detection will be disabled in order to
make progress and delayed work will be scheduled, which then will enable
detection back.
Now it turns out that all CPUs use one global delayed work structure.
This leads to the fact that if a split lock occurs on several CPUs
at the same time (within 2 jiffies), only one CPU will schedule delayed
work, but the rest will not.
The return value of schedule_delayed_work_on() would have shown this,
but it is not checked in the code.
A diagram that can help to understand the bug reproduction:
- sld_update_msr() enables/disables SLD on both CPUs on the same core
- schedule_delayed_work_on() internally checks WORK_STRUCT_PENDING_BIT.
If a work has the 'pending' status, then schedule_delayed_work_on()
will return an error code and, most importantly, the work will not
be placed in the workqueue.
Let's say we have a multicore system on which split_lock_mitigate=0 and
a multithreaded application is running that calls splitlock in multiple
threads. Due to the fact that sld_update_msr() affects the entire core
(both CPUs), we will consider 2 CPUs from different cores. Let the 2
threads of this application schedule to CPU0 (core 0) and to CPU 2
(core 1), then:
| || |
| CPU 0 (core 0) || CPU 2 (core 1) |
|_________________________________||___________________________________|
| || |
| 1) SPLIT LOCK occured || |
| || |
| 2) split_lock_warn() || |
| || |
| 3) sysctl_sld_mitigate == 0 || |
| (work = &sl_reenable) || |
| || |
| 4) schedule_delayed_work_on() || |
| (reenable will be called || |
| after 2 jiffies on CPU 0) || |
| || |
| 5) disable SLD for core 0 || |
| || |
| ------------------------- || |
| || |
| || 6) SPLIT LOCK occured |
| || |
| || 7) split_lock_warn() |
| || |
| || 8) sysctl_sld_mitigate == 0 |
| || (work = &sl_reenable, |
| || the same address as in 3) ) |
| || |
| 2 jiffies || 9) schedule_delayed_work_on() |
| || fials because the work is in |
| || the pending state since 4). |
| || The work wasn't placed to the |
| || workqueue. reenable won't be |
| || called on CPU 2 |
| || |
| || 10) disable SLD for core 0 |
| || |
| || From now on SLD will |
| || never be reenabled on core 1 |
| || |
| ------------------------- || |
| || |
| 11) enable SLD for core 0 by || |
| __split_lock_reenable || |
| || |
If the application threads can be scheduled to all processor cores,
then over time there will be only one core left, on which SLD will be
enabled and split lock will be able to be detected; and on all other
cores SLD will be disabled all the time.
Most likely, this bug has not been noticed for so long because
sysctl_sld_mitigate default value is 1, and in this case a semaphore
is used that does not allow 2 different cores to have SLD disabled at
the same time, that is, strictly only one work is placed in the
workqueue.
In order to fix the warning mode with disabled mitigation mode,
delayed work has to be per-CPU. Implement it.
Fixes: 727209376f49 ("x86/split_lock: Add sysctl to control the misery mode")
Signed-off-by: Maksim Davydov <davydov-max(a)yandex-team.ru>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Tested-by: Guilherme G. Piccoli <gpiccoli(a)igalia.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ravi Bangoria <ravi.bangoria(a)amd.com>
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Link: https://lore.kernel.org/r/20250115131704.132609-1-davydov-max@yandex-team.ru
---
arch/x86/kernel/cpu/intel.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 38eeff91109f..6c0e0619e6d3 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1168,7 +1168,13 @@ static void __split_lock_reenable(struct work_struct *work)
{
sld_update_msr(true);
}
-static DECLARE_DELAYED_WORK(sl_reenable, __split_lock_reenable);
+/*
+ * In order for each CPU to schedule its delayed work independently of the
+ * others, delayed work struct must be per-CPU. This is not required when
+ * sysctl_sld_mitigate is enabled because of the semaphore that limits
+ * the number of simultaneously scheduled delayed works to 1.
+ */
+static DEFINE_PER_CPU(struct delayed_work, sl_reenable);
/*
* If a CPU goes offline with pending delayed work to re-enable split lock
@@ -1189,7 +1195,7 @@ static int splitlock_cpu_offline(unsigned int cpu)
static void split_lock_warn(unsigned long ip)
{
- struct delayed_work *work;
+ struct delayed_work *work = NULL;
int cpu;
if (!current->reported_split_lock)
@@ -1211,11 +1217,17 @@ static void split_lock_warn(unsigned long ip)
if (down_interruptible(&buslock_sem) == -EINTR)
return;
work = &sl_reenable_unlock;
- } else {
- work = &sl_reenable;
}
cpu = get_cpu();
+
+ if (!work) {
+ work = this_cpu_ptr(&sl_reenable);
+ /* Deferred initialization of per-CPU struct */
+ if (!work->work.func)
+ INIT_DELAYED_WORK(work, __split_lock_reenable);
+ }
+
schedule_delayed_work_on(cpu, work, 2);
/* Disable split lock detection on this CPU to make progress */
--
2.34.1
startup()/shutdown() callbacks access SIFIVE_SERIAL_IE_OFFS.
The register is also accessed from write() callback.
If console were printing and startup()/shutdown() callback
gets called, its access to the register could be overwritten.
Add port->lock to startup()/shutdown() callbacks to make sure
their access to SIFIVE_SERIAL_IE_OFFS is synchronized against
write() callback.
Fixes: 45c054d0815b ("tty: serial: add driver for the SiFive UART")
Signed-off-by: Ryo Takakura <ryotkkr98(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
This patch used be part of a series for converting sifive driver to
nbcon[0]. It's now sent seperatly as the rest of the series does not
need be applied to the stable branch.
Sincerely,
Ryo Takakura
[0] https://lore.kernel.org/all/20250405043833.397020-1-ryotkkr98@gmail.com/
---
drivers/tty/serial/sifive.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/tty/serial/sifive.c b/drivers/tty/serial/sifive.c
index 5904a2d4c..054a8e630 100644
--- a/drivers/tty/serial/sifive.c
+++ b/drivers/tty/serial/sifive.c
@@ -563,8 +563,11 @@ static void sifive_serial_break_ctl(struct uart_port *port, int break_state)
static int sifive_serial_startup(struct uart_port *port)
{
struct sifive_serial_port *ssp = port_to_sifive_serial_port(port);
+ unsigned long flags;
+ uart_port_lock_irqsave(&ssp->port, &flags);
__ssp_enable_rxwm(ssp);
+ uart_port_unlock_irqrestore(&ssp->port, flags);
return 0;
}
@@ -572,9 +575,12 @@ static int sifive_serial_startup(struct uart_port *port)
static void sifive_serial_shutdown(struct uart_port *port)
{
struct sifive_serial_port *ssp = port_to_sifive_serial_port(port);
+ unsigned long flags;
+ uart_port_lock_irqsave(&ssp->port, &flags);
__ssp_disable_rxwm(ssp);
__ssp_disable_txwm(ssp);
+ uart_port_unlock_irqrestore(&ssp->port, flags);
}
/**
--
2.34.1
In commit 7e119cff9d0a, "ocfs2: convert w_pages to w_folios" the
chunk page allocations became order 0 folio allocations. If an
allocation failed, the folio array entry should be NULL so the
error path can skip the entry. In the port it is -ENOMEM and
the error path panics trying to free this bad value.
Signed-off-by: Mark Tinguely <mark.tinguely(a)oracle.com>
Cc: stable(a)vger.kernel.org
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
---
fs/ocfs2/aops.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 40b6bce12951..89aadc6cdd87 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1071,6 +1071,7 @@ static int ocfs2_grab_folios_for_write(struct
address_space *mapping,
if (IS_ERR(wc->w_folios[i])) {
ret = PTR_ERR(wc->w_folios[i]);
mlog_errno(ret);
+ wc->w_folios[i] = NULL;
goto out;
}
}
--
2.39.5 (Apple Git-154)
Hi,
I would like to request a backport of
8b36447c9ae1 ("ASoC: Intel: adl: add 2xrt1316 audio configuration")
to 6.12 LTS and 6.14 stable kernel as we have at least one affected user:
https://github.com/thesofproject/linux/issues/5274
The topology file has been already released.
Thanks,
Péter
The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
single pages from biggest buddy") as the root cause of a 56.4%
regression in vm-scalability::lru-file-mmap-read.
Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
freelist movement during block conversion"), as the root cause for a
regression in worst-case zone->lock+irqoff hold times.
Both of these patches modify the page allocator's fallback path to be
less greedy in an effort to stave off fragmentation. The flip side of
this is that fallbacks are also less productive each time around,
which means the fallback search can run much more frequently.
Carlos' traces point to rmqueue_bulk() specifically, which tries to
refill the percpu cache by allocating a large batch of pages in a
loop. It highlights how once the native freelists are exhausted, the
fallback code first scans orders top-down for whole blocks to claim,
then falls back to a bottom-up search for the smallest buddy to steal.
For the next batch page, it goes through the same thing again.
This can be made more efficient. Since rmqueue_bulk() holds the
zone->lock over the entire batch, the freelists are not subject to
outside changes; when the search for a block to claim has already
failed, there is no point in trying again for the next page.
Modify __rmqueue() to remember the last successful fallback mode, and
restart directly from there on the next rmqueue_bulk() iteration.
Oliver confirms that this improves beyond the regression that the test
robot reported against c2f6ea38fc1b:
commit:
f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch
f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput
Carlos confirms that worst-case times are almost fully recovered
compared to before the earlier culprit patch:
2dd482ba627d (before freelist hygiene): 1ms
c0cd6f557b90 (after freelist hygiene): 90ms
next-20250319 (steal smallest buddy): 280ms
this patch : 8ms
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Reported-by: Carlos Song <carlos.song(a)nxp.com>
Tested-by: kernel test robot <oliver.sang(a)intel.com>
Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
Cc: stable(a)vger.kernel.org # 6.10+
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
---
mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++-------------
1 file changed, 74 insertions(+), 26 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f51aa6051a99..03b0d45ed45a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2194,11 +2194,11 @@ try_to_claim_block(struct zone *zone, struct page *page,
* The use of signed ints for order and current_order is a deliberate
* deviation from the rest of this file, to make the for loop
* condition simpler.
- *
- * Return the stolen page, or NULL if none can be found.
*/
+
+/* Try to claim a whole foreign block, take a page, expand the remainder */
static __always_inline struct page *
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
unsigned int alloc_flags)
{
struct free_area *area;
@@ -2236,14 +2236,26 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
page = try_to_claim_block(zone, page, current_order, order,
start_migratetype, fallback_mt,
alloc_flags);
- if (page)
- goto got_one;
+ if (page) {
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, fallback_mt);
+ return page;
+ }
}
- if (alloc_flags & ALLOC_NOFRAGMENT)
- return NULL;
+ return NULL;
+}
+
+/* Try to steal a single page from a foreign block */
+static __always_inline struct page *
+__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+{
+ struct free_area *area;
+ int current_order;
+ struct page *page;
+ int fallback_mt;
+ bool claim_block;
- /* No luck claiming pageblock. Find the smallest fallback page */
for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
area = &(zone->free_area[current_order]);
fallback_mt = find_suitable_fallback(area, current_order,
@@ -2253,25 +2265,28 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
page = get_page_from_free_area(area, fallback_mt);
page_del_and_expand(zone, page, order, current_order, fallback_mt);
- goto got_one;
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, fallback_mt);
+ return page;
}
return NULL;
-
-got_one:
- trace_mm_page_alloc_extfrag(page, order, current_order,
- start_migratetype, fallback_mt);
-
- return page;
}
+enum rmqueue_mode {
+ RMQUEUE_NORMAL,
+ RMQUEUE_CMA,
+ RMQUEUE_CLAIM,
+ RMQUEUE_STEAL,
+};
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
- unsigned int alloc_flags)
+ unsigned int alloc_flags, enum rmqueue_mode *mode)
{
struct page *page;
@@ -2290,16 +2305,47 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
}
}
- page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (alloc_flags & ALLOC_CMA)
+ /*
+ * Try the different freelists, native then foreign.
+ *
+ * The fallback logic is expensive and rmqueue_bulk() calls in
+ * a loop with the zone->lock held, meaning the freelists are
+ * not subject to any outside changes. Remember in *mode where
+ * we found pay dirt, to save us the search on the next call.
+ */
+ switch (*mode) {
+ case RMQUEUE_NORMAL:
+ page = __rmqueue_smallest(zone, order, migratetype);
+ if (page)
+ return page;
+ fallthrough;
+ case RMQUEUE_CMA:
+ if (alloc_flags & ALLOC_CMA) {
page = __rmqueue_cma_fallback(zone, order);
-
- if (!page)
- page = __rmqueue_fallback(zone, order, migratetype,
- alloc_flags);
+ if (page) {
+ *mode = RMQUEUE_CMA;
+ return page;
+ }
+ }
+ fallthrough;
+ case RMQUEUE_CLAIM:
+ page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
+ if (page) {
+ /* Replenished native freelist, back to normal mode */
+ *mode = RMQUEUE_NORMAL;
+ return page;
+ }
+ fallthrough;
+ case RMQUEUE_STEAL:
+ if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
+ page = __rmqueue_steal(zone, order, migratetype);
+ if (page) {
+ *mode = RMQUEUE_STEAL;
+ return page;
+ }
+ }
}
- return page;
+ return NULL;
}
/*
@@ -2311,6 +2357,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype, unsigned int alloc_flags)
{
+ enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
unsigned long flags;
int i;
@@ -2321,7 +2368,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
}
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
- alloc_flags);
+ alloc_flags, &rmqm);
if (unlikely(page == NULL))
break;
@@ -2934,6 +2981,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
{
struct page *page;
unsigned long flags;
+ enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
do {
page = NULL;
@@ -2945,7 +2993,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
- page = __rmqueue(zone, order, migratetype, alloc_flags);
+ page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
/*
* If the allocation fails, allow OOM handling and
--
2.49.0
From: yangge <yangge1116(a)126.com>
If a large number of CMA memory are configured in system (for example, the
CMA memory accounts for 50% of the system memory), starting a virtual
virtual machine with device passthrough, it will
call pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory.
Normally if a page is present and in CMA area, pin_user_pages_remote()
will migrate the page from CMA area to non-CMA area because of
FOLL_LONGTERM flag. But the current code will cause the migration failure
due to unexpected page refcounts, and eventually cause the virtual machine
fail to start.
If a page is added in LRU batch, its refcount increases one, remove the
page from LRU batch decreases one. Page migration requires the page is not
referenced by others except page mapping. Before migrating a page, we
should try to drain the page from LRU batch in case the page is in it,
however, folio_test_lru() is not sufficient to tell whether the page is
in LRU batch or not, if the page is in LRU batch, the migration will fail.
To solve the problem above, we modify the logic of adding to LRU batch.
Before adding a page to LRU batch, we clear the LRU flag of the page so
that we can check whether the page is in LRU batch by folio_test_lru(page).
It's quite valuable, because likely we don't want to blindly drain the LRU
batch simply because there is some unexpected reference on a page, as
described above.
This change makes the LRU flag of a page invisible for longer, which
may impact some programs. For example, as long as a page is on a LRU
batch, we cannot isolate it, and we cannot check if it's an LRU page.
Further, a page can now only be on exactly one LRU batch. This doesn't
seem to matter much, because a new page is allocated from buddy and
added to the lru batch, or be isolated, it's LRU flag may also be
invisible for a long time.
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
mm/swap.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)
V4:
Adjust commit message according to David's comments
V3:
Add fixes tag
V2:
Adjust code and commit message according to David's comments
diff --git a/mm/swap.c b/mm/swap.c
index dc205bd..9caf6b0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -211,10 +211,6 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
- /* block memcg migration while the folio moves between lru */
- if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
- continue;
-
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -255,11 +251,16 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
- !folio_test_unevictable(folio) && folio_test_lru(folio)) {
+ !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
unsigned long flags;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock_irqsave(&lru_rotate.lock, flags);
fbatch = this_cpu_ptr(&lru_rotate.fbatch);
folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
@@ -352,11 +353,15 @@ static void folio_activate_drain(int cpu)
void folio_activate(struct folio *folio)
{
- if (folio_test_lru(folio) && !folio_test_active(folio) &&
- !folio_test_unevictable(folio)) {
+ if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.activate);
folio_batch_add_and_move(fbatch, folio, folio_activate_fn);
@@ -700,6 +705,11 @@ void deactivate_file_folio(struct folio *folio)
return;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
@@ -716,11 +726,16 @@ void deactivate_file_folio(struct folio *folio)
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_lru(folio) && !folio_test_unevictable(folio) &&
- (folio_test_active(folio) || lru_gen_enabled())) {
+ if (!folio_test_unevictable(folio) && (folio_test_active(folio) ||
+ lru_gen_enabled())) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate);
folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn);
@@ -737,12 +752,16 @@ void folio_deactivate(struct folio *folio)
*/
void folio_mark_lazyfree(struct folio *folio)
{
- if (folio_test_lru(folio) && folio_test_anon(folio) &&
- folio_test_swapbacked(folio) && !folio_test_swapcache(folio) &&
- !folio_test_unevictable(folio)) {
+ if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree);
folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn);
--
2.7.4
Hello ,
These are available for sale. If you’re interested in purchasing
these, please email me
brand new and original
Brand New ST8000NM017B $70 EA
Brand New ST20000NM007D $100 EACH
Brand New ST4000NM000A $30 EA
Brand New WD80EFPX $60 EA
Brand New WD101PURZ $70 EA
Brand New CISCO C9300-48UXM-E
Available 5
$21800 EACH
Intel Xeon Gold 5418Y Processors
QTY28 $780 each
Brand New C9200L-48T-4X-E $1000 EAC
Brand New N9K-C93108TC-FX-24 Nexus
9300-FX w/ 24p 100M/1/10GT & 6p 40/100G
Available 4
$3000 each
Brand New NVIDIA GeForce RTX 4090 Founders
Edition 24GB - QTY: 56 - $700 each
Brand new Palit NVIDIA GeForce RTX 5080
GamingPro OC card with full manufacturer
QTY 48 $750 EAC
BRAND NEW - ASUS TUF Gaming GeForce RTX ™ 5080 16GB
GDDR7 OC Edition Gaming Graphics Card SEALED
QTY50 $700 EACH
Condition: Grade A
Used HP EliteBook 840 G7 i7-10610U 16GB RAM 512GB
SSD Windows 11 Pro TOUCH Screen
QTY 30 USD 100 each
Condition: Grade A
Used HP EliteBook 850 G8 15.6" FHD,
INTEL I7, 256GB SSD, 8GB RAM Win11
QTY50 $240 EACH
SK Hynix 48GB DDR5 4800 1Rx4 PC5-4800B-
PF0-1010-XT 288pin Server EC4 RDIMM RAM
QTY 239 $50 EACH
-----------------------------------------------------------------
---------------
Best Regards,
Winston Taylor
300 Laird St, Wilkes-Barre, PA 18702, USA
Mobile: +1 (570) 890-5512
Email: winstontaylor(a)theleadingone.net
www.theleadingone.net
Hello ,
These are available for sale. If you’re interested in purchasing
these, please email me
brand new and original
Brand New ST8000NM017B $70 EA
Brand New ST20000NM007D $100 EACH
Brand New ST4000NM000A $30 EA
Brand New WD80EFPX $60 EA
Brand New WD101PURZ $70 EA
Brand New CISCO C9300-48UXM-E
Available 5
$21800 EACH
Intel Xeon Gold 5418Y Processors
QTY28 $780 each
Brand New C9200L-48T-4X-E $1000 EAC
Brand New N9K-C93108TC-FX-24 Nexus
9300-FX w/ 24p 100M/1/10GT & 6p 40/100G
Available 4
$3000 each
Brand New NVIDIA GeForce RTX 4090 Founders
Edition 24GB - QTY: 56 - $700 each
Brand new Palit NVIDIA GeForce RTX 5080
GamingPro OC card with full manufacturer
QTY 48 $750 EAC
BRAND NEW - ASUS TUF Gaming GeForce RTX ™ 5080 16GB
GDDR7 OC Edition Gaming Graphics Card SEALED
QTY50 $700 EACH
Condition: Grade A
Used HP EliteBook 840 G7 i7-10610U 16GB RAM 512GB
SSD Windows 11 Pro TOUCH Screen
QTY 30 USD 100 each
Condition: Grade A
Used HP EliteBook 850 G8 15.6" FHD,
INTEL I7, 256GB SSD, 8GB RAM Win11
QTY50 $240 EACH
SK Hynix 48GB DDR5 4800 1Rx4 PC5-4800B-
PF0-1010-XT 288pin Server EC4 RDIMM RAM
QTY 239 $50 EACH
-----------------------------------------------------------------
---------------
Best Regards,
Winston Taylor
300 Laird St, Wilkes-Barre, PA 18702, USA
Mobile: +1 (570) 890-5512
Email: winstontaylor(a)theleadingone.net
www.theleadingone.net
Hello ,
These are available for sale. If you’re interested in purchasing
these, please email me
brand new and original
Brand New ST8000NM017B $70 EA
Brand New ST20000NM007D $100 EACH
Brand New ST4000NM000A $30 EA
Brand New WD80EFPX $60 EA
Brand New WD101PURZ $70 EA
Brand New CISCO C9300-48UXM-E
Available 5
$21800 EACH
Intel Xeon Gold 5418Y Processors
QTY28 $780 each
Brand New C9200L-48T-4X-E $1000 EAC
Brand New N9K-C93108TC-FX-24 Nexus
9300-FX w/ 24p 100M/1/10GT & 6p 40/100G
Available 4
$3000 each
Brand New NVIDIA GeForce RTX 4090 Founders
Edition 24GB - QTY: 56 - $700 each
Brand new Palit NVIDIA GeForce RTX 5080
GamingPro OC card with full manufacturer
QTY 48 $750 EAC
BRAND NEW - ASUS TUF Gaming GeForce RTX ™ 5080 16GB
GDDR7 OC Edition Gaming Graphics Card SEALED
QTY50 $700 EACH
Condition: Grade A
Used HP EliteBook 840 G7 i7-10610U 16GB RAM 512GB
SSD Windows 11 Pro TOUCH Screen
QTY 30 USD 100 each
Condition: Grade A
Used HP EliteBook 850 G8 15.6" FHD,
INTEL I7, 256GB SSD, 8GB RAM Win11
QTY50 $240 EACH
SK Hynix 48GB DDR5 4800 1Rx4 PC5-4800B-
PF0-1010-XT 288pin Server EC4 RDIMM RAM
QTY 239 $50 EACH
-----------------------------------------------------------------
---------------
Best Regards,
Winston Taylor
300 Laird St, Wilkes-Barre, PA 18702, USA
Mobile: +1 (570) 890-5512
Email: winstontaylor(a)theleadingone.net
www.theleadingone.net
The patch titled
Subject: mm: fix apply_to_existing_page_range()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-fix-apply_to_existing_page_range.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm: fix apply_to_existing_page_range()
Date: Wed, 9 Apr 2025 12:40:43 +0300
In the case of apply_to_existing_page_range(), apply_to_pte_range() is
reached with 'create' set to false. When !create, the loop over the PTE
page table is broken.
apply_to_pte_range() will only move to the next PTE entry if 'create' is
true or if the current entry is not pte_none().
This means that the user of apply_to_existing_page_range() will not have
'fn' called for any entries after the first pte_none() in the PTE page
table.
Fix the loop logic in apply_to_pte_range().
There are no known runtime issues from this, but the fix is trivial enough
for stable@ even without a known buggy user.
Link: https://lkml.kernel.org/r/20250409094043.1629234-1-kirill.shutemov@linux.in…
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Fixes: be1db4753ee6 ("mm/memory.c: add apply_to_existing_page_range() helper")
Cc: Daniel Axtens <dja(a)axtens.net>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/memory.c~mm-fix-apply_to_existing_page_range
+++ a/mm/memory.c
@@ -2943,11 +2943,11 @@ static int apply_to_pte_range(struct mm_
if (fn) {
do {
if (create || !pte_none(ptep_get(pte))) {
- err = fn(pte++, addr, data);
+ err = fn(pte, addr, data);
if (err)
break;
}
- } while (addr += PAGE_SIZE, addr != end);
+ } while (pte++, addr += PAGE_SIZE, addr != end);
}
arch_leave_lazy_mmu_mode();
_
Patches currently in -mm which might be from kirill.shutemov(a)linux.intel.com are
mm-page_alloc-fix-deadlock-on-cpu_hotplug_lock-in-__accept_page.patch
mm-fix-apply_to_existing_page_range.patch
The patch titled
Subject: alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "T.J. Mercier" <tjmercier(a)google.com>
Subject: alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate
Date: Wed, 9 Apr 2025 22:51:11 +0000
alloc_pages_bulk_node() may partially succeed and allocate fewer than the
requested nr_pages. There are several conditions under which this can
occur, but we have encountered the case where CONFIG_PAGE_OWNER is enabled
causing all bulk allocations to always fallback to single page allocations
due to commit 187ad460b841 ("mm/page_alloc: avoid page allocator recursion
with pagesets.lock held").
Currently vm_module_tags_populate() immediately fails when
alloc_pages_bulk_node() returns fewer than the requested number of pages.
When this happens memory allocation profiling gets disabled, for example
[ 14.297583] [9: modprobe: 465] Failed to allocate memory for allocation tags in the module scsc_wlan. Memory allocation profiling is disabled!
[ 14.299339] [9: modprobe: 465] modprobe: Failed to insmod '/vendor/lib/modules/scsc_wlan.ko' with args '': Out of memory
This patch causes vm_module_tags_populate() to retry bulk allocations for
the remaining memory instead of failing immediately which will avoid the
disablement of memory allocation profiling.
Link: https://lkml.kernel.org/r/20250409225111.3770347-1-tjmercier@google.com
Fixes: 0f9b685626da ("alloc_tag: populate memory for module tags as needed")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
Reported-by: Janghyuck Kim <janghyuck.kim(a)samsung.com>
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/alloc_tag.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
--- a/lib/alloc_tag.c~alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate
+++ a/lib/alloc_tag.c
@@ -422,11 +422,20 @@ static int vm_module_tags_populate(void)
unsigned long old_shadow_end = ALIGN(phys_end, MODULE_ALIGN);
unsigned long new_shadow_end = ALIGN(new_end, MODULE_ALIGN);
unsigned long more_pages;
- unsigned long nr;
+ unsigned long nr = 0;
more_pages = ALIGN(new_end - phys_end, PAGE_SIZE) >> PAGE_SHIFT;
- nr = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN,
- NUMA_NO_NODE, more_pages, next_page);
+ while (nr < more_pages) {
+ unsigned long allocated;
+
+ allocated = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN,
+ NUMA_NO_NODE, more_pages - nr, next_page + nr);
+
+ if (!allocated)
+ break;
+ nr += allocated;
+ }
+
if (nr < more_pages ||
vmap_pages_range(phys_end, phys_end + (nr << PAGE_SHIFT), PAGE_KERNEL,
next_page, PAGE_SHIFT) < 0) {
_
Patches currently in -mm which might be from tjmercier(a)google.com are
alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate.patch
The patch titled
Subject: alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "T.J. Mercier" <tjmercier(a)google.com>
Subject: alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate
Date: Wed, 9 Apr 2025 19:54:47 +0000
alloc_pages_bulk_node may partially succeed and allocate fewer than the
requested nr_pages. There are several conditions under which this can
occur, but we have encountered the case where CONFIG_PAGE_OWNER is enabled
causing all bulk allocations to always fallback to single page allocations
due to commit 187ad460b841 ("mm/page_alloc: avoid page allocator recursion
with pagesets.lock held").
Currently vm_module_tags_populate immediately fails when
alloc_pages_bulk_node returns fewer than the requested number of pages.
This patch causes vm_module_tags_populate to retry bulk allocations for
the remaining memory instead.
Link: https://lkml.kernel.org/r/20250409195448.3697351-1-tjmercier@google.com
Fixes: 187ad460b841 ("mm/page_alloc: avoid page allocator recursion with pagesets.lock held")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
Reported-by: Janghyuck Kim <janghyuck.kim(a)samsung.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/alloc_tag.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
--- a/lib/alloc_tag.c~alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate
+++ a/lib/alloc_tag.c
@@ -422,11 +422,20 @@ static int vm_module_tags_populate(void)
unsigned long old_shadow_end = ALIGN(phys_end, MODULE_ALIGN);
unsigned long new_shadow_end = ALIGN(new_end, MODULE_ALIGN);
unsigned long more_pages;
- unsigned long nr;
+ unsigned long nr = 0;
more_pages = ALIGN(new_end - phys_end, PAGE_SIZE) >> PAGE_SHIFT;
- nr = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN,
- NUMA_NO_NODE, more_pages, next_page);
+ while (nr < more_pages) {
+ unsigned long allocated;
+
+ allocated = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN,
+ NUMA_NO_NODE, more_pages - nr, next_page + nr);
+
+ if (!allocated)
+ break;
+ nr += allocated;
+ }
+
if (nr < more_pages ||
vmap_pages_range(phys_end, phys_end + (nr << PAGE_SHIFT), PAGE_KERNEL,
next_page, PAGE_SHIFT) < 0) {
_
Patches currently in -mm which might be from tjmercier(a)google.com are
alloc_tag-handle-incomplete-bulk-allocations-in-vm_module_tags_populate.patch
struct rdma_cm_id has member "struct work_struct net_work"
that is reused for enqueuing cma_netevent_work_handler()s
onto cma_wq.
Below crash[1] can occur if more than one call to
cma_netevent_callback() occurs in quick succession,
which further enqueues cma_netevent_work_handler()s for the
same rdma_cm_id, overwriting any previously queued work-item(s)
that was just scheduled to run i.e. there is no guarantee
the queued work item may run between two successive calls
to cma_netevent_callback() and the 2nd INIT_WORK would overwrite
the 1st work item (for the same rdma_cm_id), despite grabbing
id_table_lock during enqueue.
Also drgn analysis [2] indicates the work item was likely overwritten.
Fix this by moving the INIT_WORK() to __rdma_create_id(),
so that it doesn't race with any existing queue_work() or
its worker thread.
[1] Trimmed crash stack:
=============================================
BUG: kernel NULL pointer dereference, address: 0000000000000008
kworker/u256:6 ... 6.12.0-0...
Workqueue: cma_netevent_work_handler [rdma_cm] (rdma_cm)
RIP: 0010:process_one_work+0xba/0x31a
Call Trace:
worker_thread+0x266/0x3a0
kthread+0xcf/0x100
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1a/0x30
=============================================
[2] drgn crash analysis:
>>> trace = prog.crashed_thread().stack_trace()
>>> trace
(0) crash_setup_regs (./arch/x86/include/asm/kexec.h:111:15)
(1) __crash_kexec (kernel/crash_core.c:122:4)
(2) panic (kernel/panic.c:399:3)
(3) oops_end (arch/x86/kernel/dumpstack.c:382:3)
...
(8) process_one_work (kernel/workqueue.c:3168:2)
(9) process_scheduled_works (kernel/workqueue.c:3310:3)
(10) worker_thread (kernel/workqueue.c:3391:4)
(11) kthread (kernel/kthread.c:389:9)
Line workqueue.c:3168 for this kernel version is in process_one_work():
3168 strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN);
>>> trace[8]["work"]
*(struct work_struct *)0xffff92577d0a21d8 = {
.data = (atomic_long_t){
.counter = (s64)536870912, <=== Note
},
.entry = (struct list_head){
.next = (struct list_head *)0xffff924d075924c0,
.prev = (struct list_head *)0xffff924d075924c0,
},
.func = (work_func_t)cma_netevent_work_handler+0x0 = 0xffffffffc2cec280,
}
Suspicion is that pwq is NULL:
>>> trace[8]["pwq"]
(struct pool_workqueue *)<absent>
In process_one_work(), pwq is assigned from:
struct pool_workqueue *pwq = get_work_pwq(work);
and get_work_pwq() is:
static struct pool_workqueue *get_work_pwq(struct work_struct *work)
{
unsigned long data = atomic_long_read(&work->data);
if (data & WORK_STRUCT_PWQ)
return work_struct_pwq(data);
else
return NULL;
}
WORK_STRUCT_PWQ is 0x4:
>>> print(repr(prog['WORK_STRUCT_PWQ']))
Object(prog, 'enum work_flags', value=4)
But work->data is 536870912 which is 0x20000000.
So, get_work_pwq() returns NULL and we crash in process_one_work():
3168 strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN);
=============================================
Fixes: 925d046e7e52 ("RDMA/core: Add a netevent notifier to cma")
Cc: stable(a)vger.kernel.org
Co-developed-by: Håkon Bugge <haakon.bugge(a)oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge(a)oracle.com>
Signed-off-by: Sharath Srinivasan <sharath.srinivasan(a)oracle.com>
---
v1->v2 cc:stable@vger.kernel.org
---
drivers/infiniband/core/cma.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 91db10515d74..176d0b3e4488 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -72,6 +72,8 @@ static const char * const cma_events[] = {
static void cma_iboe_set_mgid(struct sockaddr *addr, union ib_gid *mgid,
enum ib_gid_type gid_type);
+static void cma_netevent_work_handler(struct work_struct *_work);
+
const char *__attribute_const__ rdma_event_msg(enum rdma_cm_event_type event)
{
size_t index = event;
@@ -1033,6 +1035,7 @@ __rdma_create_id(struct net *net, rdma_cm_event_handler event_handler,
get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
id_priv->id.route.addr.dev_addr.net = get_net(net);
id_priv->seq_num &= 0x00ffffff;
+ INIT_WORK(&id_priv->id.net_work, cma_netevent_work_handler);
rdma_restrack_new(&id_priv->res, RDMA_RESTRACK_CM_ID);
if (parent)
@@ -5227,7 +5230,6 @@ static int cma_netevent_callback(struct notifier_block *self,
if (!memcmp(current_id->id.route.addr.dev_addr.dst_dev_addr,
neigh->ha, ETH_ALEN))
continue;
- INIT_WORK(¤t_id->id.net_work, cma_netevent_work_handler);
cma_id_get(current_id);
queue_work(cma_wq, ¤t_id->id.net_work);
}
--
2.39.5 (Apple Git-154)
The function brcmf_usb_dlneeded() calls the function brcmf_usb_dl_cmd()
but dose not check its return value. The 'id.chiprev' is uninitialized if
the function brcmf_usb_dl_cmd() fails, and may propagate to
'devinfo->bus_pub.chiprev'.
Add error handling for brcmf_usb_dl_cmd() to return the function if the
'id.chiprev' is uninitialized.
Fixes: 71bb244ba2fd ("brcm80211: fmac: add USB support for bcm43235/6/8 chipsets")
Cc: stable(a)vger.kernel.org # v3.4+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
index 2821c27f317e..50dddac8a2ab 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
@@ -790,6 +790,7 @@ brcmf_usb_dlneeded(struct brcmf_usbdev_info *devinfo)
{
struct bootrom_id_le id;
u32 chipid, chiprev;
+ int err;
brcmf_dbg(USB, "Enter\n");
@@ -798,7 +799,11 @@ brcmf_usb_dlneeded(struct brcmf_usbdev_info *devinfo)
/* Check if firmware downloaded already by querying runtime ID */
id.chip = cpu_to_le32(0xDEAD);
- brcmf_usb_dl_cmd(devinfo, DL_GETVER, &id, sizeof(id));
+ err = brcmf_usb_dl_cmd(devinfo, DL_GETVER, &id, sizeof(id));
+ if (err) {
+ brcmf_err("DL_GETID Failed\n");
+ return false;
+ }
chipid = le32_to_cpu(id.chip);
chiprev = le32_to_cpu(id.chiprev);
--
2.42.0.windows.2
The of_find_device_by_node() function increments the reference count of
the embedded device, which should be released with put_device() when it
is no longer needed.
In ill_acc_of_setup(), put_device() is only called on error paths, but
not on the success path. Fix this by calling put_device() before
returning successfully.
Compile-tested only.
Cc: stable(a)vger.kernel.org
Fixes: 5433acd81e873 ("MIPS: ralink: add illegal access driver")
Signed-off-by: Thorsten Blum <thorsten.blum(a)linux.dev>
---
arch/mips/ralink/ill_acc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/mips/ralink/ill_acc.c b/arch/mips/ralink/ill_acc.c
index 25341b2319d0..6d1d829854b6 100644
--- a/arch/mips/ralink/ill_acc.c
+++ b/arch/mips/ralink/ill_acc.c
@@ -84,6 +84,7 @@ static int __init ill_acc_of_setup(void)
rt_memc_w32(ILL_INT_STATUS, REG_ILL_ACC_TYPE);
dev_info(&pdev->dev, "irq registered\n");
+ put_device(&pdev->dev);
return 0;
}
--
2.49.0
The RK3399 Puma SoM contains the internal Cypress CYUSB3304 USB
hub, that shows instability due to improper reset pin configuration.
Currently reset pin is modeled as a vcc5v0_host regulator, that
might result in too short reset pulse duration.
Starting with the v6.6, the Onboard USB hub driver (later renamed
to Onboard USB dev) contains support for Cypress HX3 hub family.
It can be now used to correctly model the RK3399 Puma SoM hardware.
The first commits in this series fix the onboard USB dev driver to
support all HX3 hub variants, including the CYUSB3304 found in
the RK3399 Puma SoM.
This allows to introduce fix for internal USB hub instability on
RK3399 Puma, by replacing the vcc5v0_host regulator with
cy3304_reset, used inside the hub node.
Please be aware that the patch that fixes USB hub instability in
arch/arm64/boot/dts/rockchip/rk3399-puma.dtsi can me merged only
after updating the Onboard USB dev driver, otherwise the hub
will not work.
Two last commits in the series disable unrouted USB controllers
and PHYs on RK3399 Puma SOM and Haikou carrier board, with no
intended functional changes.
Signed-off-by: Lukasz Czechowski <lukasz.czechowski(a)thaumatec.com>
---
Lukasz Czechowski (3):
usb: misc: onboard_usb_dev: fix support for Cypress HX3 hubs
dt-bindings: usb: cypress,hx3: Add support for all variants
arm64: dts: rockchip: fix internal USB hub instability on RK3399 Puma
Quentin Schulz (2):
arm64: dts: rockchip: disable unrouted USB controllers and PHY on RK3399 Puma
arm64: dts: rockchip: disable unrouted USB controllers and PHY on RK3399 Puma with Haikou
.../devicetree/bindings/usb/cypress,hx3.yaml | 6 +++
.../arm64/boot/dts/rockchip/rk3399-puma-haikou.dts | 8 ----
arch/arm64/boot/dts/rockchip/rk3399-puma.dtsi | 43 ++++++++++------------
drivers/usb/misc/onboard_usb_dev.c | 10 ++++-
drivers/usb/misc/onboard_usb_dev.h | 6 +++
5 files changed, 39 insertions(+), 34 deletions(-)
---
base-commit: 1e26c5e28ca5821a824e90dd359556f5e9e7b89f
change-id: 20250326-onboard_usb_dev-a7c063a8a515
Best regards,
--
Lukasz Czechowski <lukasz.czechowski(a)thaumatec.com>
Good day/night to everyone.
I am sorry about the duplicate post, I had missed that I was also supposed to CC the regressions list.
I have had S3 sleep broken for well over half a year at this point where the kernel simply crashes upon entering sleep, and the PC turns off (literally stops pulling power from the power supply), and turning it on, at first, causes to to run through 1 on/off cycle, with I believe is a diagnostic step on this board, as it happens whenever something causes the PC to unexpectedly turn off (unless the source is external, like a blackout, in which case once power is restored and I turn on the PC, it turns on normally).
At first I thought that it might be an AMDGPU issue, however removing my dGPU, and also, the rest of my hardware, changed nothing.
However I have had at one point had asus-wmi try to put the system into S0ix, despite the firmware lacking support for S0ix, and now FastFetch no longer prints out my motherboards model number on the Host line, instead just showing ASUS MB, so I have reason to suspect that asus-wmi may be the culprit for this regression.
Unfortunately, I am unable to procure logs, since the kernel crashes before anything meaningful is logged.
I know that this is not a hardware issue, since on Windows and older live-images (Solus is my distribution of choice) these issues do not exist.
I am unsure as to which kernel update broke S3 for me, however I believe it was either late in the 6.10.x cycle or the 6.11.x cycle, since the Solus 4.6 live-image has no such issues on 6.10.13, which was also our last 6.10.x kernel update, and I only started experienced it when we updated to 6.11.5, our first update to 6.11.x, and updating to 6.12.x did not fix anything. The current Solus 4.7 live-image has 6.12.9, and my installed system is currently on 6.12.21.
Hardware:
Motherboard: Asus Z97 Pro Gamer
CPU: i5-4570
dGPU: Sapphire Radeon 540 4GB
RAM: 2x Crucial 8GB DDR3L@1600MT/s, 1x Crucial 4GB DDR3L@1600MT/s
Storage: 500GB Western Digital WD5000AAKX SATA3 7200RPM HDD
DVD Drive: Lite-On DH16ABSH
Add-in cards: Intel 3168 PCIe+USB (by means of a simple adapter) VIA VT6315 Firewire 400 PCIe, MosChip MCS9865 Parallel PCI (over integrated ASMedia ASM1083/1085 PCIe to PCI bridge)
PSU: Seasonix SS-860XP² 860W
Good day/night to everyone.
I have had S3 sleep broken for well over half a year at this point where the kernel simply crashes upon entering sleep, and the PC turns off (literally stops pulling power from the power supply), and turning it on, at first, causes to to run through 1 on/off cycle, with I believe is a diagnostic step on this board, as it happens whenever something causes the PC to unexpectedly turn off (unless the source is external, like a blackout, in which case once power is restored and I turn on the PC, it turns on normally).
At first I thought that it might be an AMDGPU issue, however removing my dGPU, and also, the rest of my hardware, changed nothing.
However I have had at one point had asus-wmi try to put the system into S0ix, despite the firmware lacking support for S0ix, and now FastFetch no longer prints out my motherboards model number on the Host line, instead just showing ASUS MB, so I have reason to suspect that asus-wmi may be the culprit for this regression.
Unfortunately, I am unable to procure logs, since the kernel crashes before anything meaningful is logged.
I know that this is not a hardware issue, since on Windows and older live-images (Solus is my distribution of choice) these issues do not exist.
I am unsure as to which kernel update broke S3 for me, however I believe it was either late in the 6.10.x cycle or the 6.11.x cycle, since the Solus 4.6 live-image has no such issues on 6.10.13, which was also our last 6.10.x kernel update, and I only started experienced it when we updated to 6.11.5, our first update to 6.11.x, and updating to 6.12.x did not fix anything. The current Solus 4.7 live-image has 6.12.9, and my installed system is currently on 6.12.21.
Hardware:
Motherboard: Asus Z97 Pro Gamer
CPU: i5-4570
dGPU: Sapphire Radeon 540 4GB
RAM: 2x Crucial 8GB DDR3L@1600MT/s, 1x Crucial 4GB DDR3L@1600MT/s
Storage: 500GB Western Digital WD5000AAKX SATA3 7200RPM HDD
DVD Drive: Lite-On DH16ABSH
Add-in cards: Intel 3168 PCIe+USB (by means of a simple adapter) VIA VT6315 Firewire 400 PCIe, MosChip MCS9865 Parallel PCI (over integrated ASMedia ASM1083/1085 PCIe to PCI bridge)
PSU: Seasonix SS-860XP² 860W
The function brcmf_usb_dl_writeimage() calls the function
brcmf_usb_dl_cmd() but dose not check its return value. The
'state.state' and the 'state.bytes' are uninitialized if the
function brcmf_usb_dl_cmd() fails. It is dangerous to use
uninitialized variables in the conditions.
Add error handling for brcmf_usb_dl_cmd() to jump to error
handling path if the brcmf_usb_dl_cmd() fails and the
'state.state' and the 'state.bytes' are uninitialized.
Fixes: 71bb244ba2fd ("brcm80211: fmac: add USB support for bcm43235/6/8 chipsets")
Cc: stable(a)vger.kernel.org # v3.4+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
index 50dddac8a2ab..1c97cd777225 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/usb.c
@@ -901,7 +901,9 @@ brcmf_usb_dl_writeimage(struct brcmf_usbdev_info *devinfo, u8 *fw, int fwlen)
}
/* 1) Prepare USB boot loader for runtime image */
- brcmf_usb_dl_cmd(devinfo, DL_START, &state, sizeof(state));
+ err = brcmf_usb_dl_cmd(devinfo, DL_START, &state, sizeof(state));
+ if (err)
+ goto fail;
rdlstate = le32_to_cpu(state.state);
rdlbytes = le32_to_cpu(state.bytes);
--
2.42.0.windows.2
When a CPU chooses to call push_dl_task and picks a task to push to
another CPU's runqueue then it will call find_lock_later_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Please go through the original rt change for more details on the issue.
To fix this, after the lock is obtained inside the find_lock_later_rq,
it ensures that the task is still at the head of pushable tasks list.
Also removed some checks that are no longer needed with the addition of
this new check.
However, the new check of pushable tasks list only applies when
find_lock_later_rq is called by push_dl_task. For the other caller i.e.
dl_task_offline_migration, existing checks are used.
Signed-off-by: Harshit Agarwal <harshit(a)nutanix.com>
Cc: stable(a)vger.kernel.org
---
Changes in v3:
- Incorporated review comments from Juri around the commit message as
well as around the comment regarding checks in find_lock_later_rq.
- Link to v2:
https://lore.kernel.org/stable/20250317022325.52791-1-harshit@nutanix.com/
Changes in v2:
- As per Juri's suggestion, moved the check inside find_lock_later_rq
similar to rt change. Here we distinguish among the push_dl_task
caller vs dl_task_offline_migration by checking if the task is
throttled or not.
- Fixed the commit message to refer to the rt change by title.
- Link to v1:
https://lore.kernel.org/lkml/20250307204255.60640-1-harshit@nutanix.com/
---
kernel/sched/deadline.c | 73 +++++++++++++++++++++++++++--------------
1 file changed, 49 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 38e4537790af..e0c95f33e1ed 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2621,6 +2621,25 @@ static int find_later_rq(struct task_struct *task)
return -1;
}
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_dl_tasks(rq))
+ return NULL;
+
+ p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+ WARN_ON_ONCE(!dl_task(p));
+
+ return p;
+}
+
/* Locks the rq it finds */
static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
{
@@ -2648,12 +2667,37 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
/* Retry if something changed. */
if (double_lock_balance(rq, later_rq)) {
- if (unlikely(task_rq(task) != rq ||
+ /*
+ * double_lock_balance had to release rq->lock, in the
+ * meantime, task may no longer be fit to be migrated.
+ * Check the following to ensure that the task is
+ * still suitable for migration:
+ * 1. It is possible the task was scheduled,
+ * migrate_disabled was set and then got preempted,
+ * so we must check the task migration disable
+ * flag.
+ * 2. The CPU picked is in the task's affinity.
+ * 3. For throttled task (dl_task_offline_migration),
+ * check the following:
+ * - the task is not on the rq anymore (it was
+ * migrated)
+ * - the task is not on CPU anymore
+ * - the task is still a dl task
+ * - the task is not queued on the rq anymore
+ * 4. For the non-throttled task (push_dl_task), the
+ * check to ensure that this task is still at the
+ * head of the pushable tasks list is enough.
+ */
+ if (unlikely(is_migration_disabled(task) ||
!cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
- task_on_cpu(rq, task) ||
- !dl_task(task) ||
- is_migration_disabled(task) ||
- !task_on_rq_queued(task))) {
+ (task->dl.dl_throttled &&
+ (task_rq(task) != rq ||
+ task_on_cpu(rq, task) ||
+ !dl_task(task) ||
+ !task_on_rq_queued(task))) ||
+ (!task->dl.dl_throttled &&
+ task != pick_next_pushable_dl_task(rq)))) {
+
double_unlock_balance(rq, later_rq);
later_rq = NULL;
break;
@@ -2676,25 +2720,6 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
return later_rq;
}
-static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
-{
- struct task_struct *p;
-
- if (!has_pushable_dl_tasks(rq))
- return NULL;
-
- p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
-
- WARN_ON_ONCE(rq->cpu != task_cpu(p));
- WARN_ON_ONCE(task_current(rq, p));
- WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
-
- WARN_ON_ONCE(!task_on_rq_queued(p));
- WARN_ON_ONCE(!dl_task(p));
-
- return p;
-}
-
/*
* See if the non running -deadline tasks on this rq
* can be sent to some other CPU where they can preempt
--
2.49.0.111.g5b97a56fa0
A recent optimization change in LLVM [1] aims to transform certain loop
idioms into calls to strlen() or wcslen(). This change transforms the
first while loop in UniStrcat() into a call to wcslen(), breaking the
build when UniStrcat() gets inlined into alloc_path_with_tree_prefix():
ld.lld: error: undefined symbol: wcslen
>>> referenced by nls_ucs2_utils.h:54 (fs/smb/client/../../nls/nls_ucs2_utils.h:54)
>>> vmlinux.o:(alloc_path_with_tree_prefix)
>>> referenced by nls_ucs2_utils.h:54 (fs/smb/client/../../nls/nls_ucs2_utils.h:54)
>>> vmlinux.o:(alloc_path_with_tree_prefix)
Disable this optimization with '-fno-builtin-wcslen', which prevents the
compiler from assuming that wcslen() is available in the kernel's C
library
Cc: stable(a)vger.kernel.org
Link: https://github.com/llvm/llvm-project/commit/9694844d7e36fd5e01011ab56b64f27… [1]
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
---
Makefile | 3 +++
1 file changed, 3 insertions(+)
diff --git a/Makefile b/Makefile
index 38689a0c3605..f42418556507 100644
--- a/Makefile
+++ b/Makefile
@@ -1068,6 +1068,9 @@ ifdef CONFIG_CC_IS_GCC
KBUILD_CFLAGS += -fconserve-stack
endif
+# Ensure compilers do not transform certain loops into calls to wcslen()
+KBUILD_CFLAGS += -fno-builtin-wcslen
+
# change __FILE__ to the relative path to the source directory
ifdef building_out_of_srctree
KBUILD_CPPFLAGS += $(call cc-option,-ffile-prefix-map=$(srcroot)/=)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250407-fno-builtin-wcslen-90a858ae7d54
Best regards,
--
Nathan Chancellor <nathan(a)kernel.org>
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was strangely not zero on the server side.
The first patch fixes this issue -- present since v5.9 -- and the second
one validates it in the selftests.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (2):
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
selftests: mptcp: validate MPJoin HMacFailure counters
net/mptcp/subflow.c | 8 ++++++--
tools/testing/selftests/net/mptcp/mptcp_join.sh | 18 ++++++++++++++++++
2 files changed, 24 insertions(+), 2 deletions(-)
---
base-commit: 61f96e684edd28ca40555ec49ea1555df31ba619
change-id: 20250407-net-mptcp-hmac-failure-mib-66f599305ff3
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
The patch titled
Subject: mm: protect kernel pgtables in apply_to_pte_range()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-protect-kernel-pgtables-in-apply_to_pte_range.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Alexander Gordeev <agordeev(a)linux.ibm.com>
Subject: mm: protect kernel pgtables in apply_to_pte_range()
Date: Tue, 8 Apr 2025 18:07:32 +0200
The lazy MMU mode can only be entered and left under the protection of the
page table locks for all page tables which may be modified. Yet, when it
comes to kernel mappings apply_to_pte_range() does not take any locks.
That does not conform arch_enter|leave_lazy_mmu_mode() semantics and could
potentially lead to re-schedulling a process while in lazy MMU mode or
racing on a kernel page table updates.
Link: https://lkml.kernel.org/r/ef8f6538b83b7fc3372602f90375348f9b4f3596.17441281…
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Guenetr Roeck <linux(a)roeck-us.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jeremy Fitzhardinge <jeremy(a)goop.org>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/shadow.c | 7 ++-----
mm/memory.c | 5 ++++-
2 files changed, 6 insertions(+), 6 deletions(-)
--- a/mm/kasan/shadow.c~mm-protect-kernel-pgtables-in-apply_to_pte_range
+++ a/mm/kasan/shadow.c
@@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pt
__memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
- spin_lock(&init_mm.page_table_lock);
if (likely(pte_none(ptep_get(ptep)))) {
set_pte_at(&init_mm, addr, ptep, pte);
page = 0;
}
- spin_unlock(&init_mm.page_table_lock);
+
if (page)
free_page(page);
+
return 0;
}
@@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(
page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
- spin_lock(&init_mm.page_table_lock);
-
if (likely(!pte_none(ptep_get(ptep)))) {
pte_clear(&init_mm, addr, ptep);
free_page(page);
}
- spin_unlock(&init_mm.page_table_lock);
return 0;
}
--- a/mm/memory.c~mm-protect-kernel-pgtables-in-apply_to_pte_range
+++ a/mm/memory.c
@@ -2926,6 +2926,7 @@ static int apply_to_pte_range(struct mm_
pte = pte_offset_kernel(pmd, addr);
if (!pte)
return err;
+ spin_lock(&init_mm.page_table_lock);
} else {
if (create)
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -2951,7 +2952,9 @@ static int apply_to_pte_range(struct mm_
arch_leave_lazy_mmu_mode();
- if (mm != &init_mm)
+ if (mm == &init_mm)
+ spin_unlock(&init_mm.page_table_lock);
+ else
pte_unmap_unlock(mapped_pte, ptl);
*mask |= PGTBL_PTE_MODIFIED;
_
Patches currently in -mm which might be from agordeev(a)linux.ibm.com are
kasan-avoid-sleepable-page-allocation-from-atomic-context.patch
mm-cleanup-apply_to_pte_range-routine.patch
mm-protect-kernel-pgtables-in-apply_to_pte_range.patch
The patch titled
Subject: mm: clean up apply_to_pte_range()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-cleanup-apply_to_pte_range-routine.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Alexander Gordeev <agordeev(a)linux.ibm.com>
Subject: mm: clean up apply_to_pte_range()
Date: Tue, 8 Apr 2025 18:07:31 +0200
Reverse 'create' vs 'mm == &init_mm' conditions and move page table mask
modification out of the atomic context. This is a prerequisite for fixing
missing kernel page tables lock.
Link: https://lkml.kernel.org/r/0c65bc334f17ff1d7d92d31c69d7065769bbce4e.17441281…
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Guenetr Roeck <linux(a)roeck-us.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jeremy Fitzhardinge <jeremy(a)goop.org>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
--- a/mm/memory.c~mm-cleanup-apply_to_pte_range-routine
+++ a/mm/memory.c
@@ -2915,24 +2915,28 @@ static int apply_to_pte_range(struct mm_
pte_fn_t fn, void *data, bool create,
pgtbl_mod_mask *mask)
{
+ int err = create ? -ENOMEM : -EINVAL;
pte_t *pte, *mapped_pte;
- int err = 0;
spinlock_t *ptl;
- if (create) {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_alloc_kernel_track(pmd, addr, mask) :
- pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ if (mm == &init_mm) {
+ if (create)
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
+ else
+ pte = pte_offset_kernel(pmd, addr);
if (!pte)
- return -ENOMEM;
+ return err;
} else {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_offset_kernel(pmd, addr) :
- pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (create)
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ else
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
- return -EINVAL;
+ return err;
+ mapped_pte = pte;
}
+ err = 0;
arch_enter_lazy_mmu_mode();
if (fn) {
@@ -2944,12 +2948,14 @@ static int apply_to_pte_range(struct mm_
}
} while (addr += PAGE_SIZE, addr != end);
}
- *mask |= PGTBL_PTE_MODIFIED;
arch_leave_lazy_mmu_mode();
if (mm != &init_mm)
pte_unmap_unlock(mapped_pte, ptl);
+
+ *mask |= PGTBL_PTE_MODIFIED;
+
return err;
}
_
Patches currently in -mm which might be from agordeev(a)linux.ibm.com are
kasan-avoid-sleepable-page-allocation-from-atomic-context.patch
mm-cleanup-apply_to_pte_range-routine.patch
mm-protect-kernel-pgtables-in-apply_to_pte_range.patch
The patch titled
Subject: kasan: avoid sleepable page allocation from atomic context
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
kasan-avoid-sleepable-page-allocation-from-atomic-context.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Alexander Gordeev <agordeev(a)linux.ibm.com>
Subject: kasan: avoid sleepable page allocation from atomic context
Date: Tue, 8 Apr 2025 18:07:30 +0200
Patch series "mm: Fix apply_to_pte_range() vs lazy MMU mode", v2.
This series is an attempt to fix the violation of lazy MMU mode context
requirement as described for arch_enter_lazy_mmu_mode():
This mode can only be entered and left under the protection of
the page table locks for all page tables which may be modified.
On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:
[ 553.332108] preempt_count: 1, expected: 0
[ 553.332117] no locks held by multipathd/2116.
[ 553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
[ 553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
[ 553.332146] Call Trace:
[ 553.332152] [<00000000158de23a>] dump_stack_lvl+0xfa/0x150
[ 553.332167] [<0000000013e10d12>] __might_resched+0x57a/0x5e8
[ 553.332178] [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0
[ 553.332189] [<00000000144d5cdc>] __get_free_pages+0x2c/0x88
[ 553.332198] [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110
[ 553.332207] [<000000001447625c>] apply_to_pte_range+0x164/0x3c8
[ 553.332218] [<000000001448125a>] apply_to_pmd_range+0xda/0x318
[ 553.332226] [<000000001448181c>] __apply_to_page_range+0x384/0x768
[ 553.332233] [<0000000014481c28>] apply_to_page_range+0x28/0x38
[ 553.332241] [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98
[ 553.332249] [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90
[ 553.332257] [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260
[ 553.332265] [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360
[ 553.332274] [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378
[ 553.332284] [<0000000013d62726>] dup_task_struct+0x66/0x430
[ 553.332293] [<0000000013d63962>] copy_process+0x432/0x4b80
[ 553.332302] [<0000000013d68300>] kernel_clone+0xf0/0x7d0
[ 553.332311] [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8
[ 553.332400] [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118
[ 553.332410] [<0000000013c9d34c>] do_syscall+0x22c/0x328
[ 553.332419] [<00000000158e7366>] __do_syscall+0xce/0xf0
[ 553.332428] [<0000000015913260>] system_call+0x70/0x98
This exposes a KASAN issue fixed with patch 1 and apply_to_pte_range()
issue fixed with patch 3, while patch 2 is a prerequisite.
Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu
mode") looks like powerpc-only fix, yet not entirely conforming to the
above provided requirement (page tables itself are still not protected).
If I am not mistaken, xen and sparc are alike.
This patch (of 3):
apply_to_page_range() enters lazy MMU mode and then invokes
kasan_populate_vmalloc_pte() callback on each page table walk iteration.
The lazy MMU mode may only be entered only under protection of the page
table lock. However, the callback can go into sleep when trying to
allocate a single page.
Change __get_free_page() allocation mode from GFP_KERNEL to GFP_ATOMIC to
avoid scheduling out while in atomic context.
Link: https://lkml.kernel.org/r/cover.1744128123.git.agordeev@linux.ibm.com
Link: https://lkml.kernel.org/r/2d9f4ac4528701b59d511a379a60107fa608ad30.17441281…
Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Guenetr Roeck <linux(a)roeck-us.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jeremy Fitzhardinge <jeremy(a)goop.org>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/shadow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/kasan/shadow.c~kasan-avoid-sleepable-page-allocation-from-atomic-context
+++ a/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pt
if (likely(!pte_none(ptep_get(ptep))))
return 0;
- page = __get_free_page(GFP_KERNEL);
+ page = __get_free_page(GFP_ATOMIC);
if (!page)
return -ENOMEM;
_
Patches currently in -mm which might be from agordeev(a)linux.ibm.com are
kasan-avoid-sleepable-page-allocation-from-atomic-context.patch
mm-cleanup-apply_to_pte_range-routine.patch
mm-protect-kernel-pgtables-in-apply_to_pte_range.patch
Hi folks,
This series fixes support for correctly saving and restoring fltcon0
and fltcon1 registers on gs101 for non-alive banks where the fltcon
register offset is not at a fixed offset (unlike previous SoCs).
This is done by adding a eint_fltcon_offset and providing GS101
specific pin macros that take an additional parameter (similar to
how exynosautov920 handles it's eint_con_offset).
Additionally the SoC specific suspend and resume callbacks are
re-factored so that each SoC variant has it's own callback containing
the peculiarities for that SoC.
Finally support for filter selection on alive banks is added, this is
currently only enabled for gs101. The code path can be excercised using
`echo mem > /sys/power/state`
regards,
Peter
To: Krzysztof Kozlowski <krzk(a)kernel.org>
To: Sylwester Nawrocki <s.nawrocki(a)samsung.com>
To: Alim Akhtar <alim.akhtar(a)samsung.com>
To: Linus Walleij <linus.walleij(a)linaro.org>
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-samsung-soc(a)vger.kernel.org
Cc: linux-gpio(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: andre.draszik(a)linaro.org
Cc: tudor.ambarus(a)linaro.org
Cc: willmcvicker(a)google.com
Cc: semen.protsenko(a)linaro.org
Cc: kernel-team(a)android.com
Cc: jaewon02.kim(a)samsung.com
Signed-off-by: Peter Griffin <peter.griffin(a)linaro.org>
---
Changes in v6:
- Make drvdata->suspend/resume symmetrically reversed (Krzysztof)
- rebase on linux-next
- Link to v5: https://lore.kernel.org/r/20250312-pinctrl-fltcon-suspend-v5-0-d98d5b271242…
Changes in v5:
- Split drvdata suspend & resume callbacks into a dedicated patch (Krzysztof)
- Add comment about stable dependency (Krzysztof)
- Add back in {} braces (Krzysztof)
- Link to v4: https://lore.kernel.org/r/20250307-pinctrl-fltcon-suspend-v4-0-2d775e486036…
Changes in v4:
- save->eint_fltcon1 is an argument to pr_debug(), not readl() change alignment accordingly (Andre)
- Link to v3: https://lore.kernel.org/r/20250306-pinctrl-fltcon-suspend-v3-0-f9ab4ff6a24e…
Changes in v3:
- Ensure EXYNOS_FLTCON_DIGITAL bit is cleared (Andre)
- Make it obvious that exynos_eint_set_filter() is conditional on bank type (Andre)
- Make it obvious exynos_set_wakeup() is conditional on bank type (Andre)
- Align style where the '+' is placed first (Andre)
- Remove unnecessary braces (Andre)
- Link to v2: https://lore.kernel.org/r/20250301-pinctrl-fltcon-suspend-v2-0-a7eef9bb443b…
Changes in v2:
- Remove eint_flt_selectable bool as it can be deduced from EINT_TYPE_WKUP (Peter)
- Move filter config register comment to header file (Andre)
- Rename EXYNOS_FLTCON_DELAY to EXYNOS_FLTCON_ANALOG (Andre)
- Remove misleading old comment (Andre)
- Refactor exynos_eint_update_flt_reg() into a loop (Andre)
- Split refactor of suspend/resume callbacks & gs101 parts into separate patches (Andre)
- Link to v1: https://lore.kernel.org/r/20250120-pinctrl-fltcon-suspend-v1-0-e77900b2a854…
---
Peter Griffin (4):
pinctrl: samsung: refactor drvdata suspend & resume callbacks
pinctrl: samsung: add dedicated SoC eint suspend/resume callbacks
pinctrl: samsung: add gs101 specific eint suspend/resume callbacks
pinctrl: samsung: Add filter selection support for alive bank on gs101
drivers/pinctrl/samsung/pinctrl-exynos-arm64.c | 52 ++---
drivers/pinctrl/samsung/pinctrl-exynos.c | 294 +++++++++++++++----------
drivers/pinctrl/samsung/pinctrl-exynos.h | 28 ++-
drivers/pinctrl/samsung/pinctrl-samsung.c | 21 +-
drivers/pinctrl/samsung/pinctrl-samsung.h | 8 +-
5 files changed, 252 insertions(+), 151 deletions(-)
---
base-commit: cd37a617b4bfb43f84dbbf8058317b487f5203ae
change-id: 20250120-pinctrl-fltcon-suspend-2333a137c4d4
Best regards,
--
Peter Griffin <peter.griffin(a)linaro.org>
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue then it will call find_lock_lowest_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors all
coming from scheduler itself.
Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
Call Trace:
? __die_body+0x1a/0x60
? die+0x2a/0x50
? do_trap+0x85/0x100
? pick_next_task_rt+0x6e/0x1d0
? do_error_trap+0x64/0xa0
? pick_next_task_rt+0x6e/0x1d0
? exc_invalid_op+0x4c/0x60
? pick_next_task_rt+0x6e/0x1d0
? asm_exc_invalid_op+0x12/0x20
? pick_next_task_rt+0x6e/0x1d0
__schedule+0x5cb/0x790
? update_ts_time_stats+0x55/0x70
schedule_idle+0x1e/0x40
do_idle+0x15e/0x200
cpu_startup_entry+0x19/0x20
start_secondary+0x117/0x160
secondary_startup_64_no_verify+0xb0/0xbb
-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
Call Trace:
? __die_body+0x1a/0x60
? no_context+0x183/0x350
? __warn+0x8a/0xe0
? exc_page_fault+0x3d6/0x520
? asm_exc_page_fault+0x1e/0x30
? pick_next_task_rt+0xb5/0x1d0
? pick_next_task_rt+0x8c/0x1d0
__schedule+0x583/0x7e0
? update_ts_time_stats+0x55/0x70
schedule_idle+0x1e/0x40
do_idle+0x15e/0x200
cpu_startup_entry+0x19/0x20
start_secondary+0x117/0x160
secondary_startup_64_no_verify+0xb0/0xbb
-> BUG: unable to handle page fault for address: ffff9464daea5900
kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))
-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
Call Trace:
? __die_body+0x1a/0x60
? die+0x2a/0x50
? do_trap+0x85/0x100
? dequeue_top_rt_rq+0xa2/0xb0
? do_error_trap+0x64/0xa0
? dequeue_top_rt_rq+0xa2/0xb0
? exc_invalid_op+0x4c/0x60
? dequeue_top_rt_rq+0xa2/0xb0
? asm_exc_invalid_op+0x12/0x20
? dequeue_top_rt_rq+0xa2/0xb0
dequeue_rt_entity+0x1f/0x70
dequeue_task_rt+0x2d/0x70
__schedule+0x1a8/0x7e0
? blk_finish_plug+0x25/0x40
schedule+0x3c/0xb0
futex_wait_queue_me+0xb6/0x120
futex_wait+0xd9/0x240
do_futex+0x344/0xa90
? get_mm_exe_file+0x30/0x60
? audit_exe_compare+0x58/0x70
? audit_filter_rules.constprop.26+0x65e/0x1220
__x64_sys_futex+0x148/0x1f0
do_syscall_64+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x62/0xc7
-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
Call Trace:
? __die_body+0x1a/0x60
? no_context+0x183/0x350
? spurious_kernel_fault+0x171/0x1c0
? exc_page_fault+0x3b6/0x520
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? asm_exc_page_fault+0x1e/0x30
? _cond_resched+0x15/0x30
? futex_wait_queue_me+0xc8/0x120
? futex_wait+0xd9/0x240
? try_to_wake_up+0x1b8/0x490
? futex_wake+0x78/0x160
? do_futex+0xcd/0xa90
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? plist_del+0x6a/0xd0
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? dequeue_pushable_task+0x20/0x70
? __schedule+0x382/0x7e0
? asm_sysvec_reschedule_ipi+0xa/0x20
? schedule+0x3c/0xb0
? exit_to_user_mode_prepare+0x9e/0x150
? irqentry_exit_to_user_mode+0x5/0x30
? asm_sysvec_reschedule_ipi+0x12/0x20
Above are some of the common examples of the crashes that were observed
due to this issue.
Details
=======
Let's look at the following scenario to understand this race.
1) CPU A enters push_rt_task
a) CPU A has chosen next_task = task p.
b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
c) CPU A identifies CPU X as a destination CPU (X < Z).
d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
locked CPU X’s rq, and thus, CPU A must wait.
2) At CPU Z
a) Previous task has completed execution and thus, CPU Z enters
schedule, locks its own rq after CPU A releases it.
b) CPU Z dequeues previous task and begins executing task p.
c) CPU Z unlocks its rq.
d) Task p yields the CPU (ex. by doing IO or waiting to acquire a
lock) which triggers the schedule function on CPU Z.
e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.
3) At CPU B
a) CPU B enters try_to_wake_up with input task p.
b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
B.state = WAKING.
c) CPU B via select_task_rq determines CPU Y as the target CPU.
4) The race
a) CPU A acquires CPU X’s lock and relocks CPU Z.
b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
still on CPU Z.
c) CPU A failed to notice task p had been dequeued from CPU Z while
CPU A was waiting for locks in double_lock_balance. If CPU A knew
that task p had been dequeued, it would return NULL forcing
push_rt_task to give up the task p's migration.
d) CPU B updates task p.cpu = Y and calls ttwu_queue.
e) CPU B locks Ys rq. CPU B enqueues task p onto Y and sets task
p.on_rq = 1.
f) CPU B unlocks CPU Y, triggering memory synchronization.
g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
has not migrated.
h) CPU A decides to migrate p to CPU X.
This leads to A dequeuing p from Y's queue and various crashes down the
line.
Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If not, then it is anyway not suitable for
being pushed out.
Testing
=======
The fix is tested on a cluster of 3 nodes, where the panics due to this
are hit every couple of days. A fix similar to this was deployed on such
cluster and was stable for more than 30 days.
Co-developed-by: Jon Kohler <jon(a)nutanix.com>
Signed-off-by: Jon Kohler <jon(a)nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan(a)nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan(a)nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru(a)nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru(a)nutanix.com>
Signed-off-by: Harshit Agarwal <harshit(a)nutanix.com>
Tested-by: Will Ton <william.ton(a)nutanix.com>
Reviewed-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Cc: stable(a)vger.kernel.org
---
Changes in v2:
- As per Steve's suggestion, removed some checks that are done after
obtaining the lock that are no longer needed with the addition of new
check.
- Moved up is_migration_disabled check.
- Link to v1:
https://lore.kernel.org/lkml/20250211054646.23987-1-harshit@nutanix.com/
Changes in v3:
- Updated commit message to add stable maintainers and reviewed-by tag.
- Link to v2:
https://lore.kernel.org/lkml/20250214170844.201692-1-harshit@nutanix.com/
---
kernel/sched/rt.c | 54 +++++++++++++++++++++++------------------------
1 file changed, 26 insertions(+), 28 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..4762dd3f50c5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
return -1;
}
+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->rt.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ BUG_ON(rq->cpu != task_cpu(p));
+ BUG_ON(task_current(rq, p));
+ BUG_ON(task_current_donor(rq, p));
+ BUG_ON(p->nr_cpus_allowed <= 1);
+
+ BUG_ON(!task_on_rq_queued(p));
+ BUG_ON(!rt_task(p));
+
+ return p;
+}
+
/* Will lock the rq it finds */
static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
{
@@ -1915,18 +1936,16 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
/*
* We had to unlock the run queue. In
* the mean time, task could have
- * migrated already or had its affinity changed.
- * Also make sure that it wasn't scheduled on its rq.
+ * migrated already or had its affinity changed,
+ * therefore check if the task is still at the
+ * head of the pushable tasks list.
* It is possible the task was scheduled, set
* "migrate_disabled" and then got preempted, so we must
* check the task migration disable flag here too.
*/
- if (unlikely(task_rq(task) != rq ||
+ if (unlikely(is_migration_disabled(task) ||
!cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
- task_on_cpu(rq, task) ||
- !rt_task(task) ||
- is_migration_disabled(task) ||
- !task_on_rq_queued(task))) {
+ task != pick_next_pushable_task(rq))) {
double_unlock_balance(rq, lowest_rq);
lowest_rq = NULL;
@@ -1946,27 +1965,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
return lowest_rq;
}
-static struct task_struct *pick_next_pushable_task(struct rq *rq)
-{
- struct task_struct *p;
-
- if (!has_pushable_tasks(rq))
- return NULL;
-
- p = plist_first_entry(&rq->rt.pushable_tasks,
- struct task_struct, pushable_tasks);
-
- BUG_ON(rq->cpu != task_cpu(p));
- BUG_ON(task_current(rq, p));
- BUG_ON(task_current_donor(rq, p));
- BUG_ON(p->nr_cpus_allowed <= 1);
-
- BUG_ON(!task_on_rq_queued(p));
- BUG_ON(!rt_task(p));
-
- return p;
-}
-
/*
* If the current CPU has more than one RT task, see if the non
* running task can migrate over to a CPU that is running a task
--
2.22.3
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which
can require large contiguous memory depending on the implementation.
This change prevents allocation failures by allowing the system to fall
back to vmalloc when contiguous memory allocation fails.
Since this buffer is only used for debugging purposes, physical memory
contiguity is not required, making vmalloc a suitable alternative.
Cc: stable(a)vger.kernel.org
Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <riel(a)surriel.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Acked-by: Andrea Righi <arighi(a)nvidia.com>
---
Changes in v3:
- Rewording the patch message
- Link to v2: https://lore.kernel.org/r/20250408-scx-v2-1-1979fc040903@debian.org
Changes in v2:
- Use kvfree() on the free path as well.
- Link to v1: https://lore.kernel.org/r/20250407-scx-v1-1-774ba74a2c17@debian.org
---
kernel/sched/ext.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 66bcd40a28ca1..db9af6a3c04fd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4623,7 +4623,7 @@ static void scx_ops_bypass(bool bypass)
static void free_exit_info(struct scx_exit_info *ei)
{
- kfree(ei->dump);
+ kvfree(ei->dump);
kfree(ei->msg);
kfree(ei->bt);
kfree(ei);
@@ -4639,7 +4639,7 @@ static struct scx_exit_info *alloc_exit_info(size_t exit_dump_len)
ei->bt = kcalloc(SCX_EXIT_BT_LEN, sizeof(ei->bt[0]), GFP_KERNEL);
ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL);
- ei->dump = kzalloc(exit_dump_len, GFP_KERNEL);
+ ei->dump = kvzalloc(exit_dump_len, GFP_KERNEL);
if (!ei->bt || !ei->msg || !ei->dump) {
free_exit_info(ei);
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250407-scx-11dbf94803c3
Best regards,
--
Breno Leitao <leitao(a)debian.org>
This fix is the deadline version of the change made to the rt scheduler
titled: "sched/rt: Fix race in push_rt_task".
Here is the summary of the issue:
When a CPU chooses to call push_dl_task and picks a task to push to
another CPU's runqueue then it will call find_lock_later_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Please go through the original change for more details on the issue.
In this fix, after the lock is obtained inside the find_lock_later_rq we
ensure that the task is still at the head of pushable tasks list. Also
removed some checks that are no longer needed with the addition this new
check.
However, the check of pushable tasks list only applies when
find_lock_later_rq is called by push_dl_task. For the other caller i.e.
dl_task_offline_migration, we use the existing checks.
Signed-off-by: Harshit Agarwal <harshit(a)nutanix.com>
Cc: stable(a)vger.kernel.org
---
Changes in v2:
- As per Juri's suggestion, moved the check inside find_lock_later_rq
similar to rt change. Here we distinguish among the push_dl_task
caller vs dl_task_offline_migration by checking if the task is
throttled or not.
- Fixed the commit message to refer to the rt change by title.
- Link to v1:
https://lore.kernel.org/lkml/20250307204255.60640-1-harshit@nutanix.com/
---
kernel/sched/deadline.c | 66 ++++++++++++++++++++++++++---------------
1 file changed, 42 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 38e4537790af..2366801b4557 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2621,6 +2621,25 @@ static int find_later_rq(struct task_struct *task)
return -1;
}
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_dl_tasks(rq))
+ return NULL;
+
+ p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+ WARN_ON_ONCE(!dl_task(p));
+
+ return p;
+}
+
/* Locks the rq it finds */
static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
{
@@ -2648,12 +2667,30 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
/* Retry if something changed. */
if (double_lock_balance(rq, later_rq)) {
- if (unlikely(task_rq(task) != rq ||
+ /*
+ * We had to unlock the run queue. In the meantime,
+ * task could have migrated already or had its affinity
+ * changed.
+ * It is possible the task was scheduled, set
+ * "migrate_disabled" and then got preempted, so we must
+ * check the task migration disable flag here too.
+ * For throttled task (dl_task_offline_migration), we
+ * check if the task is migrated to a different rq or
+ * is not a dl task anymore.
+ * For the non-throttled task (push_dl_task), the check
+ * to ensure that this task is still at the head of the
+ * pushable tasks list is enough.
+ */
+ if (unlikely(is_migration_disabled(task) ||
!cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
- task_on_cpu(rq, task) ||
- !dl_task(task) ||
- is_migration_disabled(task) ||
- !task_on_rq_queued(task))) {
+ (task->dl.dl_throttled &&
+ (task_rq(task) != rq ||
+ task_on_cpu(rq, task) ||
+ !dl_task(task) ||
+ !task_on_rq_queued(task))) ||
+ (!task->dl.dl_throttled &&
+ task != pick_next_pushable_dl_task(rq)))) {
+
double_unlock_balance(rq, later_rq);
later_rq = NULL;
break;
@@ -2676,25 +2713,6 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
return later_rq;
}
-static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
-{
- struct task_struct *p;
-
- if (!has_pushable_dl_tasks(rq))
- return NULL;
-
- p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
-
- WARN_ON_ONCE(rq->cpu != task_cpu(p));
- WARN_ON_ONCE(task_current(rq, p));
- WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
-
- WARN_ON_ONCE(!task_on_rq_queued(p));
- WARN_ON_ONCE(!dl_task(p));
-
- return p;
-}
-
/*
* See if the non running -deadline tasks on this rq
* can be sent to some other CPU where they can preempt
--
2.39.3
Some peripheral subsystems request IRQ_TYPE_EDGE_BOTH interrupt type and
report request failures on LIOINTC. To avoid such failures we support to
set IRQ_TYPE_EDGE_BOTH type on LIOINTC, by setting LIOINTC_REG_INTC_EDGE
to true and keep LIOINTC_REG_INTC_POL as is.
Cc: stable(a)vger.kernel.org
Signed-off-by: Yinbo Zhu <zhuyinbo(a)loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai(a)loongson.cn>
---
drivers/irqchip/irq-loongson-liointc.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/irqchip/irq-loongson-liointc.c b/drivers/irqchip/irq-loongson-liointc.c
index 2b1bd4a96665..c0c8ef8d27cf 100644
--- a/drivers/irqchip/irq-loongson-liointc.c
+++ b/drivers/irqchip/irq-loongson-liointc.c
@@ -128,6 +128,10 @@ static int liointc_set_type(struct irq_data *data, unsigned int type)
liointc_set_bit(gc, LIOINTC_REG_INTC_EDGE, mask, false);
liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, true);
break;
+ case IRQ_TYPE_EDGE_BOTH:
+ liointc_set_bit(gc, LIOINTC_REG_INTC_EDGE, mask, true);
+ /* Requester need "both", keep LIOINTC_REG_INTC_POL as is */
+ break;
case IRQ_TYPE_EDGE_RISING:
liointc_set_bit(gc, LIOINTC_REG_INTC_EDGE, mask, true);
liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, false);
--
2.47.1
tpm2_start_auth_session() does not mask TPM RC correctly from the callers:
[ 28.766528] tpm tpm0: A TPM error (2307) occurred start auth session
Process TPM RCs inside tpm2_start_auth_session(), and map them to POSIX
error codes.
Cc: stable(a)vger.kernel.org # v6.10+
Fixes: 699e3efd6c64 ("tpm: Add HMAC session start and end functions")
Reported-by: Herbert Xu <herbert(a)gondor.apana.org.au>
Closes: https://lore.kernel.org/linux-integrity/Z_NgdRHuTKP6JK--@gondor.apana.org.a…
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
---
drivers/char/tpm/tpm2-sessions.c | 34 ++++++++++++++++----------------
include/linux/tpm.h | 1 +
2 files changed, 18 insertions(+), 17 deletions(-)
diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c
index 3f89635ba5e8..1ed23375e4cb 100644
--- a/drivers/char/tpm/tpm2-sessions.c
+++ b/drivers/char/tpm/tpm2-sessions.c
@@ -40,11 +40,6 @@
*
* These are the usage functions:
*
- * tpm2_start_auth_session() which allocates the opaque auth structure
- * and gets a session from the TPM. This must be called before
- * any of the following functions. The session is protected by a
- * session_key which is derived from a random salt value
- * encrypted to the NULL seed.
* tpm2_end_auth_session() kills the session and frees the resources.
* Under normal operation this function is done by
* tpm_buf_check_hmac_response(), so this is only to be used on
@@ -963,16 +958,13 @@ static int tpm2_load_null(struct tpm_chip *chip, u32 *null_key)
}
/**
- * tpm2_start_auth_session() - create a HMAC authentication session with the TPM
- * @chip: the TPM chip structure to create the session with
+ * tpm2_start_auth_session() - Create an a HMAC authentication session
+ * @chip: A TPM chip
*
- * This function loads the NULL seed from its saved context and starts
- * an authentication session on the null seed, fills in the
- * @chip->auth structure to contain all the session details necessary
- * for performing the HMAC, encrypt and decrypt operations and
- * returns. The NULL seed is flushed before this function returns.
+ * Loads the ephemeral key (null seed), and starts an HMAC authenticated
+ * session. The null seed is flushed before the return.
*
- * Return: zero on success or actual error encountered.
+ * Returns zero on success, or a POSIX error code.
*/
int tpm2_start_auth_session(struct tpm_chip *chip)
{
@@ -1024,11 +1016,19 @@ int tpm2_start_auth_session(struct tpm_chip *chip)
/* hash algorithm for session */
tpm_buf_append_u16(&buf, TPM_ALG_SHA256);
- rc = tpm_transmit_cmd(chip, &buf, 0, "start auth session");
+ rc = tpm_transmit_cmd(chip, &buf, 0, "StartAuthSession");
tpm2_flush_context(chip, null_key);
-
- if (rc == TPM2_RC_SUCCESS)
- rc = tpm2_parse_start_auth_session(auth, &buf);
+ switch (rc) {
+ case TPM2_RC_SUCCESS:
+ break;
+ case TPM2_RC_SESSION_MEMORY:
+ rc = -ENOMEM;
+ goto out;
+ default:
+ rc = -EFAULT;
+ goto out;
+ }
+ rc = tpm2_parse_start_auth_session(auth, &buf);
tpm_buf_destroy(&buf);
diff --git a/include/linux/tpm.h b/include/linux/tpm.h
index 6c3125300c00..c1d3d60b416f 100644
--- a/include/linux/tpm.h
+++ b/include/linux/tpm.h
@@ -257,6 +257,7 @@ enum tpm2_return_codes {
TPM2_RC_TESTING = 0x090A, /* RC_WARN */
TPM2_RC_REFERENCE_H0 = 0x0910,
TPM2_RC_RETRY = 0x0922,
+ TPM2_RC_SESSION_MEMORY = 0x0903,
};
enum tpm2_command_codes {
--
2.39.5
From: Wenlin Kang <wenlin.kang(a)windriver.com>
The selftest tpdir2 terminated with a 'Segmentation fault' during loading.
root@localhost:~# cd linux-kenel/tools/testing/selftests/arm64/abi && make
root@localhost:~/linux-kernel/tools/testing/selftests/arm64/abi# ./tpidr2
Segmentation fault
The cause of this is the __arch_clear_user() failure.
load_elf_binary() [fs/binfmt_elf.c]
-> if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bes)))
-> padzero()
-> clear_user() [arch/arm64/include/asm/uaccess.h]
-> __arch_clear_user() [arch/arm64/lib/clear_user.S]
For more details, please see:
https://lore.kernel.org/lkml/1d0342f3-0474-482b-b6db-81ca7820a462@t-8ch.de/…
This issue has been fixed in the mainline. Here I have backported
the relevant commits for the linux-6.6.y branch and attached them.
With these patches, tpdir2 works as:
root@localhost:~/linux-kernel/tools/testing/selftests/arm64/abi# ./tpidr2
TAP version 13
1..5
ok 0 skipped, TPIDR2 not supported
ok 1 skipped, TPIDR2 not supported
ok 2 skipped, TPIDR2 not supported
ok 3 skipped, TPIDR2 not supported
ok 4 skipped, TPIDR2 not supported
This issue is resolved by the first patch. However, to ensure
functional completeness, all related patches were backported
according to the following link.
https://lore.kernel.org/all/20230929031716.it.155-kees@kernel.org/#t
Eric W. Biederman (1):
binfmt_elf: Support segments with 0 filesz and misaligned starts
Kees Cook (5):
binfmt_elf: elf_bss no longer used by load_elf_binary()
binfmt_elf: Use elf_load() for interpreter
binfmt_elf: Use elf_load() for library
binfmt_elf: Only report padzero() errors when PROT_WRITE
mm: Remove unused vm_brk()
fs/binfmt_elf.c | 215 ++++++++++++++++-----------------------------
include/linux/mm.h | 3 +-
mm/mmap.c | 6 --
mm/nommu.c | 5 --
4 files changed, 76 insertions(+), 153 deletions(-)
--
2.43.0
Hello,
New build issue found on stable-rc/linux-5.15.y:
---
variable 'base_clk' is used uninitialized whenever 'if' condition is
true [-Werror,-Wsometimes-uninitialized] in
drivers/mmc/host/sdhci-brcmstb.o (drivers/mmc/host/sdhci-brcmstb.c)
[logspec:kbuild,kbuild.compiler.error]
---
- dashboard: https://d.kernelci.org/i/maestro:eb9b0da83cc077e6176b9903d98f0f78704ac17f
- giturl: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
- commit HEAD: 0b4857306c618d2052f6455b90747ef1df364ecd
Log excerpt:
=====================================================
drivers/mmc/host/sdhci-brcmstb.c:303:6: error: variable 'base_clk' is
used uninitialized whenever 'if' condition is true
[-Werror,-Wsometimes-uninitialized]
303 | if (res)
| ^~~
drivers/mmc/host/sdhci-brcmstb.c:377:24: note: uninitialized use occurs here
377 | clk_disable_unprepare(base_clk);
| ^~~~~~~~
drivers/mmc/host/sdhci-brcmstb.c:303:2: note: remove the 'if' if its
condition is always false
303 | if (res)
| ^~~~~~~~
304 | goto err;
| ~~~~~~~~
drivers/mmc/host/sdhci-brcmstb.c:296:6: error: variable 'base_clk' is
used uninitialized whenever 'if' condition is true
[-Werror,-Wsometimes-uninitialized]
296 | if (IS_ERR(priv->cfg_regs)) {
| ^~~~~~~~~~~~~~~~~~~~~~
drivers/mmc/host/sdhci-brcmstb.c:377:24: note: uninitialized use occurs here
377 | clk_disable_unprepare(base_clk);
| ^~~~~~~~
drivers/mmc/host/sdhci-brcmstb.c:296:2: note: remove the 'if' if its
condition is always false
296 | if (IS_ERR(priv->cfg_regs)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
297 | res = PTR_ERR(priv->cfg_regs);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
298 | goto err;
| ~~~~~~~~~
299 | }
| ~
drivers/mmc/host/sdhci-brcmstb.c:281:6: error: variable 'base_clk' is
used uninitialized whenever 'if' condition is true
[-Werror,-Wsometimes-uninitialized]
281 | if (IS_ERR(host)) {
| ^~~~~~~~~~~~
drivers/mmc/host/sdhci-brcmstb.c:377:24: note: uninitialized use occurs here
377 | clk_disable_unprepare(base_clk);
| ^~~~~~~~
drivers/mmc/host/sdhci-brcmstb.c:281:2: note: remove the 'if' if its
condition is always false
281 | if (IS_ERR(host)) {
| ^~~~~~~~~~~~~~~~~~~
282 | res = PTR_ERR(host);
| ~~~~~~~~~~~~~~~~~~~~
283 | goto err_clk;
| ~~~~~~~~~~~~~
284 | }
| ~
drivers/mmc/host/sdhci-brcmstb.c:260:22: note: initialize the variable
'base_clk' to silence this warning
260 | struct clk *base_clk;
| ^
| = NULL
3 errors generated.
CC [M] drivers/gpu/drm/drm_gem.o
CC [M] drivers/staging/rtl8723bs/hal/odm_EdcaTurboCheck.o
CC [M] drivers/net/ethernet/rocker/rocker_tlv.o
CC [M] drivers/staging/nvec/nvec.o
CC [M] drivers/staging/media/zoran/zoran_card.o
=====================================================
# Builds where the incident occurred:
## defconfig+allmodconfig+CONFIG_FRAME_WARN=2048 on (arm):
- compiler: clang-17
- dashboard: https://d.kernelci.org/build/maestro:67f509a66fa43d168f278a2b
#kernelci issue maestro:eb9b0da83cc077e6176b9903d98f0f78704ac17f
Reported-by: kernelci.org bot <bot(a)kernelci.org>
--
This is an experimental report format. Please send feedback in!
Talk to us at kernelci(a)lists.linux.dev
Made with love by the KernelCI team - https://kernelci.org
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which
can require large contiguous memory (up to order=9) depending on the
implementation. This change prevents allocation failures by allowing the
system to fall back to vmalloc when contiguous memory allocation fails.
Since this buffer is only used for debugging purposes, physical memory
contiguity is not required, making vmalloc a suitable alternative.
Cc: stable(a)vger.kernel.org
Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <riel(a)surriel.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Acked-by: Andrea Righi <arighi(a)nvidia.com>
---
Changes in v2:
- Use kvfree() on the free path as well.
- Link to v1: https://lore.kernel.org/r/20250407-scx-v1-1-774ba74a2c17@debian.org
---
kernel/sched/ext.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 66bcd40a28ca1..db9af6a3c04fd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4623,7 +4623,7 @@ static void scx_ops_bypass(bool bypass)
static void free_exit_info(struct scx_exit_info *ei)
{
- kfree(ei->dump);
+ kvfree(ei->dump);
kfree(ei->msg);
kfree(ei->bt);
kfree(ei);
@@ -4639,7 +4639,7 @@ static struct scx_exit_info *alloc_exit_info(size_t exit_dump_len)
ei->bt = kcalloc(SCX_EXIT_BT_LEN, sizeof(ei->bt[0]), GFP_KERNEL);
ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL);
- ei->dump = kzalloc(exit_dump_len, GFP_KERNEL);
+ ei->dump = kvzalloc(exit_dump_len, GFP_KERNEL);
if (!ei->bt || !ei->msg || !ei->dump) {
free_exit_info(ei);
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250407-scx-11dbf94803c3
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Hello,
New build issue found on stable-rc/linux-6.12.y:
---
in vmlinux (vmlinux.lds) [logspec:kbuild,kbuild.compiler]
---
- dashboard: https://d.kernelci.org/i/maestro:9938a6d051bfcd7063bdcf21603c29bf8822a3a5
- giturl: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
- commit HEAD: 8e9508dd93587658f8f8116bc709aeb272144427
Log excerpt:
=====================================================
arm-linux-gnueabihf-ld:./arch/arm/kernel/vmlinux.lds:30: syntax error
=====================================================
# Builds where the incident occurred:
## multi_v5_defconfig on (arm):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67f50c196fa43d168f278cea
#kernelci issue maestro:9938a6d051bfcd7063bdcf21603c29bf8822a3a5
Reported-by: kernelci.org bot <bot(a)kernelci.org>
--
This is an experimental report format. Please send feedback in!
Talk to us at kernelci(a)lists.linux.dev
Made with love by the KernelCI team - https://kernelci.org
The fsl-mc bus associated to the root DPRC in a DPAA2 system exports a
device file for userspace access to the MC firmware. In case the DPRC's
local MC portal (DPMCP) is currently in use, a new DPMCP device is
allocated through the fsl_mc_portal_allocate() function.
In this case, the call to fsl_mc_portal_allocate() will fail with -EINVAL
when trying to add a device link between the root DPRC (consumer) and
the newly allocated DPMCP device (supplier). This is because the DPMCP
is a dependent of the DPRC device (the bus).
Fix this by not adding a device link in case the DPMCP is allocated for
the root DPRC's usage.
Fixes: afb77422819f ("bus: fsl-mc: automatically add a device_link on fsl_mc_[portal,object]_allocate")
Signed-off-by: Ioana Ciornei <ioana.ciornei(a)nxp.com>
Cc: stable(a)vger.kernel.org
---
drivers/bus/fsl-mc/mc-io.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/drivers/bus/fsl-mc/mc-io.c b/drivers/bus/fsl-mc/mc-io.c
index a0ad7866cbfc..cd8754763f40 100644
--- a/drivers/bus/fsl-mc/mc-io.c
+++ b/drivers/bus/fsl-mc/mc-io.c
@@ -214,12 +214,19 @@ int __must_check fsl_mc_portal_allocate(struct fsl_mc_device *mc_dev,
if (error < 0)
goto error_cleanup_resource;
- dpmcp_dev->consumer_link = device_link_add(&mc_dev->dev,
- &dpmcp_dev->dev,
- DL_FLAG_AUTOREMOVE_CONSUMER);
- if (!dpmcp_dev->consumer_link) {
- error = -EINVAL;
- goto error_cleanup_mc_io;
+ /* If the DPRC device itself tries to allocate a portal (usually for
+ * UAPI interaction), don't add a device link between them since the
+ * DPMCP device is an actual child device of the DPRC and a reverse
+ * dependency is not allowed.
+ */
+ if (mc_dev != mc_bus_dev) {
+ dpmcp_dev->consumer_link = device_link_add(&mc_dev->dev,
+ &dpmcp_dev->dev,
+ DL_FLAG_AUTOREMOVE_CONSUMER);
+ if (!dpmcp_dev->consumer_link) {
+ error = -EINVAL;
+ goto error_cleanup_mc_io;
+ }
}
*new_mc_io = mc_io;
--
2.34.1
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x cbef7442fba510b7eb229dcc9f39d3dde4a159a4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025040846-bonsai-tackiness-6500@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cbef7442fba510b7eb229dcc9f39d3dde4a159a4 Mon Sep 17 00:00:00 2001
From: Tudor Ambarus <tudor.ambarus(a)linaro.org>
Date: Fri, 17 Jan 2025 14:18:51 +0000
Subject: [PATCH] mmc: sdhci-msm: fix dev reference leaked through
of_qcom_ice_get
The driver leaks the device reference taken with
of_find_device_by_node(). Fix the leak by using devm_of_qcom_ice_get().
Fixes: c7eed31e235c ("mmc: sdhci-msm: Switch to the new ICE API")
Cc: stable(a)vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Acked-by: Ulf Hansson <ulf.hansson(a)linaro.org>
Reviewed-by: Abel Vesa <abel.vesa(a)linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Link: https://lore.kernel.org/r/20250117-qcom-ice-fix-dev-leak-v2-2-1ffa5b6884cb@…
Signed-off-by: Bjorn Andersson <andersson(a)kernel.org>
diff --git a/drivers/mmc/host/sdhci-msm.c b/drivers/mmc/host/sdhci-msm.c
index e3d39311fdc7..3fd898647237 100644
--- a/drivers/mmc/host/sdhci-msm.c
+++ b/drivers/mmc/host/sdhci-msm.c
@@ -1873,7 +1873,7 @@ static int sdhci_msm_ice_init(struct sdhci_msm_host *msm_host,
if (!(cqhci_readl(cq_host, CQHCI_CAP) & CQHCI_CAP_CS))
return 0;
- ice = of_qcom_ice_get(dev);
+ ice = devm_of_qcom_ice_get(dev);
if (ice == ERR_PTR(-EOPNOTSUPP)) {
dev_warn(dev, "Disabling inline encryption support\n");
ice = NULL;