ast_dp_is_connected() used to also check for link training success
to report the DP connector as connected. Without this check, the
physical_status is always connected. So if no monitor is present, it
will fail to read the EDID and set the default resolution to 640x480
instead of 1024x768.
Signed-off-by: Jocelyn Falempe <jfalempe(a)redhat.com>
Fixes: 2281475168d2 ("drm/ast: astdp: Perform link training during atomic_enable")
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Dave Airlie <airlied(a)redhat.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v6.12+
---
drivers/gpu/drm/ast/ast_dp.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/ast/ast_dp.c b/drivers/gpu/drm/ast/ast_dp.c
index 0e282b7b167c..30aad5c0112a 100644
--- a/drivers/gpu/drm/ast/ast_dp.c
+++ b/drivers/gpu/drm/ast/ast_dp.c
@@ -17,6 +17,12 @@ static bool ast_astdp_is_connected(struct ast_device *ast)
{
if (!ast_get_index_reg_mask(ast, AST_IO_VGACRI, 0xDF, AST_IO_VGACRDF_HPD))
return false;
+ /*
+ * HPD might be set even if no monitor is connected, so also check that
+ * the link training was successful.
+ */
+ if (!ast_get_index_reg_mask(ast, AST_IO_VGACRI, 0xDC, AST_IO_VGACRDC_LINK_SUCCESS))
+ return false;
return true;
}
base-commit: 798047e63ac970f105c49c22e6d44df901c486b2
--
2.47.1
Hi all,
I have been sent here from the linux-ide mailing list. Over there, I
have reported an issue with the 6.6 stable series of the kernel,
starting with 6.6.51. The root cause is as follows:
On 12.09.2024, commit 872f86e1757bbb0a334ee739b824e47c448f5ebc ("ata:
libata-scsi: Check ATA_QCFLAG_RTF_FILLED before using result_tf") was
applied to 6.6, adding checks of ATA_QCFLAG_RTF_FILLED to libata_scsi.
The patch seen in baseline commit
18676c6aab0863618eb35443e7b8615eea3535a9 ("ata: libata-core: Set
ATA_QCFLAG_RTF_FILLED in fill_result_tf()") should have gone together
with this.
Without it, I receive errors retrieving SMART data from SATA disks via a
C602 SAS controller, apparently because in this situation
ATA_QCFLAG_RTF_FILLED is not set.
I applied 18676c6aab0863618eb35443e7b8615eea3535a9 from baseline to
6.6.74 and the problem went away.
If you need any further information, do not hesitate to contact me
(linux user since 0.99.x, but only debugging it once every few years or
so...).
Regards
Christian
From: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
Add suffix ULL to constant 500 in order to give the compiler complete
information about the proper arithmetic to use. Notice that this
constant is used in a context that expects an expression of type
u64 (64 bits, unsigned).
The expression PCC_NUM_RETRIES * pcc_chan->latency, which at
preprocessing time translates to pcc_chan->latency; is currently
being evaluated using 32-bit arithmetic.
This is similar to commit b52f45110502
("ACPI / CPPC: Use 64-bit arithmetic instead of 32-bit")
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
---
drivers/hwmon/xgene-hwmon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hwmon/xgene-hwmon.c b/drivers/hwmon/xgene-hwmon.c
index 1e3bd129a922..43a08ddb964b 100644
--- a/drivers/hwmon/xgene-hwmon.c
+++ b/drivers/hwmon/xgene-hwmon.c
@@ -61,7 +61,7 @@
* Arbitrary retries in case the remote processor is slow to respond
* to PCC commands
*/
-#define PCC_NUM_RETRIES 500
+#define PCC_NUM_RETRIES 500ULL
#define ASYNC_MSG_FIFO_SIZE 16
#define MBOX_OP_TIMEOUTMS 1000
--
2.43.0
From: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
Add suffix ULL to constant 500 in order to give the compiler complete
information about the proper arithmetic to use. Notice that this
constant is used in a context that expects an expression of type
u64 (64 bits, unsigned).
The expression PCC_NUM_RETRIES * cppc_ss->latency, which at
preprocessing time translates to cppc_ss->latency; is currently
being evaluated using 32-bit arithmetic.
This is similar to commit b52f45110502 ("ACPI / CPPC: Use 64-bit
arithmetic instead of 32-bit")
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
---
drivers/hwmon/xgene-hwmon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hwmon/xgene-hwmon.c b/drivers/hwmon/xgene-hwmon.c
index 1e3bd129a922..43a08ddb964b 100644
--- a/drivers/hwmon/xgene-hwmon.c
+++ b/drivers/hwmon/xgene-hwmon.c
@@ -61,7 +61,7 @@
* Arbitrary retries in case the remote processor is slow to respond
* to PCC commands
*/
-#define PCC_NUM_RETRIES 500
+#define PCC_NUM_RETRIES 500ULL
#define ASYNC_MSG_FIFO_SIZE 16
#define MBOX_OP_TIMEOUTMS 1000
--
2.43.0
Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
skips charging any zswapped base pages when it failed to zswap the entire
folio.
However, when some base pages are zswapped but it failed to zswap
the entire folio, the zswap operation is rolled back.
When freeing zswap entries for those pages, zswap_entry_free() uncharges
the pages that were not previously charged, causing zswap charging to
become inconsistent.
This inconsistency triggers two warnings with following steps:
# On a machine with 64GiB of RAM and 36GiB of zswap
$ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
$ sudo reboot
Two warnings are:
in mm/memcontrol.c:163, function obj_cgroup_release():
WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
in mm/page_counter.c:60, function page_counter_cancel():
if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
new, nr_pages))
While objcg events should only be accounted for when the entire folio is
zswapped, objcg charging should be performed regardlessly.
Fix accordingly.
After resolving the inconsistency, these warnings disappear.
Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hyeonggon Yoo <42.hyeyoo(a)gmail.com>
---
v1->v2:
Fixed objcg events being accounted for on zswap failure.
Fixed the incorrect description. I misunderstood that the base pages are
going to be stored in zswap, but their zswap entries are freed immediately.
Added a comment on why it charges pages that are going to be removed
from zswap.
mm/zswap.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 6504174fbc6a..10b30ac46deb 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1568,20 +1568,26 @@ bool zswap_store(struct folio *folio)
bytes = zswap_store_page(page, objcg, pool);
if (bytes < 0)
- goto put_pool;
+ goto charge_zswap;
compressed_bytes += bytes;
}
- if (objcg) {
- obj_cgroup_charge_zswap(objcg, compressed_bytes);
+ if (objcg)
count_objcg_events(objcg, ZSWPOUT, nr_pages);
- }
atomic_long_add(nr_pages, &zswap_stored_pages);
count_vm_events(ZSWPOUT, nr_pages);
ret = true;
+charge_zswap:
+ /*
+ * Charge zswapped pages even when it failed to zswap the entire folio,
+ * because zswap_entry_free() will uncharge them anyway.
+ * Otherwise zswap charging will become inconsistent.
+ */
+ if (objcg)
+ obj_cgroup_charge_zswap(objcg, compressed_bytes);
put_pool:
zswap_pool_put(pool);
put_objcg:
--
2.47.1
From: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
Size of variable sd_gain equals four bytes - DA9150_QIF_SD_GAIN_SIZE.
Size of variable shunt_val equals two bytes - DA9150_QIF_SHUNT_VAL_SIZE.
The expression sd_gain * shunt_val is currently being evaluated using
32-bit arithmetic. So during the multiplication an overflow may occur.
As the value of type 'u64' is used as storage for the eventual result, put
ULL variable at the first position of each expression in order to give the
compiler complete information about the proper arithmetic to use. According
to C99 the guaranteed width for a variable of type 'unsigned long long' >=
64 bits.
Remove the explicit cast to u64 as it is meaningless.
Just for the sake of consistency, perform the similar trick with another
expression concerning 'iavg'.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: a419b4fd9138 ("power: Add support for DA9150 Fuel-Gauge")
Signed-off-by: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
---
drivers/power/supply/da9150-fg.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/power/supply/da9150-fg.c b/drivers/power/supply/da9150-fg.c
index 652c1f213af1..63bec706167c 100644
--- a/drivers/power/supply/da9150-fg.c
+++ b/drivers/power/supply/da9150-fg.c
@@ -247,7 +247,7 @@ static int da9150_fg_current_avg(struct da9150_fg *fg,
DA9150_QIF_SD_GAIN_SIZE);
da9150_fg_read_sync_end(fg);
- div = (u64) (sd_gain * shunt_val * 65536ULL);
+ div = 65536ULL * sd_gain * shunt_val;
do_div(div, 1000000);
- res = (u64) (iavg * 1000000ULL);
+ res = 1000000ULL * iavg;
do_div(res, div);
--
2.43.0
From: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
The function ib_umem_find_best_pgsz may return a value of zero.
If this occurs, the function ib_umem_num_dma_blocks could attempt
to divide by zero, resulting in a division by zero error.
To avoid this, please add a check to ensure that the variable
is not zero before performing the division.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 6d1782919dc9 ("drm/cma: Introduce drm_gem_cma_dumb_create_internal()")
Signed-off-by: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
---
drivers/infiniband/hw/erdma/erdma_verbs.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index 51d619edb6c5..7ad38fb84661 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -781,6 +781,10 @@ static int get_mtt_entries(struct erdma_dev *dev, struct erdma_mem *mem,
mem->page_size = ib_umem_find_best_pgsz(mem->umem, req_page_size, virt);
+ if (!mem->page_size) {
+ ret = -EINVAL;
+ goto error_ret;
+ }
mem->page_offset = start & (mem->page_size - 1);
mem->mtt_nents = ib_umem_num_dma_blocks(mem->umem, mem->page_size);
mem->page_cnt = mem->mtt_nents;
mem->mtt = erdma_create_mtt(dev, MTT_SIZE(mem->page_cnt),
force_continuous);
--
2.43.0
GCC changed the default C standard dialect from gnu17 to gnu23,
which should not have impacted the kernel because it explicitly requests
the gnu11 standard in the main Makefile. However, there are certain
places in the s390 code that use their own CFLAGS without a '-std='
value, which break with this dialect change because of the kernel's own
definitions of bool, false, and true conflicting with the C23 reserved
keywords.
include/linux/stddef.h:11:9: error: cannot use keyword 'false' as enumeration constant
11 | false = 0,
| ^~~~~
include/linux/stddef.h:11:9: note: 'false' is a keyword with '-std=c23' onwards
include/linux/types.h:35:33: error: 'bool' cannot be defined via 'typedef'
35 | typedef _Bool bool;
| ^~~~
include/linux/types.h:35:33: note: 'bool' is a keyword with '-std=c23' onwards
Add '-std=gnu11' to the decompressor and purgatory CFLAGS to eliminate
these errors and make the C standard version of these areas match the
rest of the kernel.
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
---
I only see one other error in various files with a recent GCC 15.0.1
snapshot, which I can eliminate by dropping the version part of the
condition for CONFIG_GCC_ASM_FLAG_OUTPUT_BROKEN. Is this a regression of
the fix for the problem of GCC 14.2.0 or is something else doing on
here?
arch/s390/include/asm/bitops.h: Assembler messages:
arch/s390/include/asm/bitops.h:60: Error: operand 1: syntax error; missing ')' after base register
arch/s390/include/asm/bitops.h:60: Error: operand 2: syntax error; ')' not allowed here
arch/s390/include/asm/bitops.h:60: Error: junk at end of line: `,4'
arch/s390/include/asm/bitops.h:60: Error: operand 1: syntax error; missing ')' after base register
arch/s390/include/asm/bitops.h:60: Error: operand 2: syntax error; ')' not allowed here
arch/s390/include/asm/bitops.h:60: Error: junk at end of line: `,64'
arch/s390/include/asm/bitops.h:60: Error: operand 1: syntax error; missing ')' after base register
arch/s390/include/asm/bitops.h:60: Error: operand 2: syntax error; ')' not allowed here
arch/s390/include/asm/bitops.h:60: Error: junk at end of line: `,4'
make[6]: *** [scripts/Makefile.build:194: fs/gfs2/glock.o] Error 1
---
arch/s390/Makefile | 2 +-
arch/s390/purgatory/Makefile | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/s390/Makefile b/arch/s390/Makefile
index 3f25498dac65..5fae311203c2 100644
--- a/arch/s390/Makefile
+++ b/arch/s390/Makefile
@@ -22,7 +22,7 @@ KBUILD_AFLAGS_DECOMPRESSOR := $(CLANG_FLAGS) -m64 -D__ASSEMBLY__
ifndef CONFIG_AS_IS_LLVM
KBUILD_AFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO),$(aflags_dwarf))
endif
-KBUILD_CFLAGS_DECOMPRESSOR := $(CLANG_FLAGS) -m64 -O2 -mpacked-stack
+KBUILD_CFLAGS_DECOMPRESSOR := $(CLANG_FLAGS) -m64 -O2 -mpacked-stack -std=gnu11
KBUILD_CFLAGS_DECOMPRESSOR += -DDISABLE_BRANCH_PROFILING -D__NO_FORTIFY
KBUILD_CFLAGS_DECOMPRESSOR += -D__DECOMPRESSOR
KBUILD_CFLAGS_DECOMPRESSOR += -fno-delete-null-pointer-checks -msoft-float -mbackchain
diff --git a/arch/s390/purgatory/Makefile b/arch/s390/purgatory/Makefile
index 24eccaa29337..bdcf2a3b6c41 100644
--- a/arch/s390/purgatory/Makefile
+++ b/arch/s390/purgatory/Makefile
@@ -13,7 +13,7 @@ CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
$(obj)/mem.o: $(srctree)/arch/s390/lib/mem.S FORCE
$(call if_changed_rule,as_o_S)
-KBUILD_CFLAGS := -fno-strict-aliasing -Wall -Wstrict-prototypes
+KBUILD_CFLAGS := -std=gnu11 -fno-strict-aliasing -Wall -Wstrict-prototypes
KBUILD_CFLAGS += -Wno-pointer-sign -Wno-sign-compare
KBUILD_CFLAGS += -fno-zero-initialized-in-bss -fno-builtin -ffreestanding
KBUILD_CFLAGS += -Os -m64 -msoft-float -fno-common
---
base-commit: b2832409e00b6330781458d7db0080508a35a9a8
change-id: 20250122-s390-fix-std-for-gcc-15-0abfa4caf757
Best regards,
--
Nathan Chancellor <nathan(a)kernel.org>
On 28/01/2025 09:57, Jens Schmidt wrote:
> This is on Debian testing with the following kernel, built from
> the tarball on kernel.org:
>
> Linux sappc1 6.12.10 #4 SMP PREEMPT_DYNAMIC Fri Jan 17 22:17:45 CET 2025 x86_64 GNU/Linux
>
> More by chance than anything else I noticed this sysfs entry:
>
> /devices/pci0000:00/0000:00:02.0/rc/rc0/input33
> ^^
> immediately after a reboot. After letting the system run for
> a while the file name counter may well be in the thousands.
>
> I instrumented function drm_dp_cec_attach to dump the values
> of the expressions involved in the following test:
>
> if (aux->cec.adap->capabilities == cec_caps &&
> aux->cec.adap->available_log_addrs == num_las) {
>
> The result was that on every call I have
>
> aux->cec.adap->capabilities == 0b1101111110
> cec_caps == 0b0101111110
> aux->cec.adap->available_log_addrs == 4
> num_las == 4
>
> which triggers recreation of the sysfs entry.
>
> So the capabilities differ in CEC_CAP_REPLY_VENDOR_ID. If I
> clear that bit in above test, I am back to normal, getting only
>
> /devices/pci0000:00/0000:00:02.0/rc/rc0/input1
>
> and keeping that throughout the runtime of the system.
>
> Could this be related to commit
> 613f21505b25a4f43f33de00f11afc059bedde2b?
Well spotted! Yes, it is related to that commit.
I'll take a closer look at this tomorrow since this test against
cec_caps needs work.
Regards,
Hans
From: Aditya Garg <gargaditya08(a)live.com>
Before 6.13, random seed to the firmware was given based on the logic
whether the device had valid OTP or not, and such devices were found
mainly on the T2 and Apple Silicon Macs. In 6.13, the logic was changed,
and the device table was used for this purpose, so as to cover the special
case of BCM43752 chip.
During the transition, the device table for BCM4364 and BCM4355 Wi-Fi chips
which had valid OTP was not modified, thus breaking Wi-Fi on these devices.
This patch adds does the necessary changes, similar to the ones done for
other chips.
Fixes: ea11a89c3ac6 ("wifi: brcmfmac: add flag for random seed during firmware download")
Cc: stable(a)vger.kernel.org
Signed-off-by: Aditya Garg <gargaditya08(a)live.com>
---
drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
index e4395b1f8..d2caa80e9 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
@@ -2712,7 +2712,7 @@ static const struct pci_device_id brcmf_pcie_devid_table[] = {
BRCMF_PCIE_DEVICE(BRCM_PCIE_4350_DEVICE_ID, WCC),
BRCMF_PCIE_DEVICE_SUB(0x4355, BRCM_PCIE_VENDOR_ID_BROADCOM, 0x4355, WCC),
BRCMF_PCIE_DEVICE(BRCM_PCIE_4354_RAW_DEVICE_ID, WCC),
- BRCMF_PCIE_DEVICE(BRCM_PCIE_4355_DEVICE_ID, WCC),
+ BRCMF_PCIE_DEVICE(BRCM_PCIE_4355_DEVICE_ID, WCC_SEED),
BRCMF_PCIE_DEVICE(BRCM_PCIE_4356_DEVICE_ID, WCC),
BRCMF_PCIE_DEVICE(BRCM_PCIE_43567_DEVICE_ID, WCC),
BRCMF_PCIE_DEVICE(BRCM_PCIE_43570_DEVICE_ID, WCC),
@@ -2723,7 +2723,7 @@ static const struct pci_device_id brcmf_pcie_devid_table[] = {
BRCMF_PCIE_DEVICE(BRCM_PCIE_43602_2G_DEVICE_ID, WCC),
BRCMF_PCIE_DEVICE(BRCM_PCIE_43602_5G_DEVICE_ID, WCC),
BRCMF_PCIE_DEVICE(BRCM_PCIE_43602_RAW_DEVICE_ID, WCC),
- BRCMF_PCIE_DEVICE(BRCM_PCIE_4364_DEVICE_ID, WCC),
+ BRCMF_PCIE_DEVICE(BRCM_PCIE_4364_DEVICE_ID, WCC_SEED),
BRCMF_PCIE_DEVICE(BRCM_PCIE_4365_DEVICE_ID, BCA),
BRCMF_PCIE_DEVICE(BRCM_PCIE_4365_2G_DEVICE_ID, BCA),
BRCMF_PCIE_DEVICE(BRCM_PCIE_4365_5G_DEVICE_ID, BCA),
--
2.39.5 (Apple Git-154)
From: yangerkun <yangerkun(a)huawei.com>
[ Upstream commit 64a7ce76fb901bf9f9c36cf5d681328fc0fd4b5a ]
After we switch tmpfs dir operations from simple_dir_operations to
simple_offset_dir_operations, every rename happened will fill new dentry
to dest dir's maple tree(&SHMEM_I(inode)->dir_offsets->mt) with a free
key starting with octx->newx_offset, and then set newx_offset equals to
free key + 1. This will lead to infinite readdir combine with rename
happened at the same time, which fail generic/736 in xfstests(detail show
as below).
1. create 5000 files(1 2 3...) under one dir
2. call readdir(man 3 readdir) once, and get one entry
3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
4. loop 2~3, until readdir return nothing or we loop too many
times(tmpfs break test with the second condition)
We choose the same logic what commit 9b378f6ad48cf ("btrfs: fix infinite
directory reads") to fix it, record the last_index when we open dir, and
do not emit the entry which index >= last_index. The file->private_data
now used in offset dir can use directly to do this, and we also update
the last_index when we llseek the dir file.
Fixes: a2e459555c5f ("shmem: stable directory offsets")
Signed-off-by: yangerkun <yangerkun(a)huawei.com>
Link: https://lore.kernel.org/r/20240731043835.1828697-1-yangerkun@huawei.com
Reviewed-by: Chuck Lever <chuck.lever(a)oracle.com>
[brauner: only update last_index after seek when offset is zero like Jan suggested]
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
Signed-off-by: Andrea Ciprietti <ciprietti(a)google.com>
---
fs/libfs.c | 39 ++++++++++++++++++++++++++++-----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index dc0f7519045f..916c39e758b1 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -371,6 +371,15 @@ void simple_offset_destroy(struct offset_ctx *octx)
xa_destroy(&octx->xa);
}
+static int offset_dir_open(struct inode *inode, struct file *file)
+{
+ struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
+ unsigned long next_offset = (unsigned long)ctx->next_offset;
+
+ file->private_data = (void *)next_offset;
+ return 0;
+}
+
/**
* offset_dir_llseek - Advance the read position of a directory descriptor
* @file: an open directory whose position is to be updated
@@ -384,6 +393,9 @@ void simple_offset_destroy(struct offset_ctx *octx)
*/
static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
{
+ struct inode *inode = file->f_inode;
+ struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
+
switch (whence) {
case SEEK_CUR:
offset += file->f_pos;
@@ -397,7 +409,11 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
}
/* In this case, ->private_data is protected by f_pos_lock */
- file->private_data = NULL;
+ if (!offset) {
+ unsigned long next_offset = (unsigned long)ctx->next_offset;
+
+ file->private_data = (void *)next_offset;
+ }
return vfs_setpos(file, offset, U32_MAX);
}
@@ -427,7 +443,7 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
inode->i_ino, fs_umode_to_dtype(inode->i_mode));
}
-static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
+static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
{
struct offset_ctx *so_ctx = inode->i_op->get_offset_ctx(inode);
XA_STATE(xas, &so_ctx->xa, ctx->pos);
@@ -436,17 +452,21 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
while (true) {
dentry = offset_find_next(&xas);
if (!dentry)
- return ERR_PTR(-ENOENT);
+ return;
+
+ if (dentry2offset(dentry) >= last_index) {
+ dput(dentry);
+ return;
+ }
if (!offset_dir_emit(ctx, dentry)) {
dput(dentry);
- break;
+ return;
}
dput(dentry);
ctx->pos = xas.xa_index + 1;
}
- return NULL;
}
/**
@@ -473,22 +493,19 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
static int offset_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dir = file->f_path.dentry;
+ long last_index = (long)file->private_data;
lockdep_assert_held(&d_inode(dir)->i_rwsem);
if (!dir_emit_dots(file, ctx))
return 0;
- /* In this case, ->private_data is protected by f_pos_lock */
- if (ctx->pos == 2)
- file->private_data = NULL;
- else if (file->private_data == ERR_PTR(-ENOENT))
- return 0;
- file->private_data = offset_iterate_dir(d_inode(dir), ctx);
+ offset_iterate_dir(d_inode(dir), ctx, last_index);
return 0;
}
const struct file_operations simple_offset_dir_operations = {
+ .open = offset_dir_open,
.llseek = offset_dir_llseek,
.iterate_shared = offset_readdir,
.read = generic_read_dir,
--
2.48.1.262.g85cc9f2d1e-goog
The auto_parser assumed sort() was stable, but the kernel's sort() uses
heapsort, which has never been stable. After commit 0e02ca29a563
("lib/sort: optimize heapsort with double-pop variation"), the order of
equal elements changed, causing the headset to fail to work.
Fix the issue by recording the original order of elements before
sorting and using it as a tiebreaker for equal elements in the
comparison function.
Fixes: b9030a005d58 ("ALSA: hda - Use standard sort function in hda_auto_parser.c")
Reported-by: Austrum <austrum.lab(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219158
Tested-by: Austrum <austrum.lab(a)gmail.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
---
sound/pci/hda/hda_auto_parser.c | 8 +++++++-
sound/pci/hda/hda_auto_parser.h | 1 +
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/sound/pci/hda/hda_auto_parser.c b/sound/pci/hda/hda_auto_parser.c
index 84393f4f429d..8923813ce424 100644
--- a/sound/pci/hda/hda_auto_parser.c
+++ b/sound/pci/hda/hda_auto_parser.c
@@ -80,7 +80,11 @@ static int compare_input_type(const void *ap, const void *bp)
/* In case one has boost and the other one has not,
pick the one with boost first. */
- return (int)(b->has_boost_on_pin - a->has_boost_on_pin);
+ if (a->has_boost_on_pin != b->has_boost_on_pin)
+ return (int)(b->has_boost_on_pin - a->has_boost_on_pin);
+
+ /* Keep the original order */
+ return a->order - b->order;
}
/* Reorder the surround channels
@@ -400,6 +404,8 @@ int snd_hda_parse_pin_defcfg(struct hda_codec *codec,
reorder_outputs(cfg->speaker_outs, cfg->speaker_pins);
/* sort inputs in the order of AUTO_PIN_* type */
+ for (i = 0; i < cfg->num_inputs; i++)
+ cfg->inputs[i].order = i;
sort(cfg->inputs, cfg->num_inputs, sizeof(cfg->inputs[0]),
compare_input_type, NULL);
diff --git a/sound/pci/hda/hda_auto_parser.h b/sound/pci/hda/hda_auto_parser.h
index 579b11beac71..87af3d8c02f7 100644
--- a/sound/pci/hda/hda_auto_parser.h
+++ b/sound/pci/hda/hda_auto_parser.h
@@ -37,6 +37,7 @@ struct auto_pin_cfg_item {
unsigned int is_headset_mic:1;
unsigned int is_headphone_mic:1; /* Mic-only in headphone jack */
unsigned int has_boost_on_pin:1;
+ int order;
};
struct auto_pin_cfg;
--
2.34.1
When converting to folios the cleanup path of shmem_get_pages() was
missed. When a DMA remap fails and the max segment size is greater than
PAGE_SIZE it will attempt to retry the remap with a PAGE_SIZEd segment
size. The cleanup code isn't properly using the folio apis and as a
result isn't handling compound pages correctly.
v2 -> v3:
(Ville) Just use shmem_sg_free_table() as-is in the failure path of
shmem_get_pages(). shmem_sg_free_table() will clear mapping unevictable
but it will be reset when it retries in shmem_sg_alloc_table().
v1 -> v2:
(Ville) Fixed locations where we were not clearing mapping unevictable.
Cc: stable(a)vger.kernel.org
Cc: Ville Syrjala <ville.syrjala(a)linux.intel.com>
Cc: Vidya Srinivas <vidya.srinivas(a)intel.com>
Link: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13487
Link: https://lore.kernel.org/lkml/20250116135636.410164-1-bgeffon@google.com/
Fixes: 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a folio_batch")
Signed-off-by: Brian Geffon <bgeffon(a)google.com>
Suggested-by: Tomasz Figa <tfiga(a)google.com>
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index fe69f2c8527d..ae3343c81a64 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -209,8 +209,6 @@ static int shmem_get_pages(struct drm_i915_gem_object *obj)
struct address_space *mapping = obj->base.filp->f_mapping;
unsigned int max_segment = i915_sg_segment_size(i915->drm.dev);
struct sg_table *st;
- struct sgt_iter sgt_iter;
- struct page *page;
int ret;
/*
@@ -239,9 +237,7 @@ static int shmem_get_pages(struct drm_i915_gem_object *obj)
* for PAGE_SIZE chunks instead may be helpful.
*/
if (max_segment > PAGE_SIZE) {
- for_each_sgt_page(page, sgt_iter, st)
- put_page(page);
- sg_free_table(st);
+ shmem_sg_free_table(st, mapping, false, false);
kfree(st);
max_segment = PAGE_SIZE;
--
2.48.1.262.g85cc9f2d1e-goog
Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
mistakenly skipped charging any zswapped pages when a single call to
zswap_store_page() failed, even if some pages in the folio are
successfully stored in zswap.
Making things worse, these not-charged pages are uncharged in
zswap_entry_free(), making zswap charging inconsistent.
This inconsistency triggers two warnings when following these steps:
# On a machine with 64GiB of RAM and 36GiB of zswap
$ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
$ sudo reboot
Two warnings are:
in mm/memcontrol.c:163, function obj_cgroup_release():
WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
in mm/page_counter.c:60, function page_counter_cancel():
if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
new, nr_pages))
Charge zswapped pages even if some pages of the folio are not zswapped.
After resolving the inconsistency, these warnings disappear.
Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hyeonggon Yoo <42.hyeyoo(a)gmail.com>
---
mm/zswap.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 6504174fbc6a..92752cd05c75 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1568,20 +1568,20 @@ bool zswap_store(struct folio *folio)
bytes = zswap_store_page(page, objcg, pool);
if (bytes < 0)
- goto put_pool;
+ goto charge_zswap;
compressed_bytes += bytes;
}
- if (objcg) {
- obj_cgroup_charge_zswap(objcg, compressed_bytes);
- count_objcg_events(objcg, ZSWPOUT, nr_pages);
- }
-
atomic_long_add(nr_pages, &zswap_stored_pages);
count_vm_events(ZSWPOUT, nr_pages);
ret = true;
+charge_zswap:
+ if (objcg) {
+ obj_cgroup_charge_zswap(objcg, compressed_bytes);
+ count_objcg_events(objcg, ZSWPOUT, nr_pages);
+ }
put_pool:
zswap_pool_put(pool);
put_objcg:
--
2.47.1
Add everything needed to support the DSI output on Renesas r8a779h0
(V4M) SoC, and the DP output (via sn65dsi86 DSI to DP bridge) on the
Renesas grey-hawk board.
Overall the DSI and the board design is almost identical to Renesas
r8a779g0 and white-hawk board.
Note: the v4 no longer has the dts and the clk patches, as those have
been merged to renesas-devel.
Signed-off-by: Tomi Valkeinen <tomi.valkeinen+renesas(a)ideasonboard.com>
---
Changes in v5:
- Add minItems/maxItems to the top level cmms & vsps properties
- Drop "minItems: 1" when not needed
- Link to v4: https://lore.kernel.org/r/20241213-rcar-gh-dsi-v4-0-f8e41425207b@ideasonboa…
Changes in v4:
- Dropped patches merged to renesas-devel
- Added new patch "dt-bindings: display: renesas,du: Add missing
maxItems" to fix the bindings
- Add the missing maxItems to "dt-bindings: display: renesas,du: Add
r8a779h0"
- Link to v3: https://lore.kernel.org/r/20241206-rcar-gh-dsi-v3-0-d74c2166fa15@ideasonboa…
Changes in v3:
- Update "Write DPTSR only if there are more than one crtc" patch to
"Write DPTSR only if the second source exists"
- Add Laurent's Rb
- Link to v2: https://lore.kernel.org/r/20241205-rcar-gh-dsi-v2-0-42471851df86@ideasonboa…
Changes in v2:
- Add the DT binding with a new conditional block, so that we can set
only the port@0 as required
- Drop port@1 from r8a779h0.dtsi (there's no port@1)
- Add a new patch to write DPTSR only if num_crtcs > 1
- Drop RCAR_DU_FEATURE_NO_DPTSR (not needed anymore)
- Add Cc: stable to the fix, and move it as first patch
- Added the tags from reviews
- Link to v1: https://lore.kernel.org/r/20241203-rcar-gh-dsi-v1-0-738ae1a95d2a@ideasonboa…
---
Tomi Valkeinen (7):
drm/rcar-du: dsi: Fix PHY lock bit check
drm/rcar-du: Write DPTSR only if the second source exists
dt-bindings: display: renesas,du: Add missing constraints
dt-bindings: display: renesas,du: Add r8a779h0
dt-bindings: display: bridge: renesas,dsi-csi2-tx: Add r8a779h0
drm/rcar-du: dsi: Add r8a779h0 support
drm/rcar-du: Add support for r8a779h0
.../display/bridge/renesas,dsi-csi2-tx.yaml | 1 +
.../devicetree/bindings/display/renesas,du.yaml | 67 ++++++++++++++++++++--
drivers/gpu/drm/renesas/rcar-du/rcar_du_drv.c | 18 ++++++
drivers/gpu/drm/renesas/rcar-du/rcar_du_group.c | 24 ++++++--
drivers/gpu/drm/renesas/rcar-du/rcar_mipi_dsi.c | 4 +-
.../gpu/drm/renesas/rcar-du/rcar_mipi_dsi_regs.h | 1 -
6 files changed, 102 insertions(+), 13 deletions(-)
---
base-commit: adc218676eef25575469234709c2d87185ca223a
change-id: 20241008-rcar-gh-dsi-9c01f5deeac8
Best regards,
--
Tomi Valkeinen <tomi.valkeinen(a)ideasonboard.com>
There are two variables that indicate the interrupt type to be used
in the next test execution, global "irq_type" and test->irq_type.
The former is referenced from pci_endpoint_test_get_irq() to preserve
the current type for ioctl(PCITEST_GET_IRQTYPE).
In pci_endpoint_test_request_irq(), since this global variable is
referenced when an error occurs, the unintended error message is
displayed.
For example, the following message shows "MSI 3" even if the current
irq type becomes "MSI-X".
# pcitest -i 2
pci-endpoint-test 0000:01:00.0: Failed to request IRQ 30 for MSI 3
SET IRQ TYPE TO MSI-X: NOT OKAY
Fix this issue by using test->irq_type instead of global "irq_type".
Cc: stable(a)vger.kernel.org
Fixes: b2ba9225e031 ("misc: pci_endpoint_test: Avoid using module parameter to determine irqtype")
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko(a)socionext.com>
---
drivers/misc/pci_endpoint_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
index 302955c20979..a342587fc78a 100644
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -235,7 +235,7 @@ static bool pci_endpoint_test_request_irq(struct pci_endpoint_test *test)
return true;
fail:
- switch (irq_type) {
+ switch (test->irq_type) {
case IRQ_TYPE_INTX:
dev_err(dev, "Failed to request IRQ %d for Legacy\n",
pci_irq_vector(pdev, i));
--
2.25.1
Attention: Sir/Madam,
I am Cesandra Pace, a Research Assistant working with Pharma CURE Laboratory Ltd. We are Bio-pharmaceutical Company here in the U.K.I'm looking for a reliable businessman/individual in your region to represent this company in sourcing some of our basic raw materials used for the manufacturing of high quality Anti-Viral Vaccines, Cancer treatment medications and other lifesaving Pharmaceutical Products.
When I receive a response from you, I shall divulge to you my intent for your consideration.
Best regards,
Cesandra Pace
Research & Dev Dept-
Pharma CURE Laboratory Ltd UK
Attention: Sir/Madam,
I am Cesandra Pace, a Research Assistant working with Pharma CURE Laboratory Ltd. We are Bio-pharmaceutical Company here in the U.K.I'm looking for a reliable businessman/individual in your region to represent this company in sourcing some of our basic raw materials used for the manufacturing of high quality Anti-Viral Vaccines, Cancer treatment medications and other lifesaving Pharmaceutical Products.
When I receive a response from you, I shall divulge to you my intent for your consideration.
Best regards,
Cesandra Pace
Research & Dev Dept-
Pharma CURE Laboratory Ltd UK
In xfs_inactive(), xfs_reflink_cancel_cow_range() is called
without error handling, risking unnoticed failures and
inconsistent behavior compared to other parts of the code.
Fix this issue by adding an error handling for the
xfs_reflink_cancel_cow_range(), improving code robustness.
Fixes: 6231848c3aa5 ("xfs: check for cow blocks before trying to clear them")
Cc: <stable(a)vger.kernel.org> # v4.17
Reviewed-by: "Darrick J. Wong" <djwong(a)kernel.org>
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
fs/xfs/xfs_inode.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c8ad2606f928..1ff514b6c035 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1404,8 +1404,11 @@ xfs_inactive(
goto out;
/* Try to clean out the cow blocks if there are any. */
- if (xfs_inode_has_cow_data(ip))
- xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
+ if (xfs_inode_has_cow_data(ip)) {
+ error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
+ if (error)
+ goto out;
+ }
if (VFS_I(ip)->i_nlink != 0) {
/*
--
2.42.0.windows.2
reveliofuzzing reported that a SCSI_IOCTL_SEND_COMMAND ioctl with out_len
set to 0xd42, SCSI command set to ATA_16 PASS-THROUGH, ATA command set to
ATA_NOP, and protocol set to ATA_PROT_PIO, can cause ata_pio_sector() to
write outside the allocated buffer, overwriting random memory.
While a ATA device is supposed to abort a ATA_NOP command, there does seem
to be a bug either in libata-sff or QEMU, where either this status is not
set, or the status is cleared before read by ata_sff_hsm_move().
Anyway, that is most likely a separate bug.
Looking at __atapi_pio_bytes(), it already has a safety check to ensure
that __atapi_pio_bytes() cannot write outside the allocated buffer.
Add a similar check to ata_pio_sector(), such that also ata_pio_sector()
cannot write outside the allocated buffer.
Cc: stable(a)vger.kernel.org
Reported-by: reveliofuzzing <reveliofuzzing(a)gmail.com>
Closes: https://lore.kernel.org/linux-ide/CA+-ZZ_jTgxh3bS7m+KX07_EWckSnW3N2adX3KV63…
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
---
Changes since v1:
-Add stable to Cc.
drivers/ata/libata-sff.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
index 67f277e1c3bf..5a46c066abc3 100644
--- a/drivers/ata/libata-sff.c
+++ b/drivers/ata/libata-sff.c
@@ -601,7 +601,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
{
struct ata_port *ap = qc->ap;
struct page *page;
- unsigned int offset;
+ unsigned int offset, count;
if (!qc->cursg) {
qc->curbytes = qc->nbytes;
@@ -617,25 +617,27 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
page = nth_page(page, (offset >> PAGE_SHIFT));
offset %= PAGE_SIZE;
- trace_ata_sff_pio_transfer_data(qc, offset, qc->sect_size);
+ /* don't overrun current sg */
+ count = min(qc->cursg->length - qc->cursg_ofs, qc->sect_size);
+
+ trace_ata_sff_pio_transfer_data(qc, offset, count);
/*
* Split the transfer when it splits a page boundary. Note that the
* split still has to be dword aligned like all ATA data transfers.
*/
WARN_ON_ONCE(offset % 4);
- if (offset + qc->sect_size > PAGE_SIZE) {
+ if (offset + count > PAGE_SIZE) {
unsigned int split_len = PAGE_SIZE - offset;
ata_pio_xfer(qc, page, offset, split_len);
- ata_pio_xfer(qc, nth_page(page, 1), 0,
- qc->sect_size - split_len);
+ ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len);
} else {
- ata_pio_xfer(qc, page, offset, qc->sect_size);
+ ata_pio_xfer(qc, page, offset, count);
}
- qc->curbytes += qc->sect_size;
- qc->cursg_ofs += qc->sect_size;
+ qc->curbytes += count;
+ qc->cursg_ofs += count;
if (qc->cursg_ofs == qc->cursg->length) {
qc->cursg = sg_next(qc->cursg);
--
2.48.1
This series primarily adds check at relevant places in venus driver
where there are possible OOB accesses due to unexpected payload from
venus firmware. The patches describes the specific OOB possibility.
Please review and share your feedback.
Validated on sc7180(v4), rb5(v6) and db410c(v1).
Changes in v3:
- update the packet parsing logic in hfi_parser. The utility parsing api
now returns the size of data parsed, accordingly the parser adjust the
remaining bytes, taking care of OOB scenario as well (Bryan)
- Link to v2:
https://lore.kernel.org/r/20241128-venus_oob_2-v2-0-483ae0a464b8@quicinc.com
Changes in v2:
- init_codec to always update with latest payload from firmware
(Dmitry/Bryan)
- Rewrite the logic of packet parsing to consider payload size for
different packet type (Bryan)
- Consider reading sfr data till available space (Dmitry)
- Add reviewed-by tags
- Link to v1:
https://lore.kernel.org/all/20241105-venus_oob-v1-0-8d4feedfe2bb@quicinc.co…
Signed-off-by: Vikash Garodia <quic_vgarodia(a)quicinc.com>
---
Vikash Garodia (4):
media: venus: hfi_parser: add check to avoid out of bound access
media: venus: hfi_parser: refactor hfi packet parsing logic
media: venus: hfi: add check to handle incorrect queue size
media: venus: hfi: add a check to handle OOB in sfr region
drivers/media/platform/qcom/venus/hfi_parser.c | 94 +++++++++++++++++++-------
drivers/media/platform/qcom/venus/hfi_venus.c | 15 +++-
2 files changed, 82 insertions(+), 27 deletions(-)
---
base-commit: c7ccf3683ac9746b263b0502255f5ce47f64fe0a
change-id: 20241115-venus_oob_2-21708239176a
Best regards,
--
Vikash Garodia <quic_vgarodia(a)quicinc.com>
Current unmapped_area() may fail to get the available area.
For example, we have a vma range like below:
0123456789abcdef
m m A m m
Let's assume start_gap is 0x2000 and stack_guard_gap is 0x1000. And now
we are looking for free area with size 0x1000 within [0x2000, 0xd000].
The unmapped_area_topdown() could find address at 0x8000, while
unmapped_area() fails.
In original code before commit 3499a13168da ("mm/mmap: use maple tree
for unmapped_area{_topdown}"), the logic is:
* find a gap with total length, including alignment
* adjust start address with alignment
What we do now is:
* find a gap with total length, including alignment
* adjust start address with alignment
* then compare the left range with total length
This is not necessary to compare with total length again after start
address adjusted.
Also, it is not correct to minus 1 here. This may not trigger an issue
in real world, since address are usually aligned with page size.
Fixes: 58c5d0d6d522 ("mm/mmap: regression fix for unmapped_area{_topdown}")
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
CC: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
CC: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
CC: Vlastimil Babka <vbabka(a)suse.cz>
CC: Jann Horn <jannh(a)google.com>
CC: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
Cc: <stable(a)vger.kernel.org>
---
mm/vma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vma.c b/mm/vma.c
index 3f45a245e31b..d82fdbc710b0 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2668,7 +2668,7 @@ unsigned long unmapped_area(struct vm_unmapped_area_info *info)
gap += (info->align_offset - gap) & info->align_mask;
tmp = vma_next(&vmi);
if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
- if (vm_start_gap(tmp) < gap + length - 1) {
+ if (vm_start_gap(tmp) < gap + info->length) {
low_limit = tmp->vm_end;
vma_iter_reset(&vmi);
goto retry;
--
2.34.1
We should select PCI_HYPERV here, otherwise it's possible for devices to
not show up as expected, at least not in an orderly manner.
Cc: stable(a)vger.kernel.org
Cc: Wei Liu <wei.liu(a)kernel.org>
Signed-off-by: Hamza Mahfooz <hamzamahfooz(a)linux.microsoft.com>
---
drivers/hv/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 862c47b191af..6ee75b3f0fa6 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -9,6 +9,7 @@ config HYPERV
select PARAVIRT
select X86_HV_CALLBACK_VECTOR if X86
select OF_EARLY_FLATTREE if OF
+ select PCI_HYPERV if PCI
help
Select this option to run Linux as a Hyper-V client operating
system.
--
2.47.1
When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
critical thresholds for particular DIMM the driver should return an
error to the userspace instead of giving it stale (best case) or wrong
(the structure contains all zeros after kzalloc() call) data.
The issue can be reproduced by binding the peci driver while the host is
fully booted and idle, this makes PECI interaction unreliable enough.
Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paul Fertser <fercerpav(a)gmail.com>
---
drivers/hwmon/peci/dimmtemp.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/drivers/hwmon/peci/dimmtemp.c b/drivers/hwmon/peci/dimmtemp.c
index d6762259dd69..fbe82d9852e0 100644
--- a/drivers/hwmon/peci/dimmtemp.c
+++ b/drivers/hwmon/peci/dimmtemp.c
@@ -127,8 +127,6 @@ static int update_thresholds(struct peci_dimmtemp *priv, int dimm_no)
return 0;
ret = priv->gen_info->read_thresholds(priv, dimm_order, chan_rank, &data);
- if (ret == -ENODATA) /* Use default or previous value */
- return 0;
if (ret)
return ret;
@@ -509,11 +507,11 @@ read_thresholds_icx(struct peci_dimmtemp *priv, int dimm_order, int chan_rank, u
ret = peci_ep_pci_local_read(priv->peci_dev, 0, 13, 0, 2, 0xd4, ®_val);
if (ret || !(reg_val & BIT(31)))
- return -ENODATA; /* Use default or previous value */
+ return -ENODATA;
ret = peci_ep_pci_local_read(priv->peci_dev, 0, 13, 0, 2, 0xd0, ®_val);
if (ret)
- return -ENODATA; /* Use default or previous value */
+ return -ENODATA;
/*
* Device 26, Offset 224e0: IMC 0 channel 0 -> rank 0
@@ -546,11 +544,11 @@ read_thresholds_spr(struct peci_dimmtemp *priv, int dimm_order, int chan_rank, u
ret = peci_ep_pci_local_read(priv->peci_dev, 0, 30, 0, 2, 0xd4, ®_val);
if (ret || !(reg_val & BIT(31)))
- return -ENODATA; /* Use default or previous value */
+ return -ENODATA;
ret = peci_ep_pci_local_read(priv->peci_dev, 0, 30, 0, 2, 0xd0, ®_val);
if (ret)
- return -ENODATA; /* Use default or previous value */
+ return -ENODATA;
/*
* Device 26, Offset 219a8: IMC 0 channel 0 -> rank 0
--
2.34.1
From: Kan Liang <kan.liang(a)linux.intel.com>
The EAX of the CPUID Leaf 023H enumerates the mask of valid sub-leaves.
To tell the availability of the sub-leaf 1 (enumerate the counter mask),
perf should check the bit 1 (0x2) of EAS, rather than bit 0 (0x1).
The error is not user-visible on bare metal. Because the sub-leaf 0 and
the sub-leaf 1 are always available. However, it may bring issues in a
virtualization environment when a VMM only enumerates the sub-leaf 0.
Fixes: eb467aaac21e ("perf/x86/intel: Support Architectural PerfMon Extension leaf")
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/events/intel/core.c | 4 ++--
arch/x86/include/asm/perf_event.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5e8521a54474..12eb96219740 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4966,8 +4966,8 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
if (ebx & ARCH_PERFMON_EXT_EQ)
pmu->config_mask |= ARCH_PERFMON_EVENTSEL_EQ;
- if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF_BIT) {
- cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
+ if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF) {
+ cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF_BIT,
&eax, &ebx, &ecx, &edx);
pmu->cntr_mask64 = eax;
pmu->fixed_cntr_mask64 = ebx;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index adaeb8ca3a8a..71e2ae021374 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -197,7 +197,7 @@ union cpuid10_edx {
#define ARCH_PERFMON_EXT_UMASK2 0x1
#define ARCH_PERFMON_EXT_EQ 0x2
#define ARCH_PERFMON_NUM_COUNTER_LEAF_BIT 0x1
-#define ARCH_PERFMON_NUM_COUNTER_LEAF 0x1
+#define ARCH_PERFMON_NUM_COUNTER_LEAF BIT(ARCH_PERFMON_NUM_COUNTER_LEAF_BIT)
/*
* Intel Architectural LBR CPUID detection/enumeration details:
--
2.40.1
Recently, a few issues linked to MPTCP have been reported by syzbot. All
the remaining ones are addressed in this series.
- Patch 1: Address "KMSAN: uninit-value in mptcp_incoming_options (2)".
A fix for v5.11.
- Patch 2: Address "WARNING in mptcp_pm_nl_set_flags (2)". A fix for
v5.18.
- Patch 3: Address "WARNING in __mptcp_clean_una (2)". A fix for v6.4,
backported up to v6.1.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (1):
mptcp: pm: only set fullmesh for subflow endp
Paolo Abeni (2):
mptcp: consolidate suboption status
mptcp: handle fastopen disconnect correctly
net/mptcp/options.c | 13 +++++--------
net/mptcp/pm_netlink.c | 3 ++-
net/mptcp/protocol.c | 4 +++-
net/mptcp/protocol.h | 30 ++++++++++++++++--------------
4 files changed, 26 insertions(+), 24 deletions(-)
---
base-commit: 15a901361ec3fb1c393f91880e1cbf24ec0a88bd
change-id: 20250123-net-mptcp-syzbot-issues-ffdbbb9b85d7
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
The patch titled
Subject: kfence: skip __GFP_THISNODE allocations on NUMA systems
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
kfence-skip-__gfp_thisnode-allocations-on-numa-systems.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Marco Elver <elver(a)google.com>
Subject: kfence: skip __GFP_THISNODE allocations on NUMA systems
Date: Fri, 24 Jan 2025 13:01:38 +0100
On NUMA systems, __GFP_THISNODE indicates that an allocation _must_ be on
a particular node, and failure to allocate on the desired node will result
in a failed allocation.
Skip __GFP_THISNODE allocations if we are running on a NUMA system, since
KFENCE can't guarantee which node its pool pages are allocated on.
Link: https://lkml.kernel.org/r/20250124120145.410066-1-elver@google.com
Fixes: 236e9f153852 ("kfence: skip all GFP_ZONEMASK allocations")
Signed-off-by: Marco Elver <elver(a)google.com>
Reported-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Chistoph Lameter <cl(a)linux.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kfence/core.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/kfence/core.c~kfence-skip-__gfp_thisnode-allocations-on-numa-systems
+++ a/mm/kfence/core.c
@@ -21,6 +21,7 @@
#include <linux/log2.h>
#include <linux/memblock.h>
#include <linux/moduleparam.h>
+#include <linux/nodemask.h>
#include <linux/notifier.h>
#include <linux/panic_notifier.h>
#include <linux/random.h>
@@ -1084,6 +1085,7 @@ void *__kfence_alloc(struct kmem_cache *
* properties (e.g. reside in DMAable memory).
*/
if ((flags & GFP_ZONEMASK) ||
+ ((flags & __GFP_THISNODE) && num_online_nodes() > 1) ||
(s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32))) {
atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]);
return NULL;
_
Patches currently in -mm which might be from elver(a)google.com are
kfence-skip-__gfp_thisnode-allocations-on-numa-systems.patch
The patch titled
Subject: nilfs2: fix possible int overflows in nilfs_fiemap()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
nilfs2-fix-possible-int-overflows-in-nilfs_fiemap.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
Subject: nilfs2: fix possible int overflows in nilfs_fiemap()
Date: Sat, 25 Jan 2025 07:20:53 +0900
Since nilfs_bmap_lookup_contig() in nilfs_fiemap() calculates its result
by being prepared to go through potentially maxblocks == INT_MAX blocks,
the value in n may experience an overflow caused by left shift of blkbits.
While it is extremely unlikely to occur, play it safe and cast right hand
expression to wider type to mitigate the issue.
Found by Linux Verification Center (linuxtesting.org) with static analysis
tool SVACE.
Link: https://lkml.kernel.org/r/20250124222133.5323-1-konishi.ryusuke@gmail.com
Fixes: 622daaff0a89 ("nilfs2: fiemap support")
Signed-off-by: Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/inode.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/fs/nilfs2/inode.c~nilfs2-fix-possible-int-overflows-in-nilfs_fiemap
+++ a/fs/nilfs2/inode.c
@@ -1186,7 +1186,7 @@ int nilfs_fiemap(struct inode *inode, st
if (size) {
if (phys && blkphy << blkbits == phys + size) {
/* The current extent goes on */
- size += n << blkbits;
+ size += (u64)n << blkbits;
} else {
/* Terminate the current extent */
ret = fiemap_fill_next_extent(
@@ -1199,14 +1199,14 @@ int nilfs_fiemap(struct inode *inode, st
flags = FIEMAP_EXTENT_MERGED;
logical = blkoff << blkbits;
phys = blkphy << blkbits;
- size = n << blkbits;
+ size = (u64)n << blkbits;
}
} else {
/* Start a new extent */
flags = FIEMAP_EXTENT_MERGED;
logical = blkoff << blkbits;
phys = blkphy << blkbits;
- size = n << blkbits;
+ size = (u64)n << blkbits;
}
blkoff += n;
}
_
Patches currently in -mm which might be from n.zhandarovich(a)fintech.ru are
nilfs2-fix-possible-int-overflows-in-nilfs_fiemap.patch
Having the exec queue snapshot inside a "GuC CT" section was always
wrong. Commit c28fd6c358db ("drm/xe/devcoredump: Improve section
headings and add tile info") tried to fix that bug, but with that also
broke the mesa tool that parses the devcoredump, hence it was reverted
in commit 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa
debug tool").
With the mesa tool also fixed, this can propagate as a fix on both
kernel and userspace side to avoid unnecessary headache for a debug
feature.
Cc: John Harrison <John.C.Harrison(a)Intel.com>
Cc: Julia Filipchuk <julia.filipchuk(a)intel.com>
Cc: José Roberto de Souza <jose.souza(a)intel.com>
Cc: stable(a)vger.kernel.org
Fixes: 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa debug tool")
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 81dc7795c0651..a7946a76777e7 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -119,11 +119,7 @@ static ssize_t __xe_devcoredump_read(char *buffer, size_t count,
drm_puts(&p, "\n**** GuC CT ****\n");
xe_guc_ct_snapshot_print(ss->guc.ct, &p);
- /*
- * Don't add a new section header here because the mesa debug decoder
- * tool expects the context information to be in the 'GuC CT' section.
- */
- /* drm_puts(&p, "\n**** Contexts ****\n"); */
+ drm_puts(&p, "\n**** Contexts ****\n");
xe_guc_exec_queue_snapshot_print(ss->ge, &p);
drm_puts(&p, "\n**** Job ****\n");
--
2.48.0
From: Lucas De Marchi <lucas.demarchi(a)intel.com>
Commit 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa
debug tool") partially reverted some changes to workaround breakage
caused to mesa tools. However, in doing so it also broke fetching the
GuC log via debugfs since xe_print_blob_ascii85() simply bails out.
The fix is to avoid the extra newlines: the devcoredump interface is
line-oriented and adding random newlines in the middle breaks it. If a
tool is able to parse it by looking at the data and checking for chars
that are out of the ascii85 space, it can still do so. A format change
that breaks the line-oriented output on devcoredump however needs better
coordination with existing tools.
v2:
- added suffix description comment
Reviewed-by: José Roberto de Souza <jose.souza(a)intel.com>
Cc: John Harrison <John.C.Harrison(a)Intel.com>
Cc: Julia Filipchuk <julia.filipchuk(a)intel.com>
Cc: José Roberto de Souza <jose.souza(a)intel.com>
Cc: stable(a)vger.kernel.org
Fixes: 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa debug tool")
Fixes: ec1455ce7e35 ("drm/xe/devcoredump: Add ASCII85 dump helper function")
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 33 +++++++++++------------------
drivers/gpu/drm/xe/xe_devcoredump.h | 2 +-
drivers/gpu/drm/xe/xe_guc_ct.c | 3 ++-
drivers/gpu/drm/xe/xe_guc_log.c | 4 +++-
4 files changed, 18 insertions(+), 24 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 81dc7795c0651..6f73b1ba0f2aa 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -395,42 +395,33 @@ int xe_devcoredump_init(struct xe_device *xe)
/**
* xe_print_blob_ascii85 - print a BLOB to some useful location in ASCII85
*
- * The output is split to multiple lines because some print targets, e.g. dmesg
- * cannot handle arbitrarily long lines. Note also that printing to dmesg in
- * piece-meal fashion is not possible, each separate call to drm_puts() has a
- * line-feed automatically added! Therefore, the entire output line must be
- * constructed in a local buffer first, then printed in one atomic output call.
+ * The output is split to multiple print calls because some print targets, e.g.
+ * dmesg cannot handle arbitrarily long lines. These targets may add newline
+ * between calls.
*
* There is also a scheduler yield call to prevent the 'task has been stuck for
* 120s' kernel hang check feature from firing when printing to a slow target
* such as dmesg over a serial port.
*
- * TODO: Add compression prior to the ASCII85 encoding to shrink huge buffers down.
- *
* @p: the printer object to output to
* @prefix: optional prefix to add to output string
+ * @suffix: optional suffix to add at the end. 0 disables it and is
+ * not added to the output, which is useful when using multiple calls
+ * to dump data to @p
* @blob: the Binary Large OBject to dump out
* @offset: offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)
* @size: the size in bytes of the BLOB, must be a multiple of sizeof(u32)
*/
-void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
+void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix,
const void *blob, size_t offset, size_t size)
{
const u32 *blob32 = (const u32 *)blob;
char buff[ASCII85_BUFSZ], *line_buff;
size_t line_pos = 0;
- /*
- * Splitting blobs across multiple lines is not compatible with the mesa
- * debug decoder tool. Note that even dropping the explicit '\n' below
- * doesn't help because the GuC log is so big some underlying implementation
- * still splits the lines at 512K characters. So just bail completely for
- * the moment.
- */
- return;
-
#define DMESG_MAX_LINE_LEN 800
-#define MIN_SPACE (ASCII85_BUFSZ + 2) /* 85 + "\n\0" */
+ /* Always leave space for the suffix char and the \0 */
+#define MIN_SPACE (ASCII85_BUFSZ + 2) /* 85 + "<suffix>\0" */
if (size & 3)
drm_printf(p, "Size not word aligned: %zu", size);
@@ -462,7 +453,6 @@ void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
line_pos += strlen(line_buff + line_pos);
if ((line_pos + MIN_SPACE) >= DMESG_MAX_LINE_LEN) {
- line_buff[line_pos++] = '\n';
line_buff[line_pos++] = 0;
drm_puts(p, line_buff);
@@ -474,10 +464,11 @@ void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
}
}
+ if (suffix)
+ line_buff[line_pos++] = suffix;
+
if (line_pos) {
- line_buff[line_pos++] = '\n';
line_buff[line_pos++] = 0;
-
drm_puts(p, line_buff);
}
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.h b/drivers/gpu/drm/xe/xe_devcoredump.h
index 6a17e6d601022..5391a80a4d1ba 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.h
+++ b/drivers/gpu/drm/xe/xe_devcoredump.h
@@ -29,7 +29,7 @@ static inline int xe_devcoredump_init(struct xe_device *xe)
}
#endif
-void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
+void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix,
const void *blob, size_t offset, size_t size);
#endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 8b65c5e959cc2..50c8076b51585 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1724,7 +1724,8 @@ void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot,
snapshot->g2h_outstanding);
if (snapshot->ctb)
- xe_print_blob_ascii85(p, "CTB data", snapshot->ctb, 0, snapshot->ctb_size);
+ xe_print_blob_ascii85(p, "CTB data", '\n',
+ snapshot->ctb, 0, snapshot->ctb_size);
} else {
drm_puts(p, "CT disabled\n");
}
diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index 80151ff6a71f8..44482ea919924 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -207,8 +207,10 @@ void xe_guc_log_snapshot_print(struct xe_guc_log_snapshot *snapshot, struct drm_
remain = snapshot->size;
for (i = 0; i < snapshot->num_chunks; i++) {
size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
+ const char *prefix = i ? NULL : "Log data";
+ char suffix = i == snapshot->num_chunks - 1 ? '\n' : 0;
- xe_print_blob_ascii85(p, i ? NULL : "Log data", snapshot->copy[i], 0, size);
+ xe_print_blob_ascii85(p, prefix, suffix, snapshot->copy[i], 0, size);
remain -= size;
}
}
--
2.48.1
The patch titled
Subject: mm: kmemleak: fix upper boundary check for physical address objects
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-kmemleak-fix-upper-boundary-check-for-physical-address-objects.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Catalin Marinas <catalin.marinas(a)arm.com>
Subject: mm: kmemleak: fix upper boundary check for physical address objects
Date: Mon, 27 Jan 2025 18:42:33 +0000
Memblock allocations are registered by kmemleak separately, based on their
physical address. During the scanning stage, it checks whether an object
is within the min_low_pfn and max_low_pfn boundaries and ignores it
otherwise.
With the recent addition of __percpu pointer leak detection (commit
6c99d4eb7c5e ("kmemleak: enable tracking for percpu pointers")), kmemleak
started reporting leaks in setup_zone_pageset() and
setup_per_cpu_pageset(). These were caused by the node_data[0] object
(initialised in alloc_node_data()) ending on the PFN_PHYS(max_low_pfn)
boundary. The non-strict upper boundary check introduced by commit
84c326299191 ("mm: kmemleak: check physical address when scan") causes the
pg_data_t object to be ignored (not scanned) and the __percpu pointers it
contains to be reported as leaks.
Make the max_low_pfn upper boundary check strict when deciding whether to
ignore a physical address object and not scan it.
Link: https://lkml.kernel.org/r/20250127184233.2974311-1-catalin.marinas@arm.com
Fixes: 84c326299191 ("mm: kmemleak: check physical address when scan")
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Cc: Patrick Wang <patrick.wang.shcn(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [6.0.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmemleak.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/kmemleak.c~mm-kmemleak-fix-upper-boundary-check-for-physical-address-objects
+++ a/mm/kmemleak.c
@@ -1689,7 +1689,7 @@ static void kmemleak_scan(void)
unsigned long phys = object->pointer;
if (PHYS_PFN(phys) < min_low_pfn ||
- PHYS_PFN(phys + object->size) >= max_low_pfn)
+ PHYS_PFN(phys + object->size) > max_low_pfn)
__paint_it(object, KMEMLEAK_BLACK);
}
_
Patches currently in -mm which might be from catalin.marinas(a)arm.com are
mm-kmemleak-fix-upper-boundary-check-for-physical-address-objects.patch
When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe.
The reason is that now that uretprobe is a system call the default seccomp
filters in docker block it as they only allow a specific set of known
syscalls. This is true for other userspace applications which use seccomp
to control their syscall surface.
Since uretprobe is a "kernel implementation detail" system call which is
not used by userspace application code directly, it is impractical and
there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.
Pass this systemcall through seccomp without depending on configuration.
Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Rafael Buchbinder <rafi(a)rbk.io>
Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2A…
Cc: stable(a)vger.kernel.org
Signed-off-by: Eyal Birger <eyal.birger(a)gmail.com>
---
The following reproduction script synthetically demonstrates the problem:
cat > /tmp/x.c << EOF
char *syscalls[] = {
"write",
"exit_group",
"fstat",
};
__attribute__((noinline)) int probed(void)
{
printf("Probed\n");
return 1;
}
void apply_seccomp_filter(char **syscalls, int num_syscalls)
{
scmp_filter_ctx ctx;
ctx = seccomp_init(SCMP_ACT_KILL);
for (int i = 0; i < num_syscalls; i++) {
seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
seccomp_syscall_resolve_name(syscalls[i]), 0);
}
seccomp_load(ctx);
seccomp_release(ctx);
}
int main(int argc, char *argv[])
{
int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
apply_seccomp_filter(syscalls, num_syscalls);
probed();
return 0;
}
EOF
cat > /tmp/trace.bt << EOF
uretprobe:/tmp/x:probed
{
printf("ret=%d\n", retval);
}
EOF
gcc -o /tmp/x /tmp/x.c -lseccomp
/usr/bin/bpftrace /tmp/trace.bt &
sleep 5 # wait for uretprobe attach
/tmp/x
pkill bpftrace
rm /tmp/x /tmp/x.c /tmp/trace.bt
---
kernel/seccomp.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 385d48293a5f..10a55c9b5c18 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
this_syscall = sd ? sd->nr :
syscall_get_nr(current, current_pt_regs());
+#ifdef CONFIG_X86_64
+ if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
+ return 0;
+#endif
+
switch (mode) {
case SECCOMP_MODE_STRICT:
__secure_computing_strict(this_syscall); /* may call do_exit */
--
2.43.0
This small series adds support for non-coherent video capture buffers
on Rockchip ISP V1. Patch 1 fixes cache management for dmabuf's
allocated by dma-contig allocator. Patch 2 allows non-coherent
allocations on the rkisp1 capture queue. Some timing measurements are
provided in the commit message of patch 2.
Signed-off-by: Mikhail Rudenko <mike.rudenko(a)gmail.com>
---
Changes in v2:
- Fix vb2_dc_dmabuf_ops_{begin,end}_cpu_access() for non-coherent buffers.
- Add cache management timing information to patch 2 commit message.
- Link to v1: https://lore.kernel.org/r/20250102-b4-rkisp-noncoherent-v1-1-bba164f7132c@g…
---
Mikhail Rudenko (2):
media: videobuf2: Fix dmabuf cache sync/flush in dma-contig
media: rkisp1: Allow non-coherent video capture buffers
drivers/media/common/videobuf2/videobuf2-dma-contig.c | 14 ++++++++++++++
drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 +
2 files changed, 15 insertions(+)
---
base-commit: 94794b5ce4d90ab134b0b101a02fddf6e74c437d
change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba
Best regards,
--
Mikhail Rudenko <mike.rudenko(a)gmail.com>
Syzkaller has reported a warning in to_nfit_bus_uuid(): "only secondary
bus families can be translated". This warning is emited if the argument
is equal to NVDIMM_BUS_FAMILY_NFIT == 0. Function acpi_nfit_ctl() first
verifies that a user-provided value call_pkg->nd_family of type u64 is
not equal to 0. Then the value is converted to int, and only after that
is compared to NVDIMM_BUS_FAMILY_MAX. This can lead to passing an invalid
argument to acpi_nfit_ctl(), if call_pkg->nd_family is non-zero, while
the lower 32 bits are zero.
All checks of the input value should be applied to the original variable
call_pkg->nd_family.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 6450ddbd5d8e ("ACPI: NFIT: Define runtime firmware activation commands")
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+c80d8dc0d9fa81a3cd8c(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c80d8dc0d9fa81a3cd8c
Signed-off-by: Murad Masimov <m.masimov(a)mt-integration.ru>
---
drivers/acpi/nfit/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a5d47819b3a4..ae035b93da08 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -485,7 +485,7 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
cmd_mask = nd_desc->cmd_mask;
if (cmd == ND_CMD_CALL && call_pkg->nd_family) {
family = call_pkg->nd_family;
- if (family > NVDIMM_BUS_FAMILY_MAX ||
+ if (call_pkg->nd_family > NVDIMM_BUS_FAMILY_MAX ||
!test_bit(family, &nd_desc->bus_family_mask))
return -EINVAL;
family = array_index_nospec(family,
--
2.39.2
The CPU PMU in Apple SoCs can be configured to fire its interrupt in one
of several ways, and since Apple A11 one of the method is FIQ. Only handle
the PMC interrupt as a FIQ when the CPU PMU has been configured to fire
FIQs.
Cc: stable(a)vger.kernel.org
Fixes: c7708816c944 ("irqchip/apple-aic: Wire PMU interrupts")
Signed-off-by: Nick Chan <towinchenmi(a)gmail.com>
---
Changes in v2:
Fix the conditional to have the intented behavior of evaluating to true
only when both PMCR0_IMODE is PMCR0_IMODE_FIQ and PMCR0_IACT is set by
reverting the conditional to how it is before c7708816c944.
Link to v1: https://lore.kernel.org/asahi/20250117170227.45243-1-towinchenmi@gmail.com/T
- Nick Chan
drivers/irqchip/irq-apple-aic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-apple-aic.c b/drivers/irqchip/irq-apple-aic.c
index da5250f0155c..2b1684c60e3c 100644
--- a/drivers/irqchip/irq-apple-aic.c
+++ b/drivers/irqchip/irq-apple-aic.c
@@ -577,7 +577,8 @@ static void __exception_irq_entry aic_handle_fiq(struct pt_regs *regs)
AIC_FIQ_HWIRQ(AIC_TMR_EL02_VIRT));
}
- if (read_sysreg_s(SYS_IMP_APL_PMCR0_EL1) & PMCR0_IACT) {
+ if ((read_sysreg_s(SYS_IMP_APL_PMCR0_EL1) & (PMCR0_IMODE | PMCR0_IACT)) ==
+ (FIELD_PREP(PMCR0_IMODE, PMCR0_IMODE_FIQ) | PMCR0_IACT)) {
int irq;
if (cpumask_test_cpu(smp_processor_id(),
&aic_irqc->fiq_aff[AIC_CPU_PMU_P]->aff))
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
--
2.48.1
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Any active plane needs to have its crtc included in the atomic
state. For planes enabled via uapi that is all handler in the core.
But when we use a plane for joiner the uapi code things the plane
is disabled and therefore doesn't have a crtc. So we need to pull
those in by hand. We do it first thing in
intel_joiner_add_affected_crtcs() so that any newly added crtc will
subsequently pull in all of its joined crtcs as well.
The symptoms from failing to do this are:
- duct tape in the form of commit 1d5b09f8daf8 ("drm/i915: Fix NULL
ptr deref by checking new_crtc_state")
- the plane's hw state will get overwritten by the disabled
uapi state if it can't find the uapi counterpart plane in
the atomic state from where it should copy the correct state
Cc: stable(a)vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
---
drivers/gpu/drm/i915/display/intel_display.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index 7d68d652c1bc..2b31c8f4b7cd 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -6682,12 +6682,30 @@ static int intel_async_flip_check_hw(struct intel_atomic_state *state, struct in
static int intel_joiner_add_affected_crtcs(struct intel_atomic_state *state)
{
struct drm_i915_private *i915 = to_i915(state->base.dev);
+ const struct intel_plane_state *plane_state;
struct intel_crtc_state *crtc_state;
+ struct intel_plane *plane;
struct intel_crtc *crtc;
u8 affected_pipes = 0;
u8 modeset_pipes = 0;
int i;
+ /*
+ * Any plane which is in use by the joiner needs its crtc.
+ * Pull those in first as this will not have happened yet
+ * if the plane remains disabled according to uapi.
+ */
+ for_each_new_intel_plane_in_state(state, plane, plane_state, i) {
+ crtc = to_intel_crtc(plane_state->hw.crtc);
+ if (!crtc)
+ continue;
+
+ crtc_state = intel_atomic_get_crtc_state(&state->base, crtc);
+ if (IS_ERR(crtc_state))
+ return PTR_ERR(crtc_state);
+ }
+
+ /* Now pull in all joined crtcs */
for_each_new_intel_crtc_in_state(state, crtc, crtc_state, i) {
affected_pipes |= crtc_state->joiner_pipes;
if (intel_crtc_needs_modeset(crtc_state))
--
2.45.3
From: Lad Prabhakar <prabhakar.mahadev-lad.rj(a)bp.renesas.com>
According to the Rev.1.20 hardware manual for the RZ/Five SoC, the clock
source for HP is derived from PLL6 divided by 2. This patch corrects the
implementation by configuring HP as a fixed clock source instead of a MUX.
The `CPG_PL6_ETH_SSEL` register, which is available on the RZ/G2UL SoC, is
not present on the RZ/Five SoC, necessitating this change.
Fixes: 95d48d270305ad2c ("clk: renesas: r9a07g043: Add support for RZ/Five SoC")
Cc: stable(a)vger.kernel.org
Reported-by: Hien Huynh <hien.huynh.px(a)renesas.com>
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj(a)bp.renesas.com>
---
drivers/clk/renesas/r9a07g043-cpg.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/clk/renesas/r9a07g043-cpg.c b/drivers/clk/renesas/r9a07g043-cpg.c
index b97e9a7b9708..da5aa015790c 100644
--- a/drivers/clk/renesas/r9a07g043-cpg.c
+++ b/drivers/clk/renesas/r9a07g043-cpg.c
@@ -138,7 +138,11 @@ static const struct cpg_core_clk r9a07g043_core_clks[] __initconst = {
DEF_DIV("P2", R9A07G043_CLK_P2, CLK_PLL3_DIV2_4_2, DIVPL3A, dtable_1_32),
DEF_FIXED("M0", R9A07G043_CLK_M0, CLK_PLL3_DIV2_4, 1, 1),
DEF_FIXED("ZT", R9A07G043_CLK_ZT, CLK_PLL3_DIV2_4_2, 1, 1),
+#ifdef CONFIG_ARM64
DEF_MUX("HP", R9A07G043_CLK_HP, SEL_PLL6_2, sel_pll6_2),
+#else
+ DEF_FIXED("HP", R9A07G043_CLK_HP, CLK_PLL6_250, 1, 1),
+#endif
DEF_FIXED("SPI0", R9A07G043_CLK_SPI0, CLK_DIV_PLL3_C, 1, 2),
DEF_FIXED("SPI1", R9A07G043_CLK_SPI1, CLK_DIV_PLL3_C, 1, 4),
DEF_SD_MUX("SD0", R9A07G043_CLK_SD0, SEL_SDHI0, SEL_SDHI0_STS, sel_sdhi,
--
2.43.0
The ASTDP transmitter sometimes takes up to second for enabling the
video signal, while the timeout is only 200 msec. This results in a
kernel error message. Increase the timeout to 1 second. An example
of the error message is shown below.
[ 697.084433] ------------[ cut here ]------------
[ 697.091115] ast 0000:02:00.0: [drm] drm_WARN_ON(!__ast_dp_wait_enable(ast, enabled))
[ 697.091233] WARNING: CPU: 1 PID: 160 at drivers/gpu/drm/ast/ast_dp.c:232 ast_dp_set_enable+0x123/0x140 [ast]
[...]
[ 697.272469] RIP: 0010:ast_dp_set_enable+0x123/0x140 [ast]
[...]
[ 697.415283] Call Trace:
[ 697.420727] <TASK>
[ 697.425908] ? show_trace_log_lvl+0x196/0x2c0
[ 697.433304] ? show_trace_log_lvl+0x196/0x2c0
[ 697.440693] ? drm_atomic_helper_commit_modeset_enables+0x30a/0x470
[ 697.450115] ? ast_dp_set_enable+0x123/0x140 [ast]
[ 697.458059] ? __warn.cold+0xaf/0xca
[ 697.464713] ? ast_dp_set_enable+0x123/0x140 [ast]
[ 697.472633] ? report_bug+0x134/0x1d0
[ 697.479544] ? handle_bug+0x58/0x90
[ 697.486127] ? exc_invalid_op+0x13/0x40
[ 697.492975] ? asm_exc_invalid_op+0x16/0x20
[ 697.500224] ? preempt_count_sub+0x14/0xc0
[ 697.507473] ? ast_dp_set_enable+0x123/0x140 [ast]
[ 697.515377] ? ast_dp_set_enable+0x123/0x140 [ast]
[ 697.523227] drm_atomic_helper_commit_modeset_enables+0x30a/0x470
[ 697.532388] drm_atomic_helper_commit_tail+0x58/0x90
[ 697.540400] ast_mode_config_helper_atomic_commit_tail+0x30/0x40 [ast]
[ 697.550009] commit_tail+0xfe/0x1d0
[ 697.556547] drm_atomic_helper_commit+0x198/0x1c0
This is a cosmetical problem. Enabling the video signal still works
even with the error message. The problem has always been present, but
only recent versions of the ast driver warn about missing the timeout.
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Fixes: 4e29cc7c5c67 ("drm/ast: astdp: Replace ast_dp_set_on_off()")
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Jocelyn Falempe <jfalempe(a)redhat.com>
Cc: Dave Airlie <airlied(a)redhat.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v6.13+
---
drivers/gpu/drm/ast/ast_dp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/ast/ast_dp.c b/drivers/gpu/drm/ast/ast_dp.c
index 30aad5c0112a1..2d7482a65f62a 100644
--- a/drivers/gpu/drm/ast/ast_dp.c
+++ b/drivers/gpu/drm/ast/ast_dp.c
@@ -201,7 +201,7 @@ static bool __ast_dp_wait_enable(struct ast_device *ast, bool enabled)
if (enabled)
vgacrdf_test |= AST_IO_VGACRDF_DP_VIDEO_ENABLE;
- for (i = 0; i < 200; ++i) {
+ for (i = 0; i < 1000; ++i) {
if (i)
mdelay(1);
vgacrdf = ast_get_index_reg_mask(ast, AST_IO_VGACRI, 0xdf,
--
2.48.1
From: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
Size of variable args->pitch equals four bytes.
Size of variable args->height equals four bytes.
The expression args->pitch * args->height is currently being evaluated
using 32-bit arithmetic. During multiplication, an overflow may occur.
Above the expression args->pitch * args->height there is a check for its
minimum value. However, if args->pitch has a value greater than this
minimum, that check is insufficient.
Since a value of type 'u64' is used to store the eventual result,
cast the first variable of each expression to 'u64' to provide the
compiler with complete information about the appropriate arithmetic to use.
This is similar to commit 0f8f8a643000 ("drm/i915/gem: Detect overflow
in calculating dumb buffer size").
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 6d1782919dc9 ("drm/cma: Introduce drm_gem_cma_dumb_create_internal()")
Signed-off-by: Andrey Vatoropin <a.vatoropin(a)crpt.ru>
---
drivers/gpu/drm/drm_gem_dma_helper.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c
index 870b90b78bc4..a8862f6f702a 100644
--- a/drivers/gpu/drm/drm_gem_dma_helper.c
+++ b/drivers/gpu/drm/drm_gem_dma_helper.c
@@ -272,8 +272,8 @@ int drm_gem_dma_dumb_create_internal(struct drm_file *file_priv,
if (args->pitch < min_pitch)
args->pitch = min_pitch;
- if (args->size < args->pitch * args->height)
- args->size = args->pitch * args->height;
+ if (args->size < mul_u32_u32(args->pitch, args->height))
+ args->size = mul_u32_u32(args->pitch, args->height);
dma_obj = drm_gem_dma_create_with_handle(file_priv, drm, args->size,
&args->handle);
--
2.43.0
Previously, we incorrectly cast irq_domain->host_data directly to
mvebu_icu_msi_data. However, host_data actually stores a structure of
type msi_domain_info.
This incorrect assumption caused issues such as the thermal sensors of
the CP110 platform malfunctioning. Specifically, the translation of the
SEI interrupt to IRQ_TYPE_EDGE_RISING failed, preventing proper
interrupt handling. The following error was observed:
genirq: Setting trigger mode 4 for irq 85 failed (irq_chip_set_type_parent+0x0/0x34)
armada_thermal f2400000.system-controller:thermal-sensor@70: Cannot request threaded IRQ 85
This commit resolves the issue by first casting host_data to
msi_domain_info and then accessing the mvebu_icu_msi_data through
msi_domain_info->chip_data.
Cc: stable(a)vger.kernel.org
Fixes: d929e4db22b6 ("irqchip/irq-mvebu-icu: Prepare for real per device MSI")
Signed-off-by: Stefan Eichenberger <eichest(a)gmail.com>
---
Changes in v2:
* This patch is a v2 because it addresses the same issue as this patch:
https://lore.kernel.org/all/20241217111623.92625-1-eichest@gmail.com/
* This time it addresses the underlying issue instead of adding a
workaround (thanks to Thomas)
drivers/irqchip/irq-mvebu-icu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-mvebu-icu.c b/drivers/irqchip/irq-mvebu-icu.c
index b337f6c05f18..4eebed39880a 100644
--- a/drivers/irqchip/irq-mvebu-icu.c
+++ b/drivers/irqchip/irq-mvebu-icu.c
@@ -68,7 +68,8 @@ static int mvebu_icu_translate(struct irq_domain *d, struct irq_fwspec *fwspec,
unsigned long *hwirq, unsigned int *type)
{
unsigned int param_count = static_branch_unlikely(&legacy_bindings) ? 3 : 2;
- struct mvebu_icu_msi_data *msi_data = d->host_data;
+ struct msi_domain_info *info = d->host_data;
+ struct mvebu_icu_msi_data *msi_data = info->chip_data;
struct mvebu_icu *icu = msi_data->icu;
/* Check the count of the parameters in dt */
--
2.45.2
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 01cfc84024e9a6b619696a35d2e5662255001cd0
Gitweb: https://git.kernel.org/tip/01cfc84024e9a6b619696a35d2e5662255001cd0
Author: Waiman Long <longman(a)redhat.com>
AuthorDate: Fri, 24 Jan 2025 20:54:42 -05:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Mon, 27 Jan 2025 10:30:59 +01:00
clocksource: Use get_random_bytes() in clocksource_verify_choose_cpus()
The following bug report happened in a PREEMPT_RT kernel.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2012, name: kwatchdog
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
3 locks held by kwatchdog/2012:
#0: ffffffff8af2da60 (clocksource_mutex){+.+.}-{3:3}, at: clocksource_watchdog_kthread+0x13/0x50
#1: ffffffff8aa8d4d0 (cpu_hotplug_lock){++++}-{0:0}, at: clocksource_verify_percpu.part.0+0x5c/0x330
#2: ffff9fe02f5f33e0 ((batched_entropy_u32.lock)){+.+.}-{2:2}, at: get_random_u32+0x4f/0x110
Preemption disabled at:
[<ffffffff88c1fe56>] clocksource_verify_percpu.part.0+0x66/0x330
CPU: 33 PID: 2012 Comm: kwatchdog Not tainted 5.14.0-503.23.1.el9_5.x86_64+rt-debug #1
Call Trace:
<TASK>
__might_resched.cold+0xf4/0x12f
rt_spin_lock+0x4c/0x100
get_random_u32+0x4f/0x110
clocksource_verify_choose_cpus+0xab/0x1a0
clocksource_verify_percpu.part.0+0x6b/0x330
__clocksource_watchdog_kthread+0x193/0x1a0
clocksource_watchdog_kthread+0x18/0x50
kthread+0x114/0x140
ret_from_fork+0x2c/0x50
</TASK>
This happens due to the fact that get_random_u32() is called in
clocksource_verify_choose_cpus() with preemption disabled. If crng_ready()
is true by the time get_random_u32() is called, The batched_entropy_32
local lock will be acquired. In a PREEMPT_RT enabled kernel, it is a
rtmutex, which can't be acquireq with preemption disabled.
Fix this problem by using the less random get_random_bytes() function which
will not take any lock. In fact, it has the same random-ness as
get_random_u32_below() when crng_ready() is false.
Fixes: 7560c02bdffb ("clocksource: Check per-CPU clock synchronization when marked unstable")
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Suggested-by: Paul E. McKenney <paulmck(a)kernel.org>
Reviewed-by: Paul E. McKenney <paulmck(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20250125015442.3740588-2-longman@redhat.com
---
kernel/time/clocksource.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 77d9566..659c4b7 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -340,9 +340,13 @@ static void clocksource_verify_choose_cpus(void)
* and no replacement CPU is selected. This gracefully handles
* situations where verify_n_cpus is greater than the number of
* CPUs that are currently online.
+ *
+ * The get_random_bytes() is used here to avoid taking lock with
+ * preemption disabled.
*/
for (i = 1; i < n; i++) {
- cpu = get_random_u32_below(nr_cpu_ids);
+ get_random_bytes(&cpu, sizeof(cpu));
+ cpu %= nr_cpu_ids;
cpu = cpumask_next(cpu - 1, cpu_online_mask);
if (cpu >= nr_cpu_ids)
cpu = cpumask_first(cpu_online_mask);
From: Aradhya Bhatia <a-bhatia1(a)ti.com>
Fix the OF node pointer passed to the of_drm_find_bridge() call to find
the next bridge in the display chain.
The code to find the next panel (and create its panel-bridge) works
fine, but to find the next (non-panel) bridge does not.
To find the next bridge in the pipeline, we need to pass "np" - the OF
node pointer of the next entity in the devicetree chain. Passing
"of_node" to of_drm_find_bridge (which is what the code does currently)
will fetch the bridge for the cdns-dsi which is not what's required.
Fix that.
Fixes: e19233955d9e ("drm/bridge: Add Cadence DSI driver")
Cc: Stable List <stable(a)vger.kernel.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
Reviewed-by: Tomi Valkeinen <tomi.valkeinen(a)ideasonboard.com>
Signed-off-by: Aradhya Bhatia <a-bhatia1(a)ti.com>
---
drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c b/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
index c7a0247e06ad..2f897ea5e80a 100644
--- a/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
+++ b/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
@@ -952,7 +952,7 @@ static int cdns_dsi_attach(struct mipi_dsi_host *host,
bridge = drm_panel_bridge_add_typed(panel,
DRM_MODE_CONNECTOR_DSI);
} else {
- bridge = of_drm_find_bridge(dev->dev.of_node);
+ bridge = of_drm_find_bridge(np);
if (!bridge)
bridge = ERR_PTR(-EINVAL);
}
--
2.34.1
From: Aradhya Bhatia <a-bhatia1(a)ti.com>
Once the DSI Link and DSI Phy are initialized, the code needs to wait
for Clk and Data Lanes to be ready, before continuing configuration.
This is in accordance with the DSI Start-up procedure, found in the
Technical Reference Manual of Texas Instrument's J721E SoC[0] which
houses this DSI TX controller.
If the previous bridge (or crtc/encoder) are configured pre-maturely,
the input signal FIFO gets corrupt. This introduces a color-shift on the
display.
Allow the driver to wait for the clk and data lanes to get ready during
DSI enable.
[0]: See section 12.6.5.7.3 "Start-up Procedure" in J721E SoC TRM
TRM Link: http://www.ti.com/lit/pdf/spruil1
Fixes: e19233955d9e ("drm/bridge: Add Cadence DSI driver")
Cc: Stable List <stable(a)vger.kernel.org>
Tested-by: Dominik Haller <d.haller(a)phytec.de>
Reviewed-by: Tomi Valkeinen <tomi.valkeinen(a)ideasonboard.com>
Signed-off-by: Aradhya Bhatia <a-bhatia1(a)ti.com>
Signed-off-by: Aradhya Bhatia <aradhya.bhatia(a)linux.dev>
---
drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c b/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
index 87921a748cdb..6a77ca36cb9d 100644
--- a/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
+++ b/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
@@ -769,7 +769,7 @@ static void cdns_dsi_bridge_enable(struct drm_bridge *bridge)
struct phy_configure_opts_mipi_dphy *phy_cfg = &output->phy_opts.mipi_dphy;
unsigned long tx_byte_period;
struct cdns_dsi_cfg dsi_cfg;
- u32 tmp, reg_wakeup, div;
+ u32 tmp, reg_wakeup, div, status;
int nlanes;
if (WARN_ON(pm_runtime_get_sync(dsi->base.dev) < 0))
@@ -786,6 +786,19 @@ static void cdns_dsi_bridge_enable(struct drm_bridge *bridge)
cdns_dsi_hs_init(dsi);
cdns_dsi_init_link(dsi);
+ /*
+ * Now that the DSI Link and DSI Phy are initialized,
+ * wait for the CLK and Data Lanes to be ready.
+ */
+ tmp = CLK_LANE_RDY;
+ for (int i = 0; i < nlanes; i++)
+ tmp |= DATA_LANE_RDY(i);
+
+ if (readl_poll_timeout(dsi->regs + MCTL_MAIN_STS, status,
+ (tmp == (status & tmp)), 100, 500000))
+ dev_err(dsi->base.dev,
+ "Timed Out: DSI-DPhy Clock and Data Lanes not ready.\n");
+
writel(HBP_LEN(dsi_cfg.hbp) | HSA_LEN(dsi_cfg.hsa),
dsi->regs + VID_HSIZE1);
writel(HFP_LEN(dsi_cfg.hfp) | HACT_LEN(dsi_cfg.hact),
--
2.34.1
From: Aradhya Bhatia <a-bhatia1(a)ti.com>
The crtc_* mode parameters do not get generated (duplicated in this
case) from the regular parameters before the mode validation phase
begins.
The rest of the code conditionally uses the crtc_* parameters only
during the bridge enable phase, but sticks to the regular parameters
for mode validation. In this singular instance, however, the driver
tries to use the crtc_clock parameter even during the mode validation,
causing the validation to fail.
Allow the D-Phy config checks to use mode->clock instead of
mode->crtc_clock during mode_valid checks, like everywhere else in the
driver.
Fixes: fced5a364dee ("drm/bridge: cdns: Convert to phy framework")
Cc: Stable List <stable(a)vger.kernel.org>
Reviewed-by: Tomi Valkeinen <tomi.valkeinen(a)ideasonboard.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
Signed-off-by: Aradhya Bhatia <a-bhatia1(a)ti.com>
Signed-off-by: Aradhya Bhatia <aradhya.bhatia(a)linux.dev>
---
drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c b/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
index b0a1a6774ea6..19cc8734a4c8 100644
--- a/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
+++ b/drivers/gpu/drm/bridge/cadence/cdns-dsi-core.c
@@ -568,13 +568,14 @@ static int cdns_dsi_check_conf(struct cdns_dsi *dsi,
struct phy_configure_opts_mipi_dphy *phy_cfg = &output->phy_opts.mipi_dphy;
unsigned long dsi_hss_hsa_hse_hbp;
unsigned int nlanes = output->dev->lanes;
+ int mode_clock = (mode_valid_check ? mode->clock : mode->crtc_clock);
int ret;
ret = cdns_dsi_mode2cfg(dsi, mode, dsi_cfg, mode_valid_check);
if (ret)
return ret;
- phy_mipi_dphy_get_default_config(mode->crtc_clock * 1000,
+ phy_mipi_dphy_get_default_config(mode_clock * 1000,
mipi_dsi_pixel_format_to_bpp(output->dev->format),
nlanes, phy_cfg);
--
2.34.1
From: Thomas Gleixner <tglx(a)linutronix.de>
commit fe513c2ef0a172a58f158e2e70465c4317f0a9a2 upstream.
static_call_module_notify() triggers a WARN_ON(), when memory allocation
fails in __static_call_add_module().
That's not really justified, because the failure case must be correctly
handled by the well known call chain and the error code is passed
through to the initiating userspace application.
A memory allocation fail is not a fatal problem, but the WARN_ON() takes
the machine out when panic_on_warn is set.
Replace it with a pr_warn().
Fixes: 9183c3f9ed71 ("static_call: Add inline static call infrastructure")
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lkml.kernel.org/r/8734mf7pmb.ffs@tglx
Signed-off-by: Alexey Nepomnyashih <sdl(a)nppct.ru>
---
Backport to fix CVE-2024-49954
Link: https://nvd.nist.gov/vuln/detail/CVE-2024-49954
---
kernel/static_call.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/static_call.c b/kernel/static_call.c
index e9408409eb46..a008250e2533 100644
--- a/kernel/static_call.c
+++ b/kernel/static_call.c
@@ -431,7 +431,7 @@ static int static_call_module_notify(struct notifier_block *nb,
case MODULE_STATE_COMING:
ret = static_call_add_module(mod);
if (ret) {
- WARN(1, "Failed to allocate memory for static calls");
+ pr_warn("Failed to allocate memory for static calls\n");
static_call_del_module(mod);
}
break;
--
2.43.0
From: Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
[ Upstream commit 2a3cfb9a24a28da9cc13d2c525a76548865e182c ]
Since 'adev->dm.dc' in amdgpu_dm_fini() might turn out to be NULL
before the call to dc_enable_dmub_notifications(), check
beforehand to ensure there will not be a possible NULL-ptr-deref
there.
Also, since commit 1e88eb1b2c25 ("drm/amd/display: Drop
CONFIG_DRM_AMD_DC_HDCP") there are two separate checks for NULL in
'adev->dm.dc' before dc_deinit_callbacks() and dc_dmub_srv_destroy().
Clean up by combining them all under one 'if'.
Found by Linux Verification Center (linuxtesting.org) with static
analysis tool SVACE.
Fixes: 81927e2808be ("drm/amd/display: Support for DMUB AUX")
Signed-off-by: Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
[ To fix CVE-2024-27041, I fix the merge conflict by still using macro
CONFIG_DRM_AMD_DC_HDCP in the first adev->dm.dc check. ]
Signed-off-by: Imkanmod Khan <imkanmodkhan(a)gmail.com>
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 8dc0f70df24f..7b4d44dcb343 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1882,14 +1882,14 @@ static void amdgpu_dm_fini(struct amdgpu_device *adev)
dc_deinit_callbacks(adev->dm.dc);
#endif
- if (adev->dm.dc)
+ if (adev->dm.dc) {
dc_dmub_srv_destroy(&adev->dm.dc->ctx->dmub_srv);
-
- if (dc_enable_dmub_notifications(adev->dm.dc)) {
- kfree(adev->dm.dmub_notify);
- adev->dm.dmub_notify = NULL;
- destroy_workqueue(adev->dm.delayed_hpd_wq);
- adev->dm.delayed_hpd_wq = NULL;
+ if (dc_enable_dmub_notifications(adev->dm.dc)) {
+ kfree(adev->dm.dmub_notify);
+ adev->dm.dmub_notify = NULL;
+ destroy_workqueue(adev->dm.delayed_hpd_wq);
+ adev->dm.delayed_hpd_wq = NULL;
+ }
}
if (adev->dm.dmub_bo)
--
2.25.1
From: "Luis Henriques (SUSE)" <luis.henriques(a)linux.dev>
[ commit 23dfdb56581ad92a9967bcd720c8c23356af74c1 upstream ]
The following kernel trace can be triggered with fstest generic/629 when
executed against a filesystem with fast-commit feature enabled:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 866 Comm: mount Not tainted 6.10.0+ #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x66/0x90
register_lock_class+0x759/0x7d0
__lock_acquire+0x85/0x2630
? __find_get_block+0xb4/0x380
lock_acquire+0xd1/0x2d0
? __ext4_journal_get_write_access+0xd5/0x160
_raw_spin_lock+0x33/0x40
? __ext4_journal_get_write_access+0xd5/0x160
__ext4_journal_get_write_access+0xd5/0x160
ext4_reserve_inode_write+0x61/0xb0
__ext4_mark_inode_dirty+0x79/0x270
? ext4_ext_replay_set_iblocks+0x2f8/0x450
ext4_ext_replay_set_iblocks+0x330/0x450
ext4_fc_replay+0x14c8/0x1540
? jread+0x88/0x2e0
? rcu_is_watching+0x11/0x40
do_one_pass+0x447/0xd00
jbd2_journal_recover+0x139/0x1b0
jbd2_journal_load+0x96/0x390
ext4_load_and_init_journal+0x253/0xd40
ext4_fill_super+0x2cc6/0x3180
...
In the replay path there's an attempt to lock sbi->s_bdev_wb_lock in
function ext4_check_bdev_write_error(). Unfortunately, at this point this
spinlock has not been initialized yet. Moving it's initialization to an
earlier point in __ext4_fill_super() fixes this splat.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques(a)linux.dev>
Link: https://patch.msgid.link/20240718094356.7863-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
Cc: stable(a)kernel.org
Signed-off-by: Imkanmod Khan <imkanmodkhan(a)gmail.com>
---
fs/ext4/super.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b2591c07166..53f1deb049ec 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5264,6 +5264,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);
+ spin_lock_init(&sbi->s_bdev_wb_lock);
+
ext4_fast_commit_init(sb);
sb->s_root = NULL;
@@ -5514,7 +5516,6 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
* Save the original bdev mapping's wb_err value which could be
* used to detect the metadata async write error.
*/
- spin_lock_init(&sbi->s_bdev_wb_lock);
errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
&sbi->s_bdev_wb_err);
sb->s_bdev->bd_super = sb;
--
2.25.1
From: "Luis Henriques (SUSE)" <luis.henriques(a)linux.dev>
[ commit 23dfdb56581ad92a9967bcd720c8c23356af74c1 upstream ]
The following kernel trace can be triggered with fstest generic/629 when
executed against a filesystem with fast-commit feature enabled:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 866 Comm: mount Not tainted 6.10.0+ #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x66/0x90
register_lock_class+0x759/0x7d0
__lock_acquire+0x85/0x2630
? __find_get_block+0xb4/0x380
lock_acquire+0xd1/0x2d0
? __ext4_journal_get_write_access+0xd5/0x160
_raw_spin_lock+0x33/0x40
? __ext4_journal_get_write_access+0xd5/0x160
__ext4_journal_get_write_access+0xd5/0x160
ext4_reserve_inode_write+0x61/0xb0
__ext4_mark_inode_dirty+0x79/0x270
? ext4_ext_replay_set_iblocks+0x2f8/0x450
ext4_ext_replay_set_iblocks+0x330/0x450
ext4_fc_replay+0x14c8/0x1540
? jread+0x88/0x2e0
? rcu_is_watching+0x11/0x40
do_one_pass+0x447/0xd00
jbd2_journal_recover+0x139/0x1b0
jbd2_journal_load+0x96/0x390
ext4_load_and_init_journal+0x253/0xd40
ext4_fill_super+0x2cc6/0x3180
...
In the replay path there's an attempt to lock sbi->s_bdev_wb_lock in
function ext4_check_bdev_write_error(). Unfortunately, at this point this
spinlock has not been initialized yet. Moving it's initialization to an
earlier point in __ext4_fill_super() fixes this splat.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques(a)linux.dev>
Link: https://patch.msgid.link/20240718094356.7863-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
Cc: stable(a)kernel.org
Signed-off-by: Imkanmod Khan <imkanmodkhan(a)gmail.com>
---
fs/ext4/super.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 71ced0ada9a2..f019ce64eba4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5366,6 +5366,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);
+ spin_lock_init(&sbi->s_bdev_wb_lock);
+
ext4_fast_commit_init(sb);
sb->s_root = NULL;
@@ -5586,7 +5588,6 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
* Save the original bdev mapping's wb_err value which could be
* used to detect the metadata async write error.
*/
- spin_lock_init(&sbi->s_bdev_wb_lock);
errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
&sbi->s_bdev_wb_err);
EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
--
2.25.1
From: David Woodhouse <dwmw(a)amazon.co.uk>
[ Upstream commit 4b5bc2ec9a239bce261ffeafdd63571134102323 ]
Now that the following fix:
d0ceea662d45 ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables")
stops kernel_ident_mapping_init() from scribbling over the end of a
4KiB PGD by assuming the following 4KiB will be a userspace PGD,
there's no good reason for the kexec PGD to be part of a single
8KiB allocation with the control_code_page.
( It's not clear that that was the reason for x86_64 kexec doing it that
way in the first place either; there were no comments to that effect and
it seems to have been the case even before PTI came along. It looks like
it was just a happy accident which prevented memory corruption on kexec. )
Either way, it definitely isn't needed now. Just allocate the PGD
separately on x86_64, like i386 already does.
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Vivek Goyal <vgoyal(a)redhat.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Eric Biederman <ebiederm(a)xmission.com>
Cc: Ard Biesheuvel <ardb(a)kernel.org>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-6-dwmw2@infradead.org
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/x86/include/asm/kexec.h | 18 +++++++++---
arch/x86/kernel/machine_kexec_64.c | 45 ++++++++++++++++--------------
2 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 256eee99afc8f..e2e1ec99c9998 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -16,6 +16,7 @@
# define PAGES_NR 4
#endif
+# define KEXEC_CONTROL_PAGE_SIZE 4096
# define KEXEC_CONTROL_CODE_MAX_SIZE 2048
#ifndef __ASSEMBLY__
@@ -44,7 +45,6 @@ struct kimage;
/* Maximum address we can use for the control code buffer */
# define KEXEC_CONTROL_MEMORY_LIMIT TASK_SIZE
-# define KEXEC_CONTROL_PAGE_SIZE 4096
/* The native architecture */
# define KEXEC_ARCH KEXEC_ARCH_386
@@ -59,9 +59,6 @@ struct kimage;
/* Maximum address we can use for the control pages */
# define KEXEC_CONTROL_MEMORY_LIMIT (MAXMEM-1)
-/* Allocate one page for the pdp and the second for the code */
-# define KEXEC_CONTROL_PAGE_SIZE (4096UL + 4096UL)
-
/* The native architecture */
# define KEXEC_ARCH KEXEC_ARCH_X86_64
#endif
@@ -146,6 +143,19 @@ struct kimage_arch {
};
#else
struct kimage_arch {
+ /*
+ * This is a kimage control page, as it must not overlap with either
+ * source or destination address ranges.
+ */
+ pgd_t *pgd;
+ /*
+ * The virtual mapping of the control code page itself is used only
+ * during the transition, while the current kernel's pages are all
+ * in place. Thus the intermediate page table pages used to map it
+ * are not control pages, but instead just normal pages obtained
+ * with get_zeroed_page(). And have to be tracked (below) so that
+ * they can be freed.
+ */
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 24b6eaacc81eb..5d61a342871b5 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -149,7 +149,8 @@ static void free_transition_pgtable(struct kimage *image)
image->arch.pte = NULL;
}
-static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
+static int init_transition_pgtable(struct kimage *image, pgd_t *pgd,
+ unsigned long control_page)
{
pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
unsigned long vaddr, paddr;
@@ -160,7 +161,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
pte_t *pte;
vaddr = (unsigned long)relocate_kernel;
- paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
+ paddr = control_page;
pgd += pgd_index(vaddr);
if (!pgd_present(*pgd)) {
p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
@@ -219,7 +220,7 @@ static void *alloc_pgt_page(void *data)
return p;
}
-static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
+static int init_pgtable(struct kimage *image, unsigned long control_page)
{
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
@@ -228,12 +229,12 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
.kernpg_flag = _KERNPG_TABLE_NOENC,
};
unsigned long mstart, mend;
- pgd_t *level4p;
int result;
int i;
- level4p = (pgd_t *)__va(start_pgtable);
- clear_page(level4p);
+ image->arch.pgd = alloc_pgt_page(image);
+ if (!image->arch.pgd)
+ return -ENOMEM;
if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
info.page_flag |= _PAGE_ENC;
@@ -247,8 +248,8 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
mstart = pfn_mapped[i].start << PAGE_SHIFT;
mend = pfn_mapped[i].end << PAGE_SHIFT;
- result = kernel_ident_mapping_init(&info,
- level4p, mstart, mend);
+ result = kernel_ident_mapping_init(&info, image->arch.pgd,
+ mstart, mend);
if (result)
return result;
}
@@ -263,8 +264,8 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
mstart = image->segment[i].mem;
mend = mstart + image->segment[i].memsz;
- result = kernel_ident_mapping_init(&info,
- level4p, mstart, mend);
+ result = kernel_ident_mapping_init(&info, image->arch.pgd,
+ mstart, mend);
if (result)
return result;
@@ -274,15 +275,19 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
* Prepare EFI systab and ACPI tables for kexec kernel since they are
* not covered by pfn_mapped.
*/
- result = map_efi_systab(&info, level4p);
+ result = map_efi_systab(&info, image->arch.pgd);
if (result)
return result;
- result = map_acpi_tables(&info, level4p);
+ result = map_acpi_tables(&info, image->arch.pgd);
if (result)
return result;
- return init_transition_pgtable(image, level4p);
+ /*
+ * This must be last because the intermediate page table pages it
+ * allocates will not be control pages and may overlap the image.
+ */
+ return init_transition_pgtable(image, image->arch.pgd, control_page);
}
static void load_segments(void)
@@ -299,14 +304,14 @@ static void load_segments(void)
int machine_kexec_prepare(struct kimage *image)
{
- unsigned long start_pgtable;
+ unsigned long control_page;
int result;
/* Calculate the offsets */
- start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
+ control_page = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
/* Setup the identity mapped 64bit page table */
- result = init_pgtable(image, start_pgtable);
+ result = init_pgtable(image, control_page);
if (result)
return result;
@@ -353,13 +358,12 @@ void machine_kexec(struct kimage *image)
#endif
}
- control_page = page_address(image->control_code_page) + PAGE_SIZE;
+ control_page = page_address(image->control_code_page);
__memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
- page_list[PA_TABLE_PAGE] =
- (unsigned long)__pa(page_address(image->control_code_page));
+ page_list[PA_TABLE_PAGE] = (unsigned long)__pa(image->arch.pgd);
if (image->type == KEXEC_TYPE_DEFAULT)
page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page)
@@ -578,8 +582,7 @@ static void kexec_mark_crashkres(bool protect)
/* Don't touch the control code page used in crash_kexec().*/
control = PFN_PHYS(page_to_pfn(kexec_crash_image->control_code_page));
- /* Control code page is located in the 2nd page. */
- kexec_mark_range(crashk_res.start, control + PAGE_SIZE - 1, protect);
+ kexec_mark_range(crashk_res.start, control - 1, protect);
control += KEXEC_CONTROL_PAGE_SIZE;
kexec_mark_range(control, crashk_res.end, protect);
}
--
2.39.5
From: David Woodhouse <dwmw(a)amazon.co.uk>
[ Upstream commit 4b5bc2ec9a239bce261ffeafdd63571134102323 ]
Now that the following fix:
d0ceea662d45 ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables")
stops kernel_ident_mapping_init() from scribbling over the end of a
4KiB PGD by assuming the following 4KiB will be a userspace PGD,
there's no good reason for the kexec PGD to be part of a single
8KiB allocation with the control_code_page.
( It's not clear that that was the reason for x86_64 kexec doing it that
way in the first place either; there were no comments to that effect and
it seems to have been the case even before PTI came along. It looks like
it was just a happy accident which prevented memory corruption on kexec. )
Either way, it definitely isn't needed now. Just allocate the PGD
separately on x86_64, like i386 already does.
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Vivek Goyal <vgoyal(a)redhat.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Eric Biederman <ebiederm(a)xmission.com>
Cc: Ard Biesheuvel <ardb(a)kernel.org>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-6-dwmw2@infradead.org
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/x86/include/asm/kexec.h | 18 +++++++++---
arch/x86/kernel/machine_kexec_64.c | 45 ++++++++++++++++--------------
2 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index c9f6a6c5de3cf..d54dd7084a3e1 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -16,6 +16,7 @@
# define PAGES_NR 4
#endif
+# define KEXEC_CONTROL_PAGE_SIZE 4096
# define KEXEC_CONTROL_CODE_MAX_SIZE 2048
#ifndef __ASSEMBLY__
@@ -44,7 +45,6 @@ struct kimage;
/* Maximum address we can use for the control code buffer */
# define KEXEC_CONTROL_MEMORY_LIMIT TASK_SIZE
-# define KEXEC_CONTROL_PAGE_SIZE 4096
/* The native architecture */
# define KEXEC_ARCH KEXEC_ARCH_386
@@ -59,9 +59,6 @@ struct kimage;
/* Maximum address we can use for the control pages */
# define KEXEC_CONTROL_MEMORY_LIMIT (MAXMEM-1)
-/* Allocate one page for the pdp and the second for the code */
-# define KEXEC_CONTROL_PAGE_SIZE (4096UL + 4096UL)
-
/* The native architecture */
# define KEXEC_ARCH KEXEC_ARCH_X86_64
#endif
@@ -146,6 +143,19 @@ struct kimage_arch {
};
#else
struct kimage_arch {
+ /*
+ * This is a kimage control page, as it must not overlap with either
+ * source or destination address ranges.
+ */
+ pgd_t *pgd;
+ /*
+ * The virtual mapping of the control code page itself is used only
+ * during the transition, while the current kernel's pages are all
+ * in place. Thus the intermediate page table pages used to map it
+ * are not control pages, but instead just normal pages obtained
+ * with get_zeroed_page(). And have to be tracked (below) so that
+ * they can be freed.
+ */
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 2fa12d1dc6760..8509d809a9f1b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -149,7 +149,8 @@ static void free_transition_pgtable(struct kimage *image)
image->arch.pte = NULL;
}
-static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
+static int init_transition_pgtable(struct kimage *image, pgd_t *pgd,
+ unsigned long control_page)
{
pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
unsigned long vaddr, paddr;
@@ -160,7 +161,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
pte_t *pte;
vaddr = (unsigned long)relocate_kernel;
- paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
+ paddr = control_page;
pgd += pgd_index(vaddr);
if (!pgd_present(*pgd)) {
p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
@@ -219,7 +220,7 @@ static void *alloc_pgt_page(void *data)
return p;
}
-static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
+static int init_pgtable(struct kimage *image, unsigned long control_page)
{
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
@@ -228,12 +229,12 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
.kernpg_flag = _KERNPG_TABLE_NOENC,
};
unsigned long mstart, mend;
- pgd_t *level4p;
int result;
int i;
- level4p = (pgd_t *)__va(start_pgtable);
- clear_page(level4p);
+ image->arch.pgd = alloc_pgt_page(image);
+ if (!image->arch.pgd)
+ return -ENOMEM;
if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
info.page_flag |= _PAGE_ENC;
@@ -247,8 +248,8 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
mstart = pfn_mapped[i].start << PAGE_SHIFT;
mend = pfn_mapped[i].end << PAGE_SHIFT;
- result = kernel_ident_mapping_init(&info,
- level4p, mstart, mend);
+ result = kernel_ident_mapping_init(&info, image->arch.pgd,
+ mstart, mend);
if (result)
return result;
}
@@ -263,8 +264,8 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
mstart = image->segment[i].mem;
mend = mstart + image->segment[i].memsz;
- result = kernel_ident_mapping_init(&info,
- level4p, mstart, mend);
+ result = kernel_ident_mapping_init(&info, image->arch.pgd,
+ mstart, mend);
if (result)
return result;
@@ -274,15 +275,19 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
* Prepare EFI systab and ACPI tables for kexec kernel since they are
* not covered by pfn_mapped.
*/
- result = map_efi_systab(&info, level4p);
+ result = map_efi_systab(&info, image->arch.pgd);
if (result)
return result;
- result = map_acpi_tables(&info, level4p);
+ result = map_acpi_tables(&info, image->arch.pgd);
if (result)
return result;
- return init_transition_pgtable(image, level4p);
+ /*
+ * This must be last because the intermediate page table pages it
+ * allocates will not be control pages and may overlap the image.
+ */
+ return init_transition_pgtable(image, image->arch.pgd, control_page);
}
static void load_segments(void)
@@ -299,14 +304,14 @@ static void load_segments(void)
int machine_kexec_prepare(struct kimage *image)
{
- unsigned long start_pgtable;
+ unsigned long control_page;
int result;
/* Calculate the offsets */
- start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
+ control_page = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
/* Setup the identity mapped 64bit page table */
- result = init_pgtable(image, start_pgtable);
+ result = init_pgtable(image, control_page);
if (result)
return result;
@@ -360,13 +365,12 @@ void machine_kexec(struct kimage *image)
#endif
}
- control_page = page_address(image->control_code_page) + PAGE_SIZE;
+ control_page = page_address(image->control_code_page);
__memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
- page_list[PA_TABLE_PAGE] =
- (unsigned long)__pa(page_address(image->control_code_page));
+ page_list[PA_TABLE_PAGE] = (unsigned long)__pa(image->arch.pgd);
if (image->type == KEXEC_TYPE_DEFAULT)
page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page)
@@ -574,8 +578,7 @@ static void kexec_mark_crashkres(bool protect)
/* Don't touch the control code page used in crash_kexec().*/
control = PFN_PHYS(page_to_pfn(kexec_crash_image->control_code_page));
- /* Control code page is located in the 2nd page. */
- kexec_mark_range(crashk_res.start, control + PAGE_SIZE - 1, protect);
+ kexec_mark_range(crashk_res.start, control - 1, protect);
control += KEXEC_CONTROL_PAGE_SIZE;
kexec_mark_range(control, crashk_res.end, protect);
}
--
2.39.5
From: Bard Liao <yung-chuan.liao(a)linux.intel.com>
[ Upstream commit 569922b82ca660f8b24e705f6cf674e6b1f99cc7 ]
Each cpu DAI should associate with a widget. However, the topology might
not create the right number of DAI widgets for aggregated amps. And it
will cause NULL pointer deference.
Check that the DAI widget associated with the CPU DAI is valid to prevent
NULL pointer deference due to missing DAI widgets in topologies with
aggregated amps.
Signed-off-by: Bard Liao <yung-chuan.liao(a)linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan(a)linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi(a)linux.intel.com>
Reviewed-by: Liam Girdwood <liam.r.girdwood(a)intel.com>
Link: https://patch.msgid.link/20241203104853.56956-1-yung-chuan.liao@linux.intel…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/sof/intel/hda-dai.c | 12 ++++++++++++
sound/soc/sof/intel/hda.c | 5 +++++
2 files changed, 17 insertions(+)
diff --git a/sound/soc/sof/intel/hda-dai.c b/sound/soc/sof/intel/hda-dai.c
index 82f46ecd94301..2e58a264da556 100644
--- a/sound/soc/sof/intel/hda-dai.c
+++ b/sound/soc/sof/intel/hda-dai.c
@@ -503,6 +503,12 @@ int sdw_hda_dai_hw_params(struct snd_pcm_substream *substream,
int ret;
int i;
+ if (!w) {
+ dev_err(cpu_dai->dev, "%s widget not found, check amp link num in the topology\n",
+ cpu_dai->name);
+ return -EINVAL;
+ }
+
ops = hda_dai_get_ops(substream, cpu_dai);
if (!ops) {
dev_err(cpu_dai->dev, "DAI widget ops not set\n");
@@ -582,6 +588,12 @@ int sdw_hda_dai_hw_params(struct snd_pcm_substream *substream,
*/
for_each_rtd_cpu_dais(rtd, i, dai) {
w = snd_soc_dai_get_widget(dai, substream->stream);
+ if (!w) {
+ dev_err(cpu_dai->dev,
+ "%s widget not found, check amp link num in the topology\n",
+ dai->name);
+ return -EINVAL;
+ }
ipc4_copier = widget_to_copier(w);
memcpy(&ipc4_copier->dma_config_tlv[cpu_dai_id], dma_config_tlv,
sizeof(*dma_config_tlv));
diff --git a/sound/soc/sof/intel/hda.c b/sound/soc/sof/intel/hda.c
index 70fc08c8fc99e..f10ed4d102501 100644
--- a/sound/soc/sof/intel/hda.c
+++ b/sound/soc/sof/intel/hda.c
@@ -63,6 +63,11 @@ static int sdw_params_stream(struct device *dev,
struct snd_soc_dapm_widget *w = snd_soc_dai_get_widget(d, params_data->substream->stream);
struct snd_sof_dai_config_data data = { 0 };
+ if (!w) {
+ dev_err(dev, "%s widget not found, check amp link num in the topology\n",
+ d->name);
+ return -EINVAL;
+ }
data.dai_index = (params_data->link_id << 8) | d->id;
data.dai_data = params_data->alh_stream_id;
data.dai_node_id = data.dai_data;
--
2.39.5
From: Stas Sergeev <stsp2(a)yandex.ru>
[ Upstream commit 3ca459eaba1bf96a8c7878de84fa8872259a01e3 ]
Currently tun checks the group permission even if the user have matched.
Besides going against the usual permission semantic, this has a
very interesting implication: if the tun group is not among the
supplementary groups of the tun user, then effectively no one can
access the tun device. CAP_SYS_ADMIN still can, but its the same as
not setting the tun ownership.
This patch relaxes the group checking so that either the user match
or the group match is enough. This avoids the situation when no one
can access the device even though the ownership is properly set.
Also I simplified the logic by removing the redundant inversions:
tun_not_capable() --> !tun_capable()
Signed-off-by: Stas Sergeev <stsp2(a)yandex.ru>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Acked-by: Jason Wang <jasowang(a)redhat.com>
Link: https://patch.msgid.link/20241205073614.294773-1-stsp2@yandex.ru
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/net/tun.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 0adce9bf7a1e5..87cc7d778c3cf 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -636,14 +636,18 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
return ret;
}
-static inline bool tun_not_capable(struct tun_struct *tun)
+static inline bool tun_capable(struct tun_struct *tun)
{
const struct cred *cred = current_cred();
struct net *net = dev_net(tun->dev);
- return ((uid_valid(tun->owner) && !uid_eq(cred->euid, tun->owner)) ||
- (gid_valid(tun->group) && !in_egroup_p(tun->group))) &&
- !ns_capable(net->user_ns, CAP_NET_ADMIN);
+ if (ns_capable(net->user_ns, CAP_NET_ADMIN))
+ return 1;
+ if (uid_valid(tun->owner) && uid_eq(cred->euid, tun->owner))
+ return 1;
+ if (gid_valid(tun->group) && in_egroup_p(tun->group))
+ return 1;
+ return 0;
}
static void tun_set_real_num_queues(struct tun_struct *tun)
@@ -2838,7 +2842,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
!!(tun->flags & IFF_MULTI_QUEUE))
return -EINVAL;
- if (tun_not_capable(tun))
+ if (!tun_capable(tun))
return -EPERM;
err = security_tun_dev_open(tun->security);
if (err < 0)
--
2.39.5
From: Stas Sergeev <stsp2(a)yandex.ru>
[ Upstream commit 3ca459eaba1bf96a8c7878de84fa8872259a01e3 ]
Currently tun checks the group permission even if the user have matched.
Besides going against the usual permission semantic, this has a
very interesting implication: if the tun group is not among the
supplementary groups of the tun user, then effectively no one can
access the tun device. CAP_SYS_ADMIN still can, but its the same as
not setting the tun ownership.
This patch relaxes the group checking so that either the user match
or the group match is enough. This avoids the situation when no one
can access the device even though the ownership is properly set.
Also I simplified the logic by removing the redundant inversions:
tun_not_capable() --> !tun_capable()
Signed-off-by: Stas Sergeev <stsp2(a)yandex.ru>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Acked-by: Jason Wang <jasowang(a)redhat.com>
Link: https://patch.msgid.link/20241205073614.294773-1-stsp2@yandex.ru
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/net/tun.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c34c6f0d23efe..52ea9f81d388b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -586,14 +586,18 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
return ret;
}
-static inline bool tun_not_capable(struct tun_struct *tun)
+static inline bool tun_capable(struct tun_struct *tun)
{
const struct cred *cred = current_cred();
struct net *net = dev_net(tun->dev);
- return ((uid_valid(tun->owner) && !uid_eq(cred->euid, tun->owner)) ||
- (gid_valid(tun->group) && !in_egroup_p(tun->group))) &&
- !ns_capable(net->user_ns, CAP_NET_ADMIN);
+ if (ns_capable(net->user_ns, CAP_NET_ADMIN))
+ return 1;
+ if (uid_valid(tun->owner) && uid_eq(cred->euid, tun->owner))
+ return 1;
+ if (gid_valid(tun->group) && in_egroup_p(tun->group))
+ return 1;
+ return 0;
}
static void tun_set_real_num_queues(struct tun_struct *tun)
@@ -2772,7 +2776,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
!!(tun->flags & IFF_MULTI_QUEUE))
return -EINVAL;
- if (tun_not_capable(tun))
+ if (!tun_capable(tun))
return -EPERM;
err = security_tun_dev_open(tun->security);
if (err < 0)
--
2.39.5
From: Stas Sergeev <stsp2(a)yandex.ru>
[ Upstream commit 3ca459eaba1bf96a8c7878de84fa8872259a01e3 ]
Currently tun checks the group permission even if the user have matched.
Besides going against the usual permission semantic, this has a
very interesting implication: if the tun group is not among the
supplementary groups of the tun user, then effectively no one can
access the tun device. CAP_SYS_ADMIN still can, but its the same as
not setting the tun ownership.
This patch relaxes the group checking so that either the user match
or the group match is enough. This avoids the situation when no one
can access the device even though the ownership is properly set.
Also I simplified the logic by removing the redundant inversions:
tun_not_capable() --> !tun_capable()
Signed-off-by: Stas Sergeev <stsp2(a)yandex.ru>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Acked-by: Jason Wang <jasowang(a)redhat.com>
Link: https://patch.msgid.link/20241205073614.294773-1-stsp2@yandex.ru
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/net/tun.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 959ca6b9cd138..a85f743aa1573 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -575,14 +575,18 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
return ret;
}
-static inline bool tun_not_capable(struct tun_struct *tun)
+static inline bool tun_capable(struct tun_struct *tun)
{
const struct cred *cred = current_cred();
struct net *net = dev_net(tun->dev);
- return ((uid_valid(tun->owner) && !uid_eq(cred->euid, tun->owner)) ||
- (gid_valid(tun->group) && !in_egroup_p(tun->group))) &&
- !ns_capable(net->user_ns, CAP_NET_ADMIN);
+ if (ns_capable(net->user_ns, CAP_NET_ADMIN))
+ return 1;
+ if (uid_valid(tun->owner) && uid_eq(cred->euid, tun->owner))
+ return 1;
+ if (gid_valid(tun->group) && in_egroup_p(tun->group))
+ return 1;
+ return 0;
}
static void tun_set_real_num_queues(struct tun_struct *tun)
@@ -2718,7 +2722,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
!!(tun->flags & IFF_MULTI_QUEUE))
return -EINVAL;
- if (tun_not_capable(tun))
+ if (!tun_capable(tun))
return -EPERM;
err = security_tun_dev_open(tun->security);
if (err < 0)
--
2.39.5
From: Stas Sergeev <stsp2(a)yandex.ru>
[ Upstream commit 3ca459eaba1bf96a8c7878de84fa8872259a01e3 ]
Currently tun checks the group permission even if the user have matched.
Besides going against the usual permission semantic, this has a
very interesting implication: if the tun group is not among the
supplementary groups of the tun user, then effectively no one can
access the tun device. CAP_SYS_ADMIN still can, but its the same as
not setting the tun ownership.
This patch relaxes the group checking so that either the user match
or the group match is enough. This avoids the situation when no one
can access the device even though the ownership is properly set.
Also I simplified the logic by removing the redundant inversions:
tun_not_capable() --> !tun_capable()
Signed-off-by: Stas Sergeev <stsp2(a)yandex.ru>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Acked-by: Jason Wang <jasowang(a)redhat.com>
Link: https://patch.msgid.link/20241205073614.294773-1-stsp2@yandex.ru
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/net/tun.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ea98d93138c12..a6c9f9062dbd4 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -574,14 +574,18 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
return ret;
}
-static inline bool tun_not_capable(struct tun_struct *tun)
+static inline bool tun_capable(struct tun_struct *tun)
{
const struct cred *cred = current_cred();
struct net *net = dev_net(tun->dev);
- return ((uid_valid(tun->owner) && !uid_eq(cred->euid, tun->owner)) ||
- (gid_valid(tun->group) && !in_egroup_p(tun->group))) &&
- !ns_capable(net->user_ns, CAP_NET_ADMIN);
+ if (ns_capable(net->user_ns, CAP_NET_ADMIN))
+ return 1;
+ if (uid_valid(tun->owner) && uid_eq(cred->euid, tun->owner))
+ return 1;
+ if (gid_valid(tun->group) && in_egroup_p(tun->group))
+ return 1;
+ return 0;
}
static void tun_set_real_num_queues(struct tun_struct *tun)
@@ -2767,7 +2771,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
!!(tun->flags & IFF_MULTI_QUEUE))
return -EINVAL;
- if (tun_not_capable(tun))
+ if (!tun_capable(tun))
return -EPERM;
err = security_tun_dev_open(tun->security);
if (err < 0)
--
2.39.5
From: Suleiman Souhlal <suleiman(a)google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51ac62637e4ed..39ce8a3d8c573 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -176,13 +176,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
From: Suleiman Souhlal <suleiman(a)google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7cf45d506688c..42dad8c8d6f28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -279,13 +279,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
From: Suleiman Souhlal <suleiman(a)google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed92b75f7e024..691af58c156e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -642,13 +642,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
From: Suleiman Souhlal <suleiman(a)google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54af671e8d510..3927904984fd5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -704,13 +704,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
From: Suleiman Souhlal <suleiman(a)google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 86606fb9e6bc6..c686d826a91cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -726,13 +726,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
From: Suleiman Souhlal <suleiman(a)google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d07dc87787dff..9015c792830c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
From: Suleiman Souhlal <suleiman(a)google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e5a6bf587f91..296e77380318e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
From: Josef Bacik <josef(a)toxicpanda.com>
[ Upstream commit 6a4730b325aaa48f7a5d5ba97aff0a955e2d9cec ]
This BUG_ON is meant to catch backref cache problems, but these can
arise from either bugs in the backref cache or corruption in the extent
tree. Fix it to be a proper error.
Reviewed-by: Boris Burkov <boris(a)bur.io>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/btrfs/relocation.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 89ad7e12c08bb..062154c6a65f6 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4789,8 +4789,18 @@ int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
WARN_ON(!first_cow && level == 0);
node = rc->backref_cache.path[level];
- BUG_ON(node->bytenr != buf->start &&
- node->new_bytenr != buf->start);
+
+ /*
+ * If node->bytenr != buf->start and node->new_bytenr !=
+ * buf->start then we've got the wrong backref node for what we
+ * expected to see here and the cache is incorrect.
+ */
+ if (unlikely(node->bytenr != buf->start && node->new_bytenr != buf->start)) {
+ btrfs_err(fs_info,
+"bytenr %llu was found but our backref cache was expecting %llu or %llu",
+ buf->start, node->bytenr, node->new_bytenr);
+ return -EUCLEAN;
+ }
drop_node_buffer(node);
extent_buffer_get(cow);
--
2.39.5
From: Josef Bacik <josef(a)toxicpanda.com>
[ Upstream commit 6a4730b325aaa48f7a5d5ba97aff0a955e2d9cec ]
This BUG_ON is meant to catch backref cache problems, but these can
arise from either bugs in the backref cache or corruption in the extent
tree. Fix it to be a proper error.
Reviewed-by: Boris Burkov <boris(a)bur.io>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/btrfs/relocation.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 98e3b3749ec12..5b921e6ed94e2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3976,8 +3976,18 @@ int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
WARN_ON(!first_cow && level == 0);
node = rc->backref_cache.path[level];
- BUG_ON(node->bytenr != buf->start &&
- node->new_bytenr != buf->start);
+
+ /*
+ * If node->bytenr != buf->start and node->new_bytenr !=
+ * buf->start then we've got the wrong backref node for what we
+ * expected to see here and the cache is incorrect.
+ */
+ if (unlikely(node->bytenr != buf->start && node->new_bytenr != buf->start)) {
+ btrfs_err(fs_info,
+"bytenr %llu was found but our backref cache was expecting %llu or %llu",
+ buf->start, node->bytenr, node->new_bytenr);
+ return -EUCLEAN;
+ }
btrfs_backref_drop_node_buffer(node);
atomic_inc(&cow->refs);
--
2.39.5
From: Hao-ran Zheng <zhenghaoran154(a)gmail.com>
[ Upstream commit 5324c4e10e9c2ce307a037e904c0d9671d7137d9 ]
A data race occurs when the function `insert_ordered_extent_file_extent()`
and the function `btrfs_inode_safe_disk_i_size_write()` are executed
concurrently. The function `insert_ordered_extent_file_extent()` is not
locked when reading inode->disk_i_size, causing
`btrfs_inode_safe_disk_i_size_write()` to cause data competition when
writing inode->disk_i_size, thus affecting the value of `modify_tree`.
The specific call stack that appears during testing is as follows:
============DATA_RACE============
btrfs_drop_extents+0x89a/0xa060 [btrfs]
insert_reserved_file_extent+0xb54/0x2960 [btrfs]
insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
============OTHER_INFO============
btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
=================================
The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
EOF, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks.
Since this data race does not affect the correctness of the function,
it is a harmless data race, use data_race() to check inode->disk_i_size.
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Hao-ran Zheng <zhenghaoran154(a)gmail.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/btrfs/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 44160d4ad53e0..31b25cb2f5cc3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -731,7 +731,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
if (args->drop_cache)
btrfs_drop_extent_cache(inode, args->start, args->end - 1, 0);
- if (args->start >= inode->disk_i_size && !args->replace_extent)
+ if (data_race(args->start >= inode->disk_i_size) && !args->replace_extent)
modify_tree = 0;
update_refs = (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID);
--
2.39.5
The quilt patch titled
Subject: mm/compaction: fix UBSAN shift-out-of-bounds warning
has been removed from the -mm tree. Its filename was
mm-compaction-fix-ubsan-shift-out-of-bounds-warning.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: mm/compaction: fix UBSAN shift-out-of-bounds warning
Date: Thu, 23 Jan 2025 10:10:29 +0800
syzkaller reported a UBSAN shift-out-of-bounds warning of (1UL << order)
in isolate_freepages_block(). The bogus compound_order can be any value
because it is union with flags. Add back the MAX_PAGE_ORDER check to fix
the warning.
Link: https://lkml.kernel.org/r/20250123021029.2826736-1-liushixin2@huawei.com
Fixes: 3da0272a4c7d ("mm/compaction: correctly return failure with bogus compound_order in strict mode")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Reviewed-by: Kemeng Shi <shikemeng(a)huaweicloud.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kemeng Shi <shikemeng(a)huaweicloud.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Nanyong Sun <sunnanyong(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/compaction.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/compaction.c~mm-compaction-fix-ubsan-shift-out-of-bounds-warning
+++ a/mm/compaction.c
@@ -631,7 +631,8 @@ static unsigned long isolate_freepages_b
if (PageCompound(page)) {
const unsigned int order = compound_order(page);
- if (blockpfn + (1UL << order) <= end_pfn) {
+ if ((order <= MAX_PAGE_ORDER) &&
+ (blockpfn + (1UL << order) <= end_pfn)) {
blockpfn += (1UL << order) - 1;
page += (1UL << order) - 1;
nr_scanned += (1UL << order) - 1;
_
Patches currently in -mm which might be from liushixin2(a)huawei.com are
mm-page_isolation-avoid-call-folio_hstate-without-hugetlb_lock.patch
Also add a kunit testcase to make sure the function works correctly now
and doesn't regress in the future.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (2):
of: address: Fix empty resource handling in __of_address_resource_bounds()
of: address: Add kunit test for __of_address_resource_bounds()
drivers/of/address.c | 17 +++----
drivers/of/of_private.h | 4 ++
drivers/of/of_test.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 132 insertions(+), 9 deletions(-)
---
base-commit: ffd294d346d185b70e28b1a28abe367bbfe53c04
change-id: 20250120-of-address-overflow-a59476362885
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
From: Gabriele Monaco <gmonaco(a)redhat.com>
RV per-task monitors are implemented through a monitor structure
available for each task_struct. This structure is reset every time the
monitor is (re-)started, to avoid inconsistencies if the monitor was
activated previously.
To do so, we reset the monitor on all threads using the macro
for_each_process_thread. However, this macro excludes the idle tasks on
each CPU. Idle tasks could be considered tasks on their own right and it
should be up to the model whether to ignore them or not.
Reset monitors also on the idle tasks for each present CPU whenever we
reset all per-task monitors.
Cc: stable(a)vger.kernel.org
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: John Kacur <jkacur(a)redhat.com>
Link: https://lore.kernel.org/20250115151547.605750-2-gmonaco@redhat.com
Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
include/rv/da_monitor.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 9705b2a98e49..510c88bfabd4 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -14,6 +14,7 @@
#include <rv/automata.h>
#include <linux/rv.h>
#include <linux/bug.h>
+#include <linux/sched.h>
#ifdef CONFIG_RV_REACTORS
@@ -324,10 +325,13 @@ static inline struct da_monitor *da_get_monitor_##name(struct task_struct *tsk)
static void da_monitor_reset_all_##name(void) \
{ \
struct task_struct *g, *p; \
+ int cpu; \
\
read_lock(&tasklist_lock); \
for_each_process_thread(g, p) \
da_monitor_reset_##name(da_get_monitor_##name(p)); \
+ for_each_present_cpu(cpu) \
+ da_monitor_reset_##name(da_get_monitor_##name(idle_task(cpu))); \
read_unlock(&tasklist_lock); \
} \
\
--
2.45.2
From: Tomas Glozar <tglozar(a)redhat.com>
When using rtla timerlat with userspace threads (-u or -U), rtla
disables the OSNOISE_WORKLOAD option in
/sys/kernel/tracing/osnoise/options. This option is not re-enabled in a
subsequent run with kernel-space threads, leading to rtla collecting no
results if the previous run exited abnormally:
$ rtla timerlat top -u
^\Quit (core dumped)
$ rtla timerlat top -k -d 1s
Timer Latency
0 00:00:01 | IRQ Timer Latency (us) | Thread Timer Latency (us)
CPU COUNT | cur min avg max | cur min avg max
The issue persists until OSNOISE_WORKLOAD is set manually by running:
$ echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
Set OSNOISE_WORKLOAD when running rtla with kernel-space threads if
available to fix the issue.
Cc: stable(a)vger.kernel.org
Cc: John Kacur <jkacur(a)redhat.com>
Cc: Luis Goncalves <lgoncalv(a)redhat.com>
Link: https://lore.kernel.org/20250107144823.239782-4-tglozar@redhat.com
Fixes: cdca4f4e5e8e ("rtla/timerlat_top: Add timerlat user-space support")
Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
tools/tracing/rtla/src/timerlat_top.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/tools/tracing/rtla/src/timerlat_top.c b/tools/tracing/rtla/src/timerlat_top.c
index d358cd39f360..f387597d3ac2 100644
--- a/tools/tracing/rtla/src/timerlat_top.c
+++ b/tools/tracing/rtla/src/timerlat_top.c
@@ -851,12 +851,15 @@ timerlat_top_apply_config(struct osnoise_tool *top, struct timerlat_top_params *
}
}
- if (params->user_top) {
- retval = osnoise_set_workload(top->context, 0);
- if (retval) {
- err_msg("Failed to set OSNOISE_WORKLOAD option\n");
- goto out_err;
- }
+ /*
+ * Set workload according to type of thread if the kernel supports it.
+ * On kernels without support, user threads will have already failed
+ * on missing timerlat_fd, and kernel threads do not need it.
+ */
+ retval = osnoise_set_workload(top->context, params->kernel_workload);
+ if (retval < -1) {
+ err_msg("Failed to set OSNOISE_WORKLOAD option\n");
+ goto out_err;
}
if (isatty(STDOUT_FILENO) && !params->quiet)
--
2.45.2
From: Tomas Glozar <tglozar(a)redhat.com>
When using rtla timerlat with userspace threads (-u or -U), rtla
disables the OSNOISE_WORKLOAD option in
/sys/kernel/tracing/osnoise/options. This option is not re-enabled in a
subsequent run with kernel-space threads, leading to rtla collecting no
results if the previous run exited abnormally:
$ rtla timerlat hist -u
^\Quit (core dumped)
$ rtla timerlat hist -k -d 1s
Index
over:
count:
min:
avg:
max:
ALL: IRQ Thr Usr
count: 0 0 0
min: - - -
avg: - - -
max: - - -
The issue persists until OSNOISE_WORKLOAD is set manually by running:
$ echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
Set OSNOISE_WORKLOAD when running rtla with kernel-space threads if
available to fix the issue.
Cc: stable(a)vger.kernel.org
Cc: John Kacur <jkacur(a)redhat.com>
Cc: Luis Goncalves <lgoncalv(a)redhat.com>
Link: https://lore.kernel.org/20250107144823.239782-3-tglozar@redhat.com
Fixes: ed774f7481fa ("rtla/timerlat_hist: Add timerlat user-space support")
Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
tools/tracing/rtla/src/timerlat_hist.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 53ded187b33a..d4bd02c01c53 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -1085,12 +1085,15 @@ timerlat_hist_apply_config(struct osnoise_tool *tool, struct timerlat_hist_param
}
}
- if (params->user_hist) {
- retval = osnoise_set_workload(tool->context, 0);
- if (retval) {
- err_msg("Failed to set OSNOISE_WORKLOAD option\n");
- goto out_err;
- }
+ /*
+ * Set workload according to type of thread if the kernel supports it.
+ * On kernels without support, user threads will have already failed
+ * on missing timerlat_fd, and kernel threads do not need it.
+ */
+ retval = osnoise_set_workload(tool->context, params->kernel_workload);
+ if (retval < -1) {
+ err_msg("Failed to set OSNOISE_WORKLOAD option\n");
+ goto out_err;
}
return 0;
--
2.45.2
From: Tomas Glozar <tglozar(a)redhat.com>
Currently, when either SIGINT from the user or SIGALRM from the duration
timer is caught by rtla-timerlat, stop_tracing is set to break out of
the main loop. This is not sufficient for cases where the timerlat
tracer is producing more data than rtla can consume, since in that case,
rtla is looping indefinitely inside tracefs_iterate_raw_events, never
reaches the check of stop_tracing and hangs.
In addition to setting stop_tracing, also stop the timerlat tracer on
received signal (SIGINT or SIGALRM). This will stop new samples so that
the existing samples may be processed and tracefs_iterate_raw_events
eventually exits.
Cc: stable(a)vger.kernel.org
Cc: John Kacur <jkacur(a)redhat.com>
Cc: Luis Goncalves <lgoncalv(a)redhat.com>
Cc: Gabriele Monaco <gmonaco(a)redhat.com>
Link: https://lore.kernel.org/20250116144931.649593-4-tglozar@redhat.com
Fixes: a828cd18bc4a ("rtla: Add timerlat tool and timelart top mode")
Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
tools/tracing/rtla/src/timerlat_top.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/tools/tracing/rtla/src/timerlat_top.c b/tools/tracing/rtla/src/timerlat_top.c
index 059b468981e4..d21a21053917 100644
--- a/tools/tracing/rtla/src/timerlat_top.c
+++ b/tools/tracing/rtla/src/timerlat_top.c
@@ -900,9 +900,12 @@ static struct osnoise_tool
}
static int stop_tracing;
+static struct trace_instance *top_inst = NULL;
static void stop_top(int sig)
{
stop_tracing = 1;
+ if (top_inst)
+ trace_instance_stop(top_inst);
}
/*
@@ -950,6 +953,13 @@ int timerlat_top_main(int argc, char *argv[])
}
trace = &top->trace;
+ /*
+ * Save trace instance into global variable so that SIGINT can stop
+ * the timerlat tracer.
+ * Otherwise, rtla could loop indefinitely when overloaded.
+ */
+ top_inst = trace;
+
retval = enable_timerlat(trace);
if (retval) {
@@ -1131,7 +1141,7 @@ int timerlat_top_main(int argc, char *argv[])
return_value = 0;
- if (trace_is_off(&top->trace, &record->trace)) {
+ if (trace_is_off(&top->trace, &record->trace) && !stop_tracing) {
printf("rtla timerlat hit stop tracing\n");
if (!params->no_aa)
--
2.45.2
From: Tomas Glozar <tglozar(a)redhat.com>
Currently, when either SIGINT from the user or SIGALRM from the duration
timer is caught by rtla-timerlat, stop_tracing is set to break out of
the main loop. This is not sufficient for cases where the timerlat
tracer is producing more data than rtla can consume, since in that case,
rtla is looping indefinitely inside tracefs_iterate_raw_events, never
reaches the check of stop_tracing and hangs.
In addition to setting stop_tracing, also stop the timerlat tracer on
received signal (SIGINT or SIGALRM). This will stop new samples so that
the existing samples may be processed and tracefs_iterate_raw_events
eventually exits.
Cc: stable(a)vger.kernel.org
Cc: John Kacur <jkacur(a)redhat.com>
Cc: Luis Goncalves <lgoncalv(a)redhat.com>
Cc: Gabriele Monaco <gmonaco(a)redhat.com>
Link: https://lore.kernel.org/20250116144931.649593-3-tglozar@redhat.com
Fixes: 1eeb6328e8b3 ("rtla/timerlat: Add timerlat hist mode")
Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
tools/tracing/rtla/src/timerlat_hist.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 8b66387e5f35..f1edf1c8a7b0 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -1131,9 +1131,12 @@ static struct osnoise_tool
}
static int stop_tracing;
+static struct trace_instance *hist_inst = NULL;
static void stop_hist(int sig)
{
stop_tracing = 1;
+ if (hist_inst)
+ trace_instance_stop(hist_inst);
}
/*
@@ -1180,6 +1183,12 @@ int timerlat_hist_main(int argc, char *argv[])
}
trace = &tool->trace;
+ /*
+ * Save trace instance into global variable so that SIGINT can stop
+ * the timerlat tracer.
+ * Otherwise, rtla could loop indefinitely when overloaded.
+ */
+ hist_inst = trace;
retval = enable_timerlat(trace);
if (retval) {
@@ -1348,7 +1357,7 @@ int timerlat_hist_main(int argc, char *argv[])
return_value = 0;
- if (trace_is_off(&tool->trace, &record->trace)) {
+ if (trace_is_off(&tool->trace, &record->trace) && !stop_tracing) {
printf("rtla timerlat hit stop tracing\n");
if (!params->no_aa)
--
2.45.2
In the US country code, to avoid including 6 GHz rules in the 5 GHz rules
list, the number of 5 GHz rules is set to a default constant value of 4
(REG_US_5G_NUM_REG_RULES). However, if there are more than 4 valid 5 GHz
rules, the current logic will bypass the legitimate 6 GHz rules.
For example, if there are 5 valid 5 GHz rules and 1 valid 6 GHz rule, the
current logic will only consider 4 of the 5 GHz rules, treating the last
valid rule as a 6 GHz rule. Consequently, the actual 6 GHz rule is never
processed, leading to the eventual disabling of 6 GHz channels.
To fix this issue, instead of hardcoding the value to 4, use a helper
function to determine the number of 6 GHz rules present in the 5 GHz rules
list and ignore only those rules.
Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
Cc: stable(a)vger.kernel.org
Fixes: d889913205cf ("wifi: ath12k: driver for Qualcomm Wi-Fi 7 devices")
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh(a)oss.qualcomm.com>
---
drivers/net/wireless/ath/ath12k/wmi.c | 61 ++++++++++++++++++++++++++---------
drivers/net/wireless/ath/ath12k/wmi.h | 1 -
2 files changed, 45 insertions(+), 17 deletions(-)
diff --git a/drivers/net/wireless/ath/ath12k/wmi.c b/drivers/net/wireless/ath/ath12k/wmi.c
index a7625a3d1a3a9b1bbcf90cee30d3c707de03c439..296238c7ee0fd95045a3e0859d730e3bd73ec906 100644
--- a/drivers/net/wireless/ath/ath12k/wmi.c
+++ b/drivers/net/wireless/ath/ath12k/wmi.c
@@ -4852,6 +4852,22 @@ static struct ath12k_reg_rule
return reg_rule_ptr;
}
+static u8 ath12k_wmi_ignore_num_extra_rules(struct ath12k_wmi_reg_rule_ext_params *rule,
+ u32 num_reg_rules)
+{
+ u8 num_invalid_5ghz_rules = 0;
+ u32 count, start_freq;
+
+ for (count = 0; count < num_reg_rules; count++) {
+ start_freq = le32_get_bits(rule[count].freq_info, REG_RULE_START_FREQ);
+
+ if (start_freq >= ATH12K_MIN_6G_FREQ)
+ num_invalid_5ghz_rules++;
+ }
+
+ return num_invalid_5ghz_rules;
+}
+
static int ath12k_pull_reg_chan_list_ext_update_ev(struct ath12k_base *ab,
struct sk_buff *skb,
struct ath12k_reg_info *reg_info)
@@ -4862,6 +4878,7 @@ static int ath12k_pull_reg_chan_list_ext_update_ev(struct ath12k_base *ab,
u32 num_2g_reg_rules, num_5g_reg_rules;
u32 num_6g_reg_rules_ap[WMI_REG_CURRENT_MAX_AP_TYPE];
u32 num_6g_reg_rules_cl[WMI_REG_CURRENT_MAX_AP_TYPE][WMI_REG_MAX_CLIENT_TYPE];
+ u8 num_invalid_5ghz_ext_rules;
u32 total_reg_rules = 0;
int ret, i, j;
@@ -4955,20 +4972,6 @@ static int ath12k_pull_reg_chan_list_ext_update_ev(struct ath12k_base *ab,
memcpy(reg_info->alpha2, &ev->alpha2, REG_ALPHA2_LEN);
- /* FIXME: Currently FW includes 6G reg rule also in 5G rule
- * list for country US.
- * Having same 6G reg rule in 5G and 6G rules list causes
- * intersect check to be true, and same rules will be shown
- * multiple times in iw cmd. So added hack below to avoid
- * parsing 6G rule from 5G reg rule list, and this can be
- * removed later, after FW updates to remove 6G reg rule
- * from 5G rules list.
- */
- if (memcmp(reg_info->alpha2, "US", 2) == 0) {
- reg_info->num_5g_reg_rules = REG_US_5G_NUM_REG_RULES;
- num_5g_reg_rules = reg_info->num_5g_reg_rules;
- }
-
reg_info->dfs_region = le32_to_cpu(ev->dfs_region);
reg_info->phybitmap = le32_to_cpu(ev->phybitmap);
reg_info->num_phy = le32_to_cpu(ev->num_phy);
@@ -5071,8 +5074,29 @@ static int ath12k_pull_reg_chan_list_ext_update_ev(struct ath12k_base *ab,
}
}
+ ext_wmi_reg_rule += num_2g_reg_rules;
+
+ /* Firmware might include 6 GHz reg rule in 5 GHz rule list
+ * for few countries along with separate 6 GHz rule.
+ * Having same 6 GHz reg rule in 5 GHz and 6 GHz rules list
+ * causes intersect check to be true, and same rules will be
+ * shown multiple times in iw cmd.
+ * Hence, avoid parsing 6 GHz rule from 5 GHz reg rule list
+ */
+ num_invalid_5ghz_ext_rules = ath12k_wmi_ignore_num_extra_rules(ext_wmi_reg_rule,
+ num_5g_reg_rules);
+
+ if (num_invalid_5ghz_ext_rules) {
+ ath12k_dbg(ab, ATH12K_DBG_WMI,
+ "CC: %s 5 GHz reg rules number %d from fw, %d number of invalid 5 GHz rules",
+ reg_info->alpha2, reg_info->num_5g_reg_rules,
+ num_invalid_5ghz_ext_rules);
+
+ num_5g_reg_rules = num_5g_reg_rules - num_invalid_5ghz_ext_rules;
+ reg_info->num_5g_reg_rules = num_5g_reg_rules;
+ }
+
if (num_5g_reg_rules) {
- ext_wmi_reg_rule += num_2g_reg_rules;
reg_info->reg_rules_5g_ptr =
create_ext_reg_rules_from_wmi(num_5g_reg_rules,
ext_wmi_reg_rule);
@@ -5084,7 +5108,12 @@ static int ath12k_pull_reg_chan_list_ext_update_ev(struct ath12k_base *ab,
}
}
- ext_wmi_reg_rule += num_5g_reg_rules;
+ /* We have adjusted the number of 5 GHz reg rules above. But still those
+ * many rules needs to be adjusted in ext_wmi_reg_rule.
+ *
+ * NOTE: num_invalid_5ghz_ext_rules will be 0 for rest other cases.
+ */
+ ext_wmi_reg_rule += (num_5g_reg_rules + num_invalid_5ghz_ext_rules);
for (i = 0; i < WMI_REG_CURRENT_MAX_AP_TYPE; i++) {
reg_info->reg_rules_6g_ap_ptr[i] =
diff --git a/drivers/net/wireless/ath/ath12k/wmi.h b/drivers/net/wireless/ath/ath12k/wmi.h
index 38aed18b99f4239e40d369181549b4bc78faa862..81248253a36872f6fd93db772b5394f3ef9d3bf9 100644
--- a/drivers/net/wireless/ath/ath12k/wmi.h
+++ b/drivers/net/wireless/ath/ath12k/wmi.h
@@ -4093,7 +4093,6 @@ struct ath12k_wmi_eht_rate_set_params {
#define MAX_REG_RULES 10
#define REG_ALPHA2_LEN 2
#define MAX_6G_REG_RULES 5
-#define REG_US_5G_NUM_REG_RULES 4
enum wmi_start_event_param {
WMI_VDEV_START_RESP_EVENT = 0,
---
base-commit: c4e4e37afc8b8d178b0f00f607c30b3b81253597
change-id: 20250123-fix_6ghz_rules_handling-ff85f1efb30b
This change is specific to Hyper-V VM user.
If the Virtual Machine Connection window is focused,
a Hyper-V VM user can unintentionally touch the keyboard/mouse
when the VM is hibernating or resuming, and consequently the
hibernation or resume operation can be aborted unexpectedly.
Fix the issue by no longer registering the keyboard/mouse as
wakeup devices (see the other two patches for the
changes to drivers/input/serio/hyperv-keyboard.c and
drivers/hid/hid-hyperv.c).
The keyboard/mouse were registered as wakeup devices because the
VM needs to be woken up from the Suspend-to-Idle state after
a user runs "echo freeze > /sys/power/state". It seems like
the Suspend-to-Idle feature has no real users in practice, so
let's no longer support that by returning -EOPNOTSUPP if a
user tries to use that.
Fixes: 1a06d017fb3f ("Drivers: hv: vmbus: Fix Suspend-to-Idle for Generation-2 VM")
Cc: stable(a)vger.kernel.org
Signed-off-by: Saurabh Sengar <ssengar(a)linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis(a)linux.microsoft.com>>
---
Changes in v4:
* No change
Changes in v3:
* Add "Cc: stable(a)vger.kernel.org" in sign-off area.
Changes in v2:
* Add "#define vmbus_freeze NULL" when CONFIG_PM_SLEEP is not
enabled.
---
drivers/hv/vmbus_drv.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 6d89d37b069a..4df6b12bf6a1 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -900,6 +900,19 @@ static void vmbus_shutdown(struct device *child_device)
}
#ifdef CONFIG_PM_SLEEP
+/*
+ * vmbus_freeze - Suspend-to-Idle
+ */
+static int vmbus_freeze(struct device *child_device)
+{
+/*
+ * Do not support Suspend-to-Idle ("echo freeze > /sys/power/state") as
+ * that would require registering the Hyper-V synthetic mouse/keyboard
+ * devices as wakeup devices, which can abort hibernation/resume unexpectedly.
+ */
+ return -EOPNOTSUPP;
+}
+
/*
* vmbus_suspend - Suspend a vmbus device
*/
@@ -938,6 +951,7 @@ static int vmbus_resume(struct device *child_device)
return drv->resume(dev);
}
#else
+#define vmbus_freeze NULL
#define vmbus_suspend NULL
#define vmbus_resume NULL
#endif /* CONFIG_PM_SLEEP */
@@ -969,7 +983,7 @@ static void vmbus_device_release(struct device *device)
*/
static const struct dev_pm_ops vmbus_pm = {
- .suspend_noirq = NULL,
+ .suspend_noirq = vmbus_freeze,
.resume_noirq = NULL,
.freeze_noirq = vmbus_suspend,
.thaw_noirq = vmbus_resume,
--
2.34.1
Hi,
This series has two main parts: 1) add error handling all around, and 2)
update the drivers according to latest (mostly non-public) information
from TI.
The "Update UB9702 init sequences" patch basically rewrites the init
sequence from scratch, and to make that patch easier to read, the
previous patch first removes the current init sequence. In the final
version these two patches need to be squashed together.
Tomi
Signed-off-by: Tomi Valkeinen <tomi.valkeinen(a)ideasonboard.com>
---
Changes in v2:
- Add new patch "media: i2c: ds90ub913: Fix returned fmt from .set_fmt()"
- Reformat the reg write in "Speed-up I2C watchdog timer"
- Add 'it' parameter to rxport for_each macros
- Move 'enable_sscg' module parameter to DT
- Change serializer 'i2c-addr' property to serializer 'reg' property
- Split ub953 registers to a header file
- Link to v1: https://lore.kernel.org/r/20250110-ub9xx-improvements-v1-0-e0b9a1f644da@ide…
---
Jai Luthra (6):
media: i2c: ds90ub953: Speed-up I2C watchdog timer
media: dt-bindings: ti,ds90ub960: Add ti,enable-sscg property
media: dt-bindings: ti,ds90ub960: Allow setting serializer address
media: i2c: ds90ub960: Enable SSCG for UB9702
media: i2c: ds90ub960: Configure serializer using back-channel
media: i2c: ds90ub9xx: Set serializer temperature ramp
Tomi Valkeinen (16):
media: i2c: ds90ub953: Fix error prints
media: i2c: ds90ub913: Fix returned fmt from .set_fmt()
media: i2c: ds90ub913: Align ub913_read() with other similar functions
media: i2c: ds90ub9xx: Add err parameter to read/write funcs
media: i2c: ds90ub960: Add error handling to multiple places
media: i2c: ds90ub953: Add error handling to ub953_log_status()
media: i2c: ds90ub913: Add error handling to ub913_log_status()
media: i2c: ds90ub960: Move UB9702 registers to a separate section
media: i2c: ds90ub960: Add UB9702 specific registers
media: i2c: ds90ub960: Split ub960_init_tx_ports()
media: i2c: ds90ub960: Refresh ub960_init_tx_ports_ub9702()
media: i2c: ds90ub960: Add RX port iteration support
media: i2c: ds90ub960: Move all RX port init code into ub960_init_rx_ports()
media: i2c: ds90ub960: Remove old ub9702 RX port init code (SQUASH)
media: i2c: ds90ub960: Update UB9702 init sequences
media: i2c: ds90ub953: Move reg defines to a header file
.../bindings/media/i2c/ti,ds90ub953.yaml | 77 +-
.../bindings/media/i2c/ti,ds90ub960.yaml | 21 +-
drivers/media/i2c/ds90ub913.c | 82 +-
drivers/media/i2c/ds90ub953.c | 242 +--
drivers/media/i2c/ds90ub953.h | 104 +
drivers/media/i2c/ds90ub960.c | 2264 +++++++++++++++-----
6 files changed, 2070 insertions(+), 720 deletions(-)
---
base-commit: c4b7779abc6633677e6edb79e2809f4f61fde157
change-id: 20250110-ub9xx-improvements-9172b44eb0bc
Best regards,
--
Tomi Valkeinen <tomi.valkeinen(a)ideasonboard.com>
When device_register(&child->dev) failed, calling put_device() to
explicitly release child->dev. Otherwise, it could cause double free
problem.
device_register() includes device_add(). As comment of device_add()
says, 'if device_add() succeeds, you should call device_del() when you
want to get rid of it. If device_add() has not succeeded, use only
put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 4f535093cf8f ("PCI: Put pci_dev in device tree as early as possible")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- added the bug description about the comment of device_add();
- fixed the patch as suggestions;
- added Cc and Fixes table.
---
drivers/pci/probe.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 2e81ab0f5a25..ae12f92c6a9d 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1174,7 +1174,10 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
add_dev:
pci_set_bus_msi_domain(child);
ret = device_register(&child->dev);
- WARN_ON(ret < 0);
+ if (WARN_ON(ret < 0)) {
+ put_device(&child->dev);
+ return NULL;
+ }
pcibios_add_bus(child);
--
2.25.1
devm_blk_crypto_profile_init() registers a cleanup handler to run when
the associated (platform-) device is being released. For UFS, the
crypto private data and pointers are stored as part of the ufs_hba's
data structure 'struct ufs_hba::crypto_profile'. This structure is
allocated as part of the underlying ufshd allocation.
During driver release or during error handling in ufshcd_pltfrm_init(),
this structure is released as part of ufshcd_dealloc_host() before the
(platform-) device associated with the crypto call above is released.
Once this device is released, the crypto cleanup code will run, using
the just-released 'struct ufs_hba::crypto_profile'. This causes a
use-after-free situation:
exynos-ufshc 14700000.ufs: ufshcd_pltfrm_init() failed -11
exynos-ufshc 14700000.ufs: probe with driver exynos-ufshc failed with error -11
Unable to handle kernel paging request at virtual address 01adafad6dadad88
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[01adafad6dadad88] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.13.0-rc5-next-20250106+ #70
Tainted: [W]=WARN
Hardware name: Oriole (DT)
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kfree+0x60/0x2d8
lr : kvfree+0x44/0x60
sp : ffff80008009ba80
x29: ffff80008009ba90 x28: 0000000000000000 x27: ffffbcc6591e0130
x26: ffffbcc659309960 x25: ffffbcc658f89c50 x24: ffffbcc659539d80
x23: ffff22e000940040 x22: ffff22e001539010 x21: ffffbcc65714b22c
x20: 6b6b6b6b6b6b6b6b x19: 01adafad6dadad80 x18: 0000000000000000
x17: ffffbcc6579fbac8 x16: ffffbcc657a04300 x15: ffffbcc657a027f4
x14: ffffbcc656f969cc x13: ffffbcc6579fdc80 x12: ffffbcc6579fb194
x11: ffffbcc6579fbc34 x10: 0000000000000000 x9 : ffffbcc65714b22c
x8 : ffff80008009b880 x7 : 0000000000000000 x6 : ffff80008009b940
x5 : ffff80008009b8c0 x4 : ffff22e000940518 x3 : ffff22e006f54f40
x2 : ffffbcc657a02268 x1 : ffff80007fffffff x0 : ffffc1ffc0000000
Call trace:
kfree+0x60/0x2d8 (P)
kvfree+0x44/0x60
blk_crypto_profile_destroy_callback+0x28/0x70
devm_action_release+0x1c/0x30
release_nodes+0x6c/0x108
devres_release_all+0x98/0x100
device_unbind_cleanup+0x20/0x70
really_probe+0x218/0x2d0
In other words, the initialisation code flow is:
platform-device probe
ufshcd_pltfrm_init()
ufshcd_alloc_host()
scsi_host_alloc()
allocation of struct ufs_hba
creation of scsi-host devices
devm_blk_crypto_profile_init()
devm registration of cleanup handler using platform-device
and during error handling of ufshcd_pltfrm_init() or during driver
removal:
ufshcd_dealloc_host()
scsi_host_put()
put_device(scsi-host)
release of struct ufs_hba
put_device(platform-device)
crypto cleanup handler
To fix this use-after free, change ufshcd_alloc_host() to register a
devres action to automatically cleanup the underlying SCSI device on
ufshcd destruction, without requiring explicit calls to
ufshcd_dealloc_host(). This way:
* the crypto profile and all other ufs_hba-owned resources are
destroyed before SCSI (as they've been registered after)
* a memleak is plugged in tc-dwc-g210-pci.c remove() as a
side-effect
* EXPORT_SYMBOL_GPL(ufshcd_dealloc_host) can be removed fully as
it's not needed anymore
* no future drivers using ufshcd_alloc_host() could ever forget
adding the cleanup
Fixes: cb77cb5abe1f ("blk-crypto: rename blk_keyslot_manager to blk_crypto_profile")
Fixes: d76d9d7d1009 ("scsi: ufs: use devm_blk_ksm_init()")
Cc: stable(a)vger.kernel.org
Signed-off-by: André Draszik <andre.draszik(a)linaro.org>
---
Changes in v3:
- rename devres action handler to ufshcd_devres_release() (Bart)
- Link to v2: https://lore.kernel.org/r/20250114-ufshcd-fix-v2-1-2dc627590a4a@linaro.org
Changes in v2:
- completely new approach using devres action for Scsi_host cleanup, to
ensure ordering
- add Fixes: and CC: stable tags (Eric)
- Link to v1: https://lore.kernel.org/r/20250113-ufshcd-fix-v1-1-ca63d1d4bd55@linaro.org
---
In my case, as per above trace I initially encountered an error in
ufshcd_verify_dev_init(), which made me notice this problem both during
error handling and release. For reproducing, it'd be possible to change
that function to just return an error, or rmmod the platform glue
driver.
Other approaches for solving this issue I see are the following, but I
believe this one here is the cleanest:
* turn 'struct ufs_hba::crypto_profile' into a dynamically allocated
pointer, in which case it doesn't matter if cleanup runs after
scsi_host_put()
* add an explicit devm_blk_crypto_profile_deinit() to be called by API
users when necessary, e.g. before ufshcd_dealloc_host() in this case
* register the crypto cleanup handler against the scsi-host device
instead, like in v1 of this patch
---
drivers/ufs/core/ufshcd.c | 27 +++++++++++++++++----------
drivers/ufs/host/ufshcd-pci.c | 2 --
drivers/ufs/host/ufshcd-pltfrm.c | 11 ++++-------
include/ufs/ufshcd.h | 1 -
4 files changed, 21 insertions(+), 20 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 43ddae7318cb..8351795296bb 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -10279,16 +10279,6 @@ int ufshcd_system_thaw(struct device *dev)
EXPORT_SYMBOL_GPL(ufshcd_system_thaw);
#endif /* CONFIG_PM_SLEEP */
-/**
- * ufshcd_dealloc_host - deallocate Host Bus Adapter (HBA)
- * @hba: pointer to Host Bus Adapter (HBA)
- */
-void ufshcd_dealloc_host(struct ufs_hba *hba)
-{
- scsi_host_put(hba->host);
-}
-EXPORT_SYMBOL_GPL(ufshcd_dealloc_host);
-
/**
* ufshcd_set_dma_mask - Set dma mask based on the controller
* addressing capability
@@ -10307,6 +10297,16 @@ static int ufshcd_set_dma_mask(struct ufs_hba *hba)
return dma_set_mask_and_coherent(hba->dev, DMA_BIT_MASK(32));
}
+/**
+ * ufshcd_devres_release - devres cleanup handler, invoked during release of
+ * hba->dev
+ * @host: pointer to SCSI host
+ */
+static void ufshcd_devres_release(void *host)
+{
+ scsi_host_put(host);
+}
+
/**
* ufshcd_alloc_host - allocate Host Bus Adapter (HBA)
* @dev: pointer to device handle
@@ -10334,6 +10334,13 @@ int ufshcd_alloc_host(struct device *dev, struct ufs_hba **hba_handle)
err = -ENOMEM;
goto out_error;
}
+
+ err = devm_add_action_or_reset(dev, ufshcd_devres_release,
+ host);
+ if (err)
+ return dev_err_probe(dev, err,
+ "failed to add ufshcd dealloc action\n");
+
host->nr_maps = HCTX_TYPE_POLL + 1;
hba = shost_priv(host);
hba->host = host;
diff --git a/drivers/ufs/host/ufshcd-pci.c b/drivers/ufs/host/ufshcd-pci.c
index ea39c5d5b8cf..9cfcaad23cf9 100644
--- a/drivers/ufs/host/ufshcd-pci.c
+++ b/drivers/ufs/host/ufshcd-pci.c
@@ -562,7 +562,6 @@ static void ufshcd_pci_remove(struct pci_dev *pdev)
pm_runtime_forbid(&pdev->dev);
pm_runtime_get_noresume(&pdev->dev);
ufshcd_remove(hba);
- ufshcd_dealloc_host(hba);
}
/**
@@ -605,7 +604,6 @@ ufshcd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
err = ufshcd_init(hba, mmio_base, pdev->irq);
if (err) {
dev_err(&pdev->dev, "Initialization failed\n");
- ufshcd_dealloc_host(hba);
return err;
}
diff --git a/drivers/ufs/host/ufshcd-pltfrm.c b/drivers/ufs/host/ufshcd-pltfrm.c
index 505572d4fa87..adb0a65d9df5 100644
--- a/drivers/ufs/host/ufshcd-pltfrm.c
+++ b/drivers/ufs/host/ufshcd-pltfrm.c
@@ -488,13 +488,13 @@ int ufshcd_pltfrm_init(struct platform_device *pdev,
if (err) {
dev_err(dev, "%s: clock parse failed %d\n",
__func__, err);
- goto dealloc_host;
+ goto out;
}
err = ufshcd_parse_regulator_info(hba);
if (err) {
dev_err(dev, "%s: regulator init failed %d\n",
__func__, err);
- goto dealloc_host;
+ goto out;
}
ufshcd_init_lanes_per_dir(hba);
@@ -502,14 +502,14 @@ int ufshcd_pltfrm_init(struct platform_device *pdev,
err = ufshcd_parse_operating_points(hba);
if (err) {
dev_err(dev, "%s: OPP parse failed %d\n", __func__, err);
- goto dealloc_host;
+ goto out;
}
err = ufshcd_init(hba, mmio_base, irq);
if (err) {
dev_err_probe(dev, err, "Initialization failed with error %d\n",
err);
- goto dealloc_host;
+ goto out;
}
pm_runtime_set_active(dev);
@@ -517,8 +517,6 @@ int ufshcd_pltfrm_init(struct platform_device *pdev,
return 0;
-dealloc_host:
- ufshcd_dealloc_host(hba);
out:
return err;
}
@@ -534,7 +532,6 @@ void ufshcd_pltfrm_remove(struct platform_device *pdev)
pm_runtime_get_sync(&pdev->dev);
ufshcd_remove(hba);
- ufshcd_dealloc_host(hba);
pm_runtime_disable(&pdev->dev);
pm_runtime_put_noidle(&pdev->dev);
}
diff --git a/include/ufs/ufshcd.h b/include/ufs/ufshcd.h
index da0fa5c65081..58eb6e897827 100644
--- a/include/ufs/ufshcd.h
+++ b/include/ufs/ufshcd.h
@@ -1311,7 +1311,6 @@ static inline void ufshcd_rmwl(struct ufs_hba *hba, u32 mask, u32 val, u32 reg)
void ufshcd_enable_irq(struct ufs_hba *hba);
void ufshcd_disable_irq(struct ufs_hba *hba);
int ufshcd_alloc_host(struct device *, struct ufs_hba **);
-void ufshcd_dealloc_host(struct ufs_hba *);
int ufshcd_hba_enable(struct ufs_hba *hba);
int ufshcd_init(struct ufs_hba *, void __iomem *, unsigned int);
int ufshcd_link_recovery(struct ufs_hba *hba);
---
base-commit: 4e16367cfe0ce395f29d0482b78970cce8e1db73
change-id: 20250113-ufshcd-fix-52409f2d32ff
Best regards,
--
André Draszik <andre.draszik(a)linaro.org>
The patch titled
Subject: mm/compaction: fix UBSAN shift-out-of-bounds warning
has been added to the -mm mm-unstable branch. Its filename is
mm-compaction-fix-ubsan-shift-out-of-bounds-warning.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: mm/compaction: fix UBSAN shift-out-of-bounds warning
Date: Thu, 23 Jan 2025 10:10:29 +0800
syzkaller reported a UBSAN shift-out-of-bounds warning of (1UL << order)
in isolate_freepages_block(). The bogus compound_order can be any value
because it is union with flags. Add back the MAX_PAGE_ORDER check to fix
the warning.
Link: https://lkml.kernel.org/r/20250123021029.2826736-1-liushixin2@huawei.com
Fixes: 3da0272a4c7d ("mm/compaction: correctly return failure with bogus compound_order in strict mode")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Kemeng Shi <shikemeng(a)huaweicloud.com>
Cc: Mattew Wilcox <willy(a)infradead.org> [English fixes]
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Nanyong Sun <sunnanyong(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/compaction.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/compaction.c~mm-compaction-fix-ubsan-shift-out-of-bounds-warning
+++ a/mm/compaction.c
@@ -631,7 +631,8 @@ static unsigned long isolate_freepages_b
if (PageCompound(page)) {
const unsigned int order = compound_order(page);
- if (blockpfn + (1UL << order) <= end_pfn) {
+ if ((order <= MAX_PAGE_ORDER) &&
+ (blockpfn + (1UL << order) <= end_pfn)) {
blockpfn += (1UL << order) - 1;
page += (1UL << order) - 1;
nr_scanned += (1UL << order) - 1;
_
Patches currently in -mm which might be from liushixin2(a)huawei.com are
mm-page_isolation-avoid-call-folio_hstate-without-hugetlb_lock.patch
mm-compaction-fix-ubsan-shift-out-of-bounds-warning.patch
Protect access to fore200e->available_cell_rate with rate_mtx lock to
prevent potential data race.
In this case, since the update depends on a prior read, a data race
could lead to a wrong fore200e.available_cell_rate value.
The field fore200e.available_cell_rate is generally protected by the lock
fore200e.rate_mtx when accessed. In all other read and write cases, this
field is consistently protected by the lock, except for this case and
during initialization.
This potential bug was detected by our experimental static analysis tool,
which analyzes locking APIs and paired functions to identify data races
and atomicity violations.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable(a)vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini(a)gmail.com>
---
v2:
* Added a description of the data race hazard in fore200e_open(), as
suggested by Jakub Kicinski and Simon Horman.
---
drivers/atm/fore200e.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/atm/fore200e.c b/drivers/atm/fore200e.c
index 4fea1149e003..f62e38571440 100644
--- a/drivers/atm/fore200e.c
+++ b/drivers/atm/fore200e.c
@@ -1374,7 +1374,9 @@ fore200e_open(struct atm_vcc *vcc)
vcc->dev_data = NULL;
+ mutex_lock(&fore200e->rate_mtx);
fore200e->available_cell_rate += vcc->qos.txtp.max_pcr;
+ mutex_unlock(&fore200e->rate_mtx);
kfree(fore200e_vcc);
return -EINVAL;
--
2.25.1
Hello,
My name is wioleta. we would like to know if you export to
Poland, as We have active
projects that require most products as seen on your website. If
yes, please kindly keep us informed upon your feedback so we can
send our preferred listing for quote. For further information or
have any questions, please do not hesitate to write us.
Sten Arnlund
Purchase Manager
wioleta.raimer(a)invpolamd.com
a: Vedwalterdige by 2, Holmerskulle, 432 68 Poland.
From: Rand Deeb <rand.sec96(a)gmail.com>
[ Upstream commit 789c17185fb0f39560496c2beab9b57ce1d0cbe7 ]
The ssb_device_uevent() function first attempts to convert the 'dev' pointer
to 'struct ssb_device *'. However, it mistakenly dereferences 'dev' before
performing the NULL check, potentially leading to a NULL pointer
dereference if 'dev' is NULL.
To fix this issue, move the NULL check before dereferencing the 'dev' pointer,
ensuring that the pointer is valid before attempting to use it.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Rand Deeb <rand.sec96(a)gmail.com>
Signed-off-by: Kalle Valo <kvalo(a)kernel.org>
Link: https://msgid.link/20240306123028.164155-1-rand.sec96@gmail.com
Signed-off-by: Imkanmod Khan <imkanmodkhan(a)gmail.com>
---
drivers/ssb/main.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/ssb/main.c b/drivers/ssb/main.c
index d52e91258e98..aae50a5dfb57 100644
--- a/drivers/ssb/main.c
+++ b/drivers/ssb/main.c
@@ -341,11 +341,13 @@ static int ssb_bus_match(struct device *dev, struct device_driver *drv)
static int ssb_device_uevent(struct device *dev, struct kobj_uevent_env *env)
{
- struct ssb_device *ssb_dev = dev_to_ssb_dev(dev);
+ struct ssb_device *ssb_dev;
if (!dev)
return -ENODEV;
+ ssb_dev = dev_to_ssb_dev(dev);
+
return add_uevent_var(env,
"MODALIAS=ssb:v%04Xid%04Xrev%02X",
ssb_dev->id.vendor, ssb_dev->id.coreid,
--
2.25.1
From: Lucas De Marchi <lucas.demarchi(a)intel.com>
Commit 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa
debug tool") partially reverted some changes to workaround breakage
caused to mesa tools. However, in doing so it also broke fetching the
GuC log via debugfs since xe_print_blob_ascii85() simply bails out.
The fix is to avoid the extra newlines: the devcoredump interface is
line-oriented and adding random newlines in the middle breaks it. If a
tool is able to parse it by looking at the data and checking for chars
that are out of the ascii85 space, it can still do so. A format change
that breaks the line-oriented output on devcoredump however needs better
coordination with existing tools.
Reviewed-by: José Roberto de Souza <jose.souza(a)intel.com>
Cc: John Harrison <John.C.Harrison(a)Intel.com>
Cc: Julia Filipchuk <julia.filipchuk(a)intel.com>
Cc: José Roberto de Souza <jose.souza(a)intel.com>
Cc: stable(a)vger.kernel.org
Fixes: 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa debug tool")
Fixes: ec1455ce7e35 ("drm/xe/devcoredump: Add ASCII85 dump helper function")
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 30 +++++++++--------------------
drivers/gpu/drm/xe/xe_devcoredump.h | 2 +-
drivers/gpu/drm/xe/xe_guc_ct.c | 3 ++-
drivers/gpu/drm/xe/xe_guc_log.c | 4 +++-
4 files changed, 15 insertions(+), 24 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 81dc7795c0651..1c86e6456d60f 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -395,42 +395,30 @@ int xe_devcoredump_init(struct xe_device *xe)
/**
* xe_print_blob_ascii85 - print a BLOB to some useful location in ASCII85
*
- * The output is split to multiple lines because some print targets, e.g. dmesg
- * cannot handle arbitrarily long lines. Note also that printing to dmesg in
- * piece-meal fashion is not possible, each separate call to drm_puts() has a
- * line-feed automatically added! Therefore, the entire output line must be
- * constructed in a local buffer first, then printed in one atomic output call.
+ * The output is split to multiple print calls because some print targets, e.g.
+ * dmesg cannot handle arbitrarily long lines. These targets may add newline
+ * between calls.
*
* There is also a scheduler yield call to prevent the 'task has been stuck for
* 120s' kernel hang check feature from firing when printing to a slow target
* such as dmesg over a serial port.
*
- * TODO: Add compression prior to the ASCII85 encoding to shrink huge buffers down.
- *
* @p: the printer object to output to
* @prefix: optional prefix to add to output string
* @blob: the Binary Large OBject to dump out
* @offset: offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)
* @size: the size in bytes of the BLOB, must be a multiple of sizeof(u32)
*/
-void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
+void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix,
const void *blob, size_t offset, size_t size)
{
const u32 *blob32 = (const u32 *)blob;
char buff[ASCII85_BUFSZ], *line_buff;
size_t line_pos = 0;
- /*
- * Splitting blobs across multiple lines is not compatible with the mesa
- * debug decoder tool. Note that even dropping the explicit '\n' below
- * doesn't help because the GuC log is so big some underlying implementation
- * still splits the lines at 512K characters. So just bail completely for
- * the moment.
- */
- return;
-
#define DMESG_MAX_LINE_LEN 800
-#define MIN_SPACE (ASCII85_BUFSZ + 2) /* 85 + "\n\0" */
+ /* Always leave space for the suffix char and the \0 */
+#define MIN_SPACE (ASCII85_BUFSZ + 2) /* 85 + "<suffix>\0" */
if (size & 3)
drm_printf(p, "Size not word aligned: %zu", size);
@@ -462,7 +450,6 @@ void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
line_pos += strlen(line_buff + line_pos);
if ((line_pos + MIN_SPACE) >= DMESG_MAX_LINE_LEN) {
- line_buff[line_pos++] = '\n';
line_buff[line_pos++] = 0;
drm_puts(p, line_buff);
@@ -474,10 +461,11 @@ void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
}
}
+ if (suffix)
+ line_buff[line_pos++] = suffix;
+
if (line_pos) {
- line_buff[line_pos++] = '\n';
line_buff[line_pos++] = 0;
-
drm_puts(p, line_buff);
}
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.h b/drivers/gpu/drm/xe/xe_devcoredump.h
index 6a17e6d601022..5391a80a4d1ba 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.h
+++ b/drivers/gpu/drm/xe/xe_devcoredump.h
@@ -29,7 +29,7 @@ static inline int xe_devcoredump_init(struct xe_device *xe)
}
#endif
-void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
+void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix,
const void *blob, size_t offset, size_t size);
#endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 8b65c5e959cc2..50c8076b51585 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1724,7 +1724,8 @@ void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot,
snapshot->g2h_outstanding);
if (snapshot->ctb)
- xe_print_blob_ascii85(p, "CTB data", snapshot->ctb, 0, snapshot->ctb_size);
+ xe_print_blob_ascii85(p, "CTB data", '\n',
+ snapshot->ctb, 0, snapshot->ctb_size);
} else {
drm_puts(p, "CT disabled\n");
}
diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index 80151ff6a71f8..44482ea919924 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -207,8 +207,10 @@ void xe_guc_log_snapshot_print(struct xe_guc_log_snapshot *snapshot, struct drm_
remain = snapshot->size;
for (i = 0; i < snapshot->num_chunks; i++) {
size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
+ const char *prefix = i ? NULL : "Log data";
+ char suffix = i == snapshot->num_chunks - 1 ? '\n' : 0;
- xe_print_blob_ascii85(p, i ? NULL : "Log data", snapshot->copy[i], 0, size);
+ xe_print_blob_ascii85(p, prefix, suffix, snapshot->copy[i], 0, size);
remain -= size;
}
}
--
2.48.1
Commit 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa
debug tool") partially reverted some changes to workaround breakage
caused to mesa tools. However, in doing so it also broke fetching the
GuC log via debugfs since xe_print_blob_ascii85() simply bails out.
The fix is to avoid the extra newlines: the devcoredump interface is
line-oriented and adding random newlines in the middle breaks it. If a
tool is able to parse it by looking at the data and checking for chars
that are out of the ascii85 space, it can still do so. A format change
that breaks the line-oriented output on devcoredump however needs better
coordination with existing tools.
Cc: John Harrison <John.C.Harrison(a)Intel.com>
Cc: Julia Filipchuk <julia.filipchuk(a)intel.com>
Cc: José Roberto de Souza <jose.souza(a)intel.com>
Cc: stable(a)vger.kernel.org
Fixes: 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa debug tool")
Fixes: ec1455ce7e35 ("drm/xe/devcoredump: Add ASCII85 dump helper function")
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 30 +++++++++--------------------
drivers/gpu/drm/xe/xe_devcoredump.h | 2 +-
drivers/gpu/drm/xe/xe_guc_ct.c | 3 ++-
drivers/gpu/drm/xe/xe_guc_log.c | 4 +++-
4 files changed, 15 insertions(+), 24 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index a7946a76777e7..d9b71bb690860 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -391,42 +391,30 @@ int xe_devcoredump_init(struct xe_device *xe)
/**
* xe_print_blob_ascii85 - print a BLOB to some useful location in ASCII85
*
- * The output is split to multiple lines because some print targets, e.g. dmesg
- * cannot handle arbitrarily long lines. Note also that printing to dmesg in
- * piece-meal fashion is not possible, each separate call to drm_puts() has a
- * line-feed automatically added! Therefore, the entire output line must be
- * constructed in a local buffer first, then printed in one atomic output call.
+ * The output is split to multiple print calls because some print targets, e.g.
+ * dmesg cannot handle arbitrarily long lines. These targets may add newline
+ * between calls.
*
* There is also a scheduler yield call to prevent the 'task has been stuck for
* 120s' kernel hang check feature from firing when printing to a slow target
* such as dmesg over a serial port.
*
- * TODO: Add compression prior to the ASCII85 encoding to shrink huge buffers down.
- *
* @p: the printer object to output to
* @prefix: optional prefix to add to output string
* @blob: the Binary Large OBject to dump out
* @offset: offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)
* @size: the size in bytes of the BLOB, must be a multiple of sizeof(u32)
*/
-void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
+void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix,
const void *blob, size_t offset, size_t size)
{
const u32 *blob32 = (const u32 *)blob;
char buff[ASCII85_BUFSZ], *line_buff;
size_t line_pos = 0;
- /*
- * Splitting blobs across multiple lines is not compatible with the mesa
- * debug decoder tool. Note that even dropping the explicit '\n' below
- * doesn't help because the GuC log is so big some underlying implementation
- * still splits the lines at 512K characters. So just bail completely for
- * the moment.
- */
- return;
-
#define DMESG_MAX_LINE_LEN 800
-#define MIN_SPACE (ASCII85_BUFSZ + 2) /* 85 + "\n\0" */
+ /* Always leave space for the suffix char and the \0 */
+#define MIN_SPACE (ASCII85_BUFSZ + 2) /* 85 + "<suffix>\0" */
if (size & 3)
drm_printf(p, "Size not word aligned: %zu", size);
@@ -458,7 +446,6 @@ void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
line_pos += strlen(line_buff + line_pos);
if ((line_pos + MIN_SPACE) >= DMESG_MAX_LINE_LEN) {
- line_buff[line_pos++] = '\n';
line_buff[line_pos++] = 0;
drm_puts(p, line_buff);
@@ -470,10 +457,11 @@ void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
}
}
+ if (suffix)
+ line_buff[line_pos++] = suffix;
+
if (line_pos) {
- line_buff[line_pos++] = '\n';
line_buff[line_pos++] = 0;
-
drm_puts(p, line_buff);
}
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.h b/drivers/gpu/drm/xe/xe_devcoredump.h
index 6a17e6d601022..5391a80a4d1ba 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.h
+++ b/drivers/gpu/drm/xe/xe_devcoredump.h
@@ -29,7 +29,7 @@ static inline int xe_devcoredump_init(struct xe_device *xe)
}
#endif
-void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
+void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix,
const void *blob, size_t offset, size_t size);
#endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 8b65c5e959cc2..50c8076b51585 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1724,7 +1724,8 @@ void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot,
snapshot->g2h_outstanding);
if (snapshot->ctb)
- xe_print_blob_ascii85(p, "CTB data", snapshot->ctb, 0, snapshot->ctb_size);
+ xe_print_blob_ascii85(p, "CTB data", '\n',
+ snapshot->ctb, 0, snapshot->ctb_size);
} else {
drm_puts(p, "CT disabled\n");
}
diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index 80151ff6a71f8..44482ea919924 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -207,8 +207,10 @@ void xe_guc_log_snapshot_print(struct xe_guc_log_snapshot *snapshot, struct drm_
remain = snapshot->size;
for (i = 0; i < snapshot->num_chunks; i++) {
size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
+ const char *prefix = i ? NULL : "Log data";
+ char suffix = i == snapshot->num_chunks - 1 ? '\n' : 0;
- xe_print_blob_ascii85(p, i ? NULL : "Log data", snapshot->copy[i], 0, size);
+ xe_print_blob_ascii85(p, prefix, suffix, snapshot->copy[i], 0, size);
remain -= size;
}
}
--
2.48.0
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 53dac345395c0d2493cbc2f4c85fe38aef5b63f5
Gitweb: https://git.kernel.org/tip/53dac345395c0d2493cbc2f4c85fe38aef5b63f5
Author: Frederic Weisbecker <frederic(a)kernel.org>
AuthorDate: Sat, 18 Jan 2025 00:24:33 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Thu, 23 Jan 2025 20:06:35 +01:00
hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
hrtimers are migrated away from the dying CPU to any online target at
the CPUHP_AP_HRTIMERS_DYING stage in order not to delay bandwidth timers
handling tasks involved in the CPU hotplug forward progress.
However wakeups can still be performed by the outgoing CPU after
CPUHP_AP_HRTIMERS_DYING. Those can result again in bandwidth timers being
armed. Depending on several considerations (crystal ball power management
based election, earliest timer already enqueued, timer migration enabled or
not), the target may eventually be the current CPU even if offline. If that
happens, the timer is eventually ignored.
The most notable example is RCU which had to deal with each and every of
those wake-ups by deferring them to an online CPU, along with related
workarounds:
_ e787644caf76 (rcu: Defer RCU kthreads wakeup when CPU is dying)
_ 9139f93209d1 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
_ f7345ccc62a4 (rcu/nocb: Fix rcuog wake-up from offline softirq)
The problem isn't confined to RCU though as the stop machine kthread
(which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the end
of its work through cpu_stop_signal_done() and performs a wake up that
eventually arms the deadline server timer:
WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
Call Trace:
<TASK>
start_dl_timer
enqueue_dl_entity
dl_server_start
enqueue_task_fair
enqueue_task
ttwu_do_activate
try_to_wake_up
complete
cpu_stopper_thread
Instead of providing yet another bandaid to work around the situation, fix
it in the hrtimers infrastructure instead: always migrate away a timer to
an online target whenever it is enqueued from an offline CPU.
This will also allow to revert all the above RCU disgraceful hacks.
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Vlad Poenaru <vlad.wing(a)gmail.com>
Reported-by: Usama Arif <usamaarif642(a)gmail.com>
Signed-off-by: Frederic Weisbecker <frederic(a)kernel.org>
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Tested-by: Paul E. McKenney <paulmck(a)kernel.org>
Link: https://lore.kernel.org/all/20250117232433.24027-1-frederic@kernel.org
Closes: 20241213203739.1519801-1-usamaarif642(a)gmail.com
---
include/linux/hrtimer_defs.h | 1 +-
kernel/time/hrtimer.c | 103 +++++++++++++++++++++++++++-------
2 files changed, 83 insertions(+), 21 deletions(-)
diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index c3b4b7e..84a5045 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -125,6 +125,7 @@ struct hrtimer_cpu_base {
ktime_t softirq_expires_next;
struct hrtimer *softirq_next_timer;
struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
+ call_single_data_t csd;
} ____cacheline_aligned;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 4fb81f8..deb1aa3 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -58,6 +58,8 @@
#define HRTIMER_ACTIVE_SOFT (HRTIMER_ACTIVE_HARD << MASK_SHIFT)
#define HRTIMER_ACTIVE_ALL (HRTIMER_ACTIVE_SOFT | HRTIMER_ACTIVE_HARD)
+static void retrigger_next_event(void *arg);
+
/*
* The timer bases:
*
@@ -111,7 +113,8 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
.clockid = CLOCK_TAI,
.get_time = &ktime_get_clocktai,
},
- }
+ },
+ .csd = CSD_INIT(retrigger_next_event, NULL)
};
static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
@@ -124,6 +127,14 @@ static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
[CLOCK_TAI] = HRTIMER_BASE_TAI,
};
+static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base *base)
+{
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
+ return true;
+ else
+ return likely(base->online);
+}
+
/*
* Functions and macros which are different for UP/SMP systems are kept in a
* single place
@@ -178,27 +189,54 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
}
/*
- * We do not migrate the timer when it is expiring before the next
- * event on the target cpu. When high resolution is enabled, we cannot
- * reprogram the target cpu hardware and we would cause it to fire
- * late. To keep it simple, we handle the high resolution enabled and
- * disabled case similar.
+ * Check if the elected target is suitable considering its next
+ * event and the hotplug state of the current CPU.
+ *
+ * If the elected target is remote and its next event is after the timer
+ * to queue, then a remote reprogram is necessary. However there is no
+ * guarantee the IPI handling the operation would arrive in time to meet
+ * the high resolution deadline. In this case the local CPU becomes a
+ * preferred target, unless it is offline.
+ *
+ * High and low resolution modes are handled the same way for simplicity.
*
* Called with cpu_base->lock of target cpu held.
*/
-static int
-hrtimer_check_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base)
+static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base,
+ struct hrtimer_cpu_base *new_cpu_base,
+ struct hrtimer_cpu_base *this_cpu_base)
{
ktime_t expires;
+ /*
+ * The local CPU clockevent can be reprogrammed. Also get_target_base()
+ * guarantees it is online.
+ */
+ if (new_cpu_base == this_cpu_base)
+ return true;
+
+ /*
+ * The offline local CPU can't be the default target if the
+ * next remote target event is after this timer. Keep the
+ * elected new base. An IPI will we issued to reprogram
+ * it as a last resort.
+ */
+ if (!hrtimer_base_is_online(this_cpu_base))
+ return true;
+
expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);
- return expires < new_base->cpu_base->expires_next;
+
+ return expires >= new_base->cpu_base->expires_next;
}
-static inline
-struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
- int pinned)
+static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned)
{
+ if (!hrtimer_base_is_online(base)) {
+ int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+
+ return &per_cpu(hrtimer_bases, cpu);
+ }
+
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
if (static_branch_likely(&timers_migration_enabled) && !pinned)
return &per_cpu(hrtimer_bases, get_nohz_timer_target());
@@ -249,8 +287,8 @@ again:
raw_spin_unlock(&base->cpu_base->lock);
raw_spin_lock(&new_base->cpu_base->lock);
- if (new_cpu_base != this_cpu_base &&
- hrtimer_check_target(timer, new_base)) {
+ if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
+ this_cpu_base)) {
raw_spin_unlock(&new_base->cpu_base->lock);
raw_spin_lock(&base->cpu_base->lock);
new_cpu_base = this_cpu_base;
@@ -259,8 +297,7 @@ again:
}
WRITE_ONCE(timer->base, new_base);
} else {
- if (new_cpu_base != this_cpu_base &&
- hrtimer_check_target(timer, new_base)) {
+ if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) {
new_cpu_base = this_cpu_base;
goto again;
}
@@ -706,8 +743,6 @@ static inline int hrtimer_is_hres_enabled(void)
return hrtimer_hres_enabled;
}
-static void retrigger_next_event(void *arg);
-
/*
* Switch to high resolution mode
*/
@@ -1195,6 +1230,7 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
u64 delta_ns, const enum hrtimer_mode mode,
struct hrtimer_clock_base *base)
{
+ struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
struct hrtimer_clock_base *new_base;
bool force_local, first;
@@ -1206,10 +1242,16 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* and enforce reprogramming after it is queued no matter whether
* it is the new first expiring timer again or not.
*/
- force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+ force_local = base->cpu_base == this_cpu_base;
force_local &= base->cpu_base->next_timer == timer;
/*
+ * Don't force local queuing if this enqueue happens on a unplugged
+ * CPU after hrtimer_cpu_dying() has been invoked.
+ */
+ force_local &= this_cpu_base->online;
+
+ /*
* Remove an active timer from the queue. In case it is not queued
* on the current CPU, make sure that remove_hrtimer() updates the
* remote data correctly.
@@ -1238,8 +1280,27 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
}
first = enqueue_hrtimer(timer, new_base, mode);
- if (!force_local)
- return first;
+ if (!force_local) {
+ /*
+ * If the current CPU base is online, then the timer is
+ * never queued on a remote CPU if it would be the first
+ * expiring timer there.
+ */
+ if (hrtimer_base_is_online(this_cpu_base))
+ return first;
+
+ /*
+ * Timer was enqueued remote because the current base is
+ * already offline. If the timer is the first to expire,
+ * kick the remote CPU to reprogram the clock event.
+ */
+ if (first) {
+ struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
+
+ smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
+ }
+ return 0;
+ }
/*
* Timer was forced to stay on the current CPU to avoid
From: Zhihao Cheng <chengzhihao1(a)huawei.com>
commit aec8e6bf839101784f3ef037dcdb9432c3f32343 ("btrfs:
fix use-after-free of block device file in __btrfs_free_extra_devids()")
Mounting btrfs from two images (which have the same one fsid and two
different dev_uuids) in certain executing order may trigger an UAF for
variable 'device->bdev_file' in __btrfs_free_extra_devids(). And
following are the details:
1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs
devices by ioctl(BTRFS_IOC_SCAN_DEV):
/ btrfs_device_1 → loop0
fs_device
\ btrfs_device_2 → loop1
2. mount /dev/loop0 /mnt
btrfs_open_devices
btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0)
btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
btrfs_fill_super
open_ctree
fail: btrfs_close_devices // -ENOMEM
btrfs_close_bdev(btrfs_device_1)
fput(btrfs_device_1->bdev_file)
// btrfs_device_1->bdev_file is freed
btrfs_close_bdev(btrfs_device_2)
fput(btrfs_device_2->bdev_file)
3. mount /dev/loop1 /mnt
btrfs_open_devices
btrfs_get_bdev_and_sb(&bdev_file)
// EIO, btrfs_device_1->bdev_file is not assigned,
// which points to a freed memory area
btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
btrfs_fill_super
open_ctree
btrfs_free_extra_devids
if (btrfs_device_1->bdev_file)
fput(btrfs_device_1->bdev_file) // UAF !
Fix it by setting 'device->bdev_file' as 'NULL' after closing the
btrfs_device in btrfs_close_one_device().
Fixes: CVE-2024-50217
Fixes: 142388194191 ("btrfs: do not background blkdev_put()")
CC: stable(a)vger.kernel.org # 4.19+
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408
Signed-off-by: Zhihao Cheng <chengzhihao1(a)huawei.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
(cherry picked from commit aec8e6bf839101784f3ef037dcdb9432c3f32343)
Signed-off-by: Shubham Pushpkar <spushpka(a)cisco.com>
---
fs/btrfs/volumes.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b9a0b26d08e1..ab2412542ce5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1176,6 +1176,7 @@ static void btrfs_close_one_device(struct btrfs_device *device)
if (device->bdev) {
fs_devices->open_devices--;
device->bdev = NULL;
+ device->bdev_file = NULL;
}
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
btrfs_destroy_dev_zone_info(device);
--
2.35.6
This reverts commit 27ab05e1b7e5c5ec9b4f658e1b2464c0908298a6.
I tried to upgrade a RasPi 3B+ with Waveshare 7inch HDMI LCD
from 6.1.y to 6.6.y but found that the display is broken with
this log message:
[ 17.776315] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_drm_unregister [vc4])
[ 17.784034] platform 3f806000.vec: deferred probe pending
Some tests revealed that while 6.1.y works, 6.2-rc1 is already broken but all
newer kernels as well. And a bisect did lead me to this patch.
I could repair several versions up to 6.13-rc7 by doing
this revert. Newer kernels have just to take care of
commit f702475b839c ("ARM: dts: bcm2835-rpi: Move duplicate firmware-clocks to bcm2835-rpi.dtsi")
but that is straightforward.
Fixes: 27ab05e1b7e5 ("ARM: dts: bcm2835-rpi: Use firmware clocks for display")
Signed-off-by: H. Nikolaus Schaller <hns(a)goldelico.com>
---
arch/arm/boot/dts/bcm2835-rpi-common.dtsi | 17 -----------------
1 file changed, 17 deletions(-)
diff --git a/arch/arm/boot/dts/bcm2835-rpi-common.dtsi b/arch/arm/boot/dts/bcm2835-rpi-common.dtsi
index 4e7b4a592da7c..8a55b6cded592 100644
--- a/arch/arm/boot/dts/bcm2835-rpi-common.dtsi
+++ b/arch/arm/boot/dts/bcm2835-rpi-common.dtsi
@@ -7,23 +7,6 @@
#include <dt-bindings/power/raspberrypi-power.h>
-&firmware {
- firmware_clocks: clocks {
- compatible = "raspberrypi,firmware-clocks";
- #clock-cells = <1>;
- };
-};
-
-&hdmi {
- clocks = <&firmware_clocks 9>,
- <&firmware_clocks 13>;
- clock-names = "pixel", "hdmi";
-};
-
&v3d {
power-domains = <&power RPI_POWER_DOMAIN_V3D>;
};
-
-&vec {
- clocks = <&firmware_clocks 15>;
-};
--
2.47.0
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: b7a110336261147ccb373f4100cc88271c54bd91
Gitweb: https://git.kernel.org/tip/b7a110336261147ccb373f4100cc88271c54bd91
Author: Frederic Weisbecker <frederic(a)kernel.org>
AuthorDate: Sat, 18 Jan 2025 00:24:33 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Thu, 23 Jan 2025 11:47:23 +01:00
hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
hrtimers are migrated away from the dying CPU to any online target at
the CPUHP_AP_HRTIMERS_DYING stage in order not to delay bandwidth timers
handling tasks involved in the CPU hotplug forward progress.
However wakeups can still be performed by the outgoing CPU after
CPUHP_AP_HRTIMERS_DYING. Those can result again in bandwidth timers being
armed. Depending on several considerations (crystal ball power management
based election, earliest timer already enqueued, timer migration enabled or
not), the target may eventually be the current CPU even if offline. If that
happens, the timer is eventually ignored.
The most notable example is RCU which had to deal with each and every of
those wake-ups by deferring them to an online CPU, along with related
workarounds:
_ e787644caf76 (rcu: Defer RCU kthreads wakeup when CPU is dying)
_ 9139f93209d1 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
_ f7345ccc62a4 (rcu/nocb: Fix rcuog wake-up from offline softirq)
The problem isn't confined to RCU though as the stop machine kthread
(which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the end
of its work through cpu_stop_signal_done() and performs a wake up that
eventually arms the deadline server timer:
WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
Call Trace:
<TASK>
start_dl_timer
enqueue_dl_entity
dl_server_start
enqueue_task_fair
enqueue_task
ttwu_do_activate
try_to_wake_up
complete
cpu_stopper_thread
Instead of providing yet another bandaid to work around the situation, fix
it in the hrtimers infrastructure instead: always migrate away a timer to
an online target whenever it is enqueued from an offline CPU.
This will also allow to revert all the above RCU disgraceful hacks.
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Vlad Poenaru <vlad.wing(a)gmail.com>
Reported-by: Usama Arif <usamaarif642(a)gmail.com>
Signed-off-by: Frederic Weisbecker <frederic(a)kernel.org>
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Tested-by: Paul E. McKenney <paulmck(a)kernel.org>
Link: https://lore.kernel.org/all/20250117232433.24027-1-frederic@kernel.org
Closes: 20241213203739.1519801-1-usamaarif642(a)gmail.com
---
include/linux/hrtimer_defs.h | 1 +-
kernel/time/hrtimer.c | 103 +++++++++++++++++++++++++++-------
2 files changed, 83 insertions(+), 21 deletions(-)
diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index c3b4b7e..84a5045 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -125,6 +125,7 @@ struct hrtimer_cpu_base {
ktime_t softirq_expires_next;
struct hrtimer *softirq_next_timer;
struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
+ call_single_data_t csd;
} ____cacheline_aligned;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 14bd09c..0feb38b 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -58,6 +58,8 @@
#define HRTIMER_ACTIVE_SOFT (HRTIMER_ACTIVE_HARD << MASK_SHIFT)
#define HRTIMER_ACTIVE_ALL (HRTIMER_ACTIVE_SOFT | HRTIMER_ACTIVE_HARD)
+static void retrigger_next_event(void *arg);
+
/*
* The timer bases:
*
@@ -111,7 +113,8 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
.clockid = CLOCK_TAI,
.get_time = &ktime_get_clocktai,
},
- }
+ },
+ .csd = CSD_INIT(retrigger_next_event, NULL)
};
static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
@@ -124,6 +127,14 @@ static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
[CLOCK_TAI] = HRTIMER_BASE_TAI,
};
+static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base *base)
+{
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
+ return true;
+ else
+ return likely(base->online);
+}
+
/*
* Functions and macros which are different for UP/SMP systems are kept in a
* single place
@@ -183,27 +194,54 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
}
/*
- * We do not migrate the timer when it is expiring before the next
- * event on the target cpu. When high resolution is enabled, we cannot
- * reprogram the target cpu hardware and we would cause it to fire
- * late. To keep it simple, we handle the high resolution enabled and
- * disabled case similar.
+ * Check if the elected target is suitable considering its next
+ * event and the hotplug state of the current CPU.
+ *
+ * If the elected target is remote and its next event is after the timer
+ * to queue, then a remote reprogram is necessary. However there is no
+ * guarantee the IPI handling the operation would arrive in time to meet
+ * the high resolution deadline. In this case the local CPU becomes a
+ * preferred target, unless it is offline.
+ *
+ * High and low resolution modes are handled the same way for simplicity.
*
* Called with cpu_base->lock of target cpu held.
*/
-static int
-hrtimer_check_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base)
+static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base,
+ struct hrtimer_cpu_base *new_cpu_base,
+ struct hrtimer_cpu_base *this_cpu_base)
{
ktime_t expires;
+ /*
+ * The local CPU clockevent can be reprogrammed. Also get_target_base()
+ * guarantees it is online.
+ */
+ if (new_cpu_base == this_cpu_base)
+ return true;
+
+ /*
+ * The offline local CPU can't be the default target if the
+ * next remote target event is after this timer. Keep the
+ * elected new base. An IPI will we issued to reprogram
+ * it as a last resort.
+ */
+ if (!hrtimer_base_is_online(this_cpu_base))
+ return true;
+
expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);
- return expires < new_base->cpu_base->expires_next;
+
+ return expires >= new_base->cpu_base->expires_next;
}
-static inline
-struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
- int pinned)
+static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned)
{
+ if (!hrtimer_base_is_online(base)) {
+ int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+
+ return &per_cpu(hrtimer_bases, cpu);
+ }
+
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
if (static_branch_likely(&timers_migration_enabled) && !pinned)
return &per_cpu(hrtimer_bases, get_nohz_timer_target());
@@ -254,8 +292,8 @@ again:
raw_spin_unlock(&base->cpu_base->lock);
raw_spin_lock(&new_base->cpu_base->lock);
- if (new_cpu_base != this_cpu_base &&
- hrtimer_check_target(timer, new_base)) {
+ if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
+ this_cpu_base)) {
raw_spin_unlock(&new_base->cpu_base->lock);
raw_spin_lock(&base->cpu_base->lock);
new_cpu_base = this_cpu_base;
@@ -264,8 +302,7 @@ again:
}
WRITE_ONCE(timer->base, new_base);
} else {
- if (new_cpu_base != this_cpu_base &&
- hrtimer_check_target(timer, new_base)) {
+ if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) {
new_cpu_base = this_cpu_base;
goto again;
}
@@ -716,8 +753,6 @@ static inline int hrtimer_is_hres_enabled(void)
return hrtimer_hres_enabled;
}
-static void retrigger_next_event(void *arg);
-
/*
* Switch to high resolution mode
*/
@@ -1205,6 +1240,7 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
u64 delta_ns, const enum hrtimer_mode mode,
struct hrtimer_clock_base *base)
{
+ struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
struct hrtimer_clock_base *new_base;
bool force_local, first;
@@ -1216,10 +1252,16 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* and enforce reprogramming after it is queued no matter whether
* it is the new first expiring timer again or not.
*/
- force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+ force_local = base->cpu_base == this_cpu_base;
force_local &= base->cpu_base->next_timer == timer;
/*
+ * Don't force local queuing if this enqueue happens on a unplugged
+ * CPU after hrtimer_cpu_dying() has been invoked.
+ */
+ force_local &= this_cpu_base->online;
+
+ /*
* Remove an active timer from the queue. In case it is not queued
* on the current CPU, make sure that remove_hrtimer() updates the
* remote data correctly.
@@ -1248,8 +1290,27 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
}
first = enqueue_hrtimer(timer, new_base, mode);
- if (!force_local)
- return first;
+ if (!force_local) {
+ /*
+ * If the current CPU base is online, then the timer is
+ * never queued on a remote CPU if it would be the first
+ * expiring timer there.
+ */
+ if (hrtimer_base_is_online(this_cpu_base))
+ return first;
+
+ /*
+ * Timer was enqueued remote because the current base is
+ * already offline. If the timer is the first to expire,
+ * kick the remote CPU to reprogram the clock event.
+ */
+ if (first) {
+ struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
+
+ smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
+ }
+ return 0;
+ }
/*
* Timer was forced to stay on the current CPU to avoid
The NCSI state machine as it's currently implemented assumes that
transition to the next logical state is performed either explicitly by
calling `schedule_work(&ndp->work)` to re-queue itself or implicitly
after processing the predefined (ndp->pending_req_num) number of
replies. Thus to avoid the configuration FSM from advancing prematurely
and getting out of sync with the process it's essential to not skip
waiting for a reply.
This patch makes the code wait for reception of the Deselect Package
response for the last package probed before proceeding to channel
configuration.
Thanks go to Potin Lai and Cosmo Chou for the initial investigation and
testing.
Fixes: 8e13f70be05e ("net/ncsi: Probe single packages to avoid conflict")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paul Fertser <fercerpav(a)gmail.com>
---
net/ncsi/ncsi-manage.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 5cf55bde366d..bf8e27b84a66 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1373,6 +1373,12 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
nd->state = ncsi_dev_state_probe_package;
break;
case ncsi_dev_state_probe_package:
+ if (ndp->package_probe_id >= 8) {
+ /* Last package probed, finishing */
+ ndp->flags |= NCSI_DEV_PROBED;
+ break;
+ }
+
ndp->pending_req_num = 1;
nca.type = NCSI_PKT_CMD_SP;
@@ -1489,13 +1495,8 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
if (ret)
goto error;
- /* Probe next package */
+ /* Probe next package after receiving response */
ndp->package_probe_id++;
- if (ndp->package_probe_id >= 8) {
- /* Probe finished */
- ndp->flags |= NCSI_DEV_PROBED;
- break;
- }
nd->state = ncsi_dev_state_probe_package;
ndp->active_package = NULL;
break;
--
2.34.1
On Wed, Jan 15, 2025 at 12:03:27PM -0700, Keith Busch wrote:
> On Wed, Jan 15, 2025 at 06:10:05PM +0100, Paolo Bonzini wrote:
> > You can implement something like pthread_once():
>
> ...
>
> > Where to put it I don't know. It doesn't belong in
> > include/linux/once.h. I'm okay with arch/x86/kvm/call_once.h and just
> > pull it with #include "call_once.h".
>
> Thanks for the suggestion, I can work with that. As to where to put it,
> I think the new 'struct once' needs to be a member of struct kvm_arch,
> so I've put it in arch/x86/include/asm/.
>
> Here's the result with that folded in. If this is okay, I'll send a v2,
> and can split out the call_once as a prep patch with your attribution if
> you like.
Has there been any progress here? I'm also affected by the crosvm
regression, and it's been backported to the LTS stable kernel.
(CCing the stable and regressions lists to make sure the regression is
tracked.)
#regzbot introduced: d96c77bd4eeb
From: Nick Child <nnac123(a)linux.ibm.com>
From: Nick Child <nnac123(a)linux.ibm.com>
commit 0983d288caf984de0202c66641577b739caad561 upstream.
Below is a summary of how the driver stores a reference to an skb during
transmit:
tx_buff[free_map[consumer_index]]->skb = new_skb;
free_map[consumer_index] = IBMVNIC_INVALID_MAP;
consumer_index ++;
Where variable data looks like this:
free_map == [4, IBMVNIC_INVALID_MAP, IBMVNIC_INVALID_MAP, 0, 3]
consumer_index^
tx_buff == [skb=null, skb=<ptr>, skb=<ptr>, skb=null, skb=null]
The driver has checks to ensure that free_map[consumer_index] pointed to
a valid index but there was no check to ensure that this index pointed
to an unused/null skb address. So, if, by some chance, our free_map and
tx_buff lists become out of sync then we were previously risking an
skb memory leak. This could then cause tcp congestion control to stop
sending packets, eventually leading to ETIMEDOUT.
Therefore, add a conditional to ensure that the skb address is null. If
not then warn the user (because this is still a bug that should be
patched) and free the old pointer to prevent memleak/tcp problems.
Signed-off-by: Nick Child <nnac123(a)linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
[Denis: minor fix to resolve merge conflict.]
Signed-off-by: Denis Arefev <arefev(a)swemel.ru>
---
Backport fix for CVE-2024-41066
Link: https://nvd.nist.gov/vuln/detail/CVE-2024-41066
---
drivers/net/ethernet/ibm/ibmvnic.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 84da6ccaf339..439796975cbf 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1625,6 +1625,18 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
(tx_pool->consumer_index + 1) % tx_pool->num_buffers;
tx_buff = &tx_pool->tx_buff[index];
+
+ /* Sanity checks on our free map to make sure it points to an index
+ * that is not being occupied by another skb. If skb memory is
+ * not freed then we see congestion control kick in and halt tx.
+ */
+ if (unlikely(tx_buff->skb)) {
+ dev_warn_ratelimited(dev, "TX free map points to untracked skb (%s %d idx=%d)\n",
+ skb_is_gso(skb) ? "tso_pool" : "tx_pool",
+ queue_num, bufidx);
+ dev_kfree_skb_any(tx_buff->skb);
+ }
+
tx_buff->skb = skb;
tx_buff->data_dma[0] = data_dma_addr;
tx_buff->data_len[0] = skb->len;
--
2.43.0
#regzbot introduced v6.6.69..v6.6.70
#regzbot introduced: ad91a2dacbf8c26a446658cdd55e8324dfeff1e7
We hit this regression when updating our guest vm kernel from 6.6.69
to 6.6.70 -- bisecting, this problem was introduced in
ad91a2dacbf8c26a446658cdd55e8324dfeff1e7 -- net: restrict SO_REUSEPORT
to inet sockets
We're getting a timeout when trying to connect to the vsocket in the
guest VM when launching a kata containers 3.10.1 agent which
unsurprisingly ... uses a vsocket to communicate back to the host.
We updated this commit and added an additional sk_is_vsock check and
recompiled and this works correctly for us.
- if (valbool && !sk_is_inet(sk))
+ if (valbool && !(sk_is_inet(sk) || sk_is_vsock(sk)))
My understanding is limited here so I've added Stefano as he is likely
to better understand what makes sense here.
This commit was backported from v6.13 to v6.12.8..6.12.9.
-Simon
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit a3616a3c02722d1edb95acc7fceade242f6553ba ]
In the fpsp040 code when copyin or copyout fails call
force_sigsegv(SIGSEGV) instead of do_exit(SIGSEGV).
This solves a couple of problems. Because do_exit embeds the ptrace
stop PTRACE_EVENT_EXIT a complete stack frame needs to be present for
that to work correctly. There is always the information needed for a
ptrace stop where get_signal is called. So exiting with a signal
solves the ptrace issue.
Further exiting with a signal ensures that all of the threads in a
process are killed not just the thread that malfunctioned. Which
avoids confusing userspace.
To make force_sigsegv(SIGSEGV) work in fpsp040_die modify the code to
save all of the registers and jump to ret_from_exception (which
ultimately calls get_signal) after fpsp040_die returns.
v2: Updated the branches to use gas's pseudo ops that automatically
calculate the best branch instruction to use for the purpose.
v1: https://lkml.kernel.org/r/87a6m8kgtx.fsf_-_@disp2133
Link: https://lkml.kernel.org/r/87tukghjfs.fsf_-_@disp2133
Acked-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Signed-off-by: Finn Thain <fthain(a)linux-m68k.org>
---
arch/m68k/fpsp040/skeleton.S | 3 ++-
arch/m68k/kernel/traps.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/m68k/fpsp040/skeleton.S b/arch/m68k/fpsp040/skeleton.S
index 31a9c634c81e..081922c72daa 100644
--- a/arch/m68k/fpsp040/skeleton.S
+++ b/arch/m68k/fpsp040/skeleton.S
@@ -502,7 +502,8 @@ in_ea:
.section .fixup,"ax"
.even
1:
- jbra fpsp040_die
+ jbsr fpsp040_die
+ jbra .Lnotkern
.section __ex_table,"a"
.align 4
diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c
index 35f706d836c5..c6f18dc5884b 100644
--- a/arch/m68k/kernel/traps.c
+++ b/arch/m68k/kernel/traps.c
@@ -1155,7 +1155,7 @@ asmlinkage void set_esp0(unsigned long ssp)
*/
asmlinkage void fpsp040_die(void)
{
- do_exit(SIGSEGV);
+ force_sigsegv(SIGSEGV);
}
#ifdef CONFIG_M68KFPU_EMU
From: Liam Howlett <liam.howlett(a)oracle.com>
[ Upstream commit f829b4b212a315b912cb23fd10aaf30534bb5ce9 ]
When the superuser flushes the entire cache, the mmap_read_lock() is not
taken, but mmap_read_unlock() is called. Add the missing
mmap_read_lock() call.
Fixes: cd2567b6850b1648 ("m68k: call find_vma with the mmap_sem held in sys_cacheflush()")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Link: https://lore.kernel.org/r/20210407200032.764445-1-Liam.Howlett@Oracle.com
Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
[ mmap_read_lock() open-coded using down_read() as was done prior to v5.8 ]
Signed-off-by: Finn Thain <fthain(a)linux-m68k.org>
---
arch/m68k/kernel/sys_m68k.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/m68k/kernel/sys_m68k.c b/arch/m68k/kernel/sys_m68k.c
index 6363ec83a290..38dcc1a2097d 100644
--- a/arch/m68k/kernel/sys_m68k.c
+++ b/arch/m68k/kernel/sys_m68k.c
@@ -388,6 +388,8 @@ sys_cacheflush (unsigned long addr, int scope, int cache, unsigned long len)
ret = -EPERM;
if (!capable(CAP_SYS_ADMIN))
goto out;
+
+ down_read(¤t->mm->mmap_sem);
} else {
struct vm_area_struct *vma;
From: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
[ Upstream commit d9ccb18f83ea2bb654289b6ecf014fd267cc988b ]
Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.
Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.
Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.
Kernel log:
0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
[exception RIP: fib6_select_path+299]
RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287
RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000
RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618
RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830
R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel(a)aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Ido Schimmel <idosch(a)idosch.org>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Cc: Simon Horman <horms(a)kernel.org>
Reviewed-by: David Ahern <dsahern(a)kernel.org>
Link: https://patch.msgid.link/20241106010236.1239299-1-omid.ehtemamhaghighi@menl…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Rajani Kantha <rajanikantha(a)engineer.com>
---
net/ipv6/ip6_fib.c | 8 +-
net/ipv6/route.c | 45 ++-
tools/testing/selftests/net/Makefile | 1 +
.../net/ipv6_route_update_soft_lockup.sh | 262 ++++++++++++++++++
4 files changed, 297 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 0b45ef8b7ee2..b6a7cbd6bee0 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1180,8 +1180,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
while (sibling) {
if (sibling->fib6_metric == rt->fib6_metric &&
rt6_qualify_for_ecmp(sibling)) {
- list_add_tail(&rt->fib6_siblings,
- &sibling->fib6_siblings);
+ list_add_tail_rcu(&rt->fib6_siblings,
+ &sibling->fib6_siblings);
break;
}
sibling = rcu_dereference_protected(sibling->fib6_next,
@@ -1242,7 +1242,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
return err;
}
@@ -1955,7 +1955,7 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
&rt->fib6_siblings, fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 5ae3ff6ffb7e..f3268bac9f19 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -420,8 +420,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
struct flowi6 *fl6, int oif, bool have_oif_match,
const struct sk_buff *skb, int strict)
{
- struct fib6_info *sibling, *next_sibling;
struct fib6_info *match = res->f6i;
+ struct fib6_info *sibling;
if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
goto out;
@@ -447,8 +447,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
goto out;
- list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings,
- fib6_siblings) {
+ list_for_each_entry_rcu(sibling, &match->fib6_siblings,
+ fib6_siblings) {
const struct fib6_nh *nh = sibling->fib6_nh;
int nh_upper_bound;
@@ -5189,14 +5189,18 @@ static void ip6_route_mpath_notify(struct fib6_info *rt,
* nexthop. Since sibling routes are always added at the end of
* the list, find the first sibling of the last route appended
*/
+ rcu_read_lock();
+
if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) {
- rt = list_first_entry(&rt_last->fib6_siblings,
- struct fib6_info,
- fib6_siblings);
+ rt = list_first_or_null_rcu(&rt_last->fib6_siblings,
+ struct fib6_info,
+ fib6_siblings);
}
if (rt)
inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
+
+ rcu_read_unlock();
}
static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
@@ -5541,17 +5545,21 @@ static size_t rt6_nlmsg_size(struct fib6_info *f6i)
nexthop_for_each_fib6_nh(f6i->nh, rt6_nh_nlmsg_size,
&nexthop_len);
} else {
- struct fib6_info *sibling, *next_sibling;
struct fib6_nh *nh = f6i->fib6_nh;
+ struct fib6_info *sibling;
nexthop_len = 0;
if (f6i->fib6_nsiblings) {
rt6_nh_nlmsg_size(nh, &nexthop_len);
- list_for_each_entry_safe(sibling, next_sibling,
- &f6i->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &f6i->fib6_siblings,
+ fib6_siblings) {
rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len);
}
+
+ rcu_read_unlock();
}
nexthop_len += lwtunnel_get_encap_size(nh->fib_nh_lws);
}
@@ -5715,7 +5723,7 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
lwtunnel_fill_encap(skb, dst->lwtstate, RTA_ENCAP, RTA_ENCAP_TYPE) < 0)
goto nla_put_failure;
} else if (rt->fib6_nsiblings) {
- struct fib6_info *sibling, *next_sibling;
+ struct fib6_info *sibling;
struct nlattr *mp;
mp = nla_nest_start_noflag(skb, RTA_MULTIPATH);
@@ -5727,14 +5735,21 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
0) < 0)
goto nla_put_failure;
- list_for_each_entry_safe(sibling, next_sibling,
- &rt->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &rt->fib6_siblings,
+ fib6_siblings) {
if (fib_add_nexthop(skb, &sibling->fib6_nh->nh_common,
sibling->fib6_nh->fib_nh_weight,
- AF_INET6, 0) < 0)
+ AF_INET6, 0) < 0) {
+ rcu_read_unlock();
+
goto nla_put_failure;
+ }
}
+ rcu_read_unlock();
+
nla_nest_end(skb, mp);
} else if (rt->nh) {
if (nla_put_u32(skb, RTA_NH_ID, rt->nh->id))
@@ -6171,7 +6186,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
err = -ENOBUFS;
seq = info->nlh ? info->nlh->nlmsg_seq : 0;
- skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any());
+ skb = nlmsg_new(rt6_nlmsg_size(rt), GFP_ATOMIC);
if (!skb)
goto errout;
@@ -6184,7 +6199,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
goto errout;
}
rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
- info->nlh, gfp_any());
+ info->nlh, GFP_ATOMIC);
return;
errout:
if (err < 0)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 48d1a68be1d5..080860a8826b 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -72,6 +72,7 @@ TEST_GEN_PROGS += sk_bind_sendto_listen
TEST_GEN_PROGS += sk_connect_zero_addr
TEST_PROGS += test_ingress_egress_chaining.sh
TEST_GEN_FILES += nat6to4.o
+TEST_PROGS += ipv6_route_update_soft_lockup.sh
TEST_FILES := settings
diff --git a/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
new file mode 100755
index 000000000000..a6b2b1f9c641
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
@@ -0,0 +1,262 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing for potential kernel soft lockup during IPv6 routing table
+# refresh under heavy outgoing IPv6 traffic. If a kernel soft lockup
+# occurs, a kernel panic will be triggered to prevent associated issues.
+#
+#
+# Test Environment Layout
+#
+# ┌----------------┐ ┌----------------┐
+# | SOURCE_NS | | SINK_NS |
+# | NAMESPACE | | NAMESPACE |
+# |(iperf3 clients)| |(iperf3 servers)|
+# | | | |
+# | | | |
+# | ┌-----------| nexthops |---------┐ |
+# | |veth_source|<--------------------------------------->|veth_sink|<┐ |
+# | └-----------|2001:0DB8:1::0:1/96 2001:0DB8:1::1:1/96 |---------┘ | |
+# | | ^ 2001:0DB8:1::1:2/96 | | |
+# | | . . | fwd | |
+# | ┌---------┐ | . . | | |
+# | | IPv6 | | . . | V |
+# | | routing | | . 2001:0DB8:1::1:80/96| ┌-----┐ |
+# | | table | | . | | lo | |
+# | | nexthop | | . └--------┴-----┴-┘
+# | | update | | ............................> 2001:0DB8:2::1:1/128
+# | └-------- ┘ |
+# └----------------┘
+#
+# The test script sets up two network namespaces, source_ns and sink_ns,
+# connected via a veth link. Within source_ns, it continuously updates the
+# IPv6 routing table by flushing and inserting IPV6_NEXTHOP_ADDR_COUNT nexthop
+# IPs destined for SINK_LOOPBACK_IP_ADDR in sink_ns. This refresh occurs at a
+# rate of 1/ROUTING_TABLE_REFRESH_PERIOD per second for TEST_DURATION seconds.
+#
+# Simultaneously, multiple iperf3 clients within source_ns generate heavy
+# outgoing IPv6 traffic. Each client is assigned a unique port number starting
+# at 5000 and incrementing sequentially. Each client targets a unique iperf3
+# server running in sink_ns, connected to the SINK_LOOPBACK_IFACE interface
+# using the same port number.
+#
+# The number of iperf3 servers and clients is set to half of the total
+# available cores on each machine.
+#
+# NOTE: We have tested this script on machines with various CPU specifications,
+# ranging from lower to higher performance as listed below. The test script
+# effectively triggered a kernel soft lockup on machines running an unpatched
+# kernel in under a minute:
+#
+# - 1x Intel Xeon E-2278G 8-Core Processor @ 3.40GHz
+# - 1x Intel Xeon E-2378G Processor 8-Core @ 2.80GHz
+# - 1x AMD EPYC 7401P 24-Core Processor @ 2.00GHz
+# - 1x AMD EPYC 7402P 24-Core Processor @ 2.80GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 1x Ampere Altra Q80-30 80-Core Processor @ 3.00GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 2x Intel Xeon Silver 4214 24-Core Processor @ 2.20GHz
+# - 1x AMD EPYC 7502P 32-Core @ 2.50GHz
+# - 1x Intel Xeon Gold 6314U 32-Core Processor @ 2.30GHz
+# - 2x Intel Xeon Gold 6338 32-Core Processor @ 2.00GHz
+#
+# On less performant machines, you may need to increase the TEST_DURATION
+# parameter to enhance the likelihood of encountering a race condition leading
+# to a kernel soft lockup and avoid a false negative result.
+#
+# NOTE: The test may not produce the expected result in virtualized
+# environments (e.g., qemu) due to differences in timing and CPU handling,
+# which can affect the conditions needed to trigger a soft lockup.
+
+source lib.sh
+source net_helper.sh
+
+TEST_DURATION=300
+ROUTING_TABLE_REFRESH_PERIOD=0.01
+
+IPERF3_BITRATE="300m"
+
+
+IPV6_NEXTHOP_ADDR_COUNT="128"
+IPV6_NEXTHOP_ADDR_MASK="96"
+IPV6_NEXTHOP_PREFIX="2001:0DB8:1"
+
+
+SOURCE_TEST_IFACE="veth_source"
+SOURCE_TEST_IP_ADDR="2001:0DB8:1::0:1/96"
+
+SINK_TEST_IFACE="veth_sink"
+# ${SINK_TEST_IFACE} is populated with the following range of IPv6 addresses:
+# 2001:0DB8:1::1:1 to 2001:0DB8:1::1:${IPV6_NEXTHOP_ADDR_COUNT}
+SINK_LOOPBACK_IFACE="lo"
+SINK_LOOPBACK_IP_MASK="128"
+SINK_LOOPBACK_IP_ADDR="2001:0DB8:2::1:1"
+
+nexthop_ip_list=""
+termination_signal=""
+kernel_softlokup_panic_prev_val=""
+
+terminate_ns_processes_by_pattern() {
+ local ns=$1
+ local pattern=$2
+
+ for pid in $(ip netns pids ${ns}); do
+ [ -e /proc/$pid/cmdline ] && grep -qe "${pattern}" /proc/$pid/cmdline && kill -9 $pid
+ done
+}
+
+cleanup() {
+ echo "info: cleaning up namespaces and terminating all processes within them..."
+
+
+ # Terminate iperf3 instances running in the source_ns. To avoid race
+ # conditions, first iterate over the PIDs and terminate those
+ # associated with the bash shells running the
+ # `while true; do iperf3 -c ...; done` loops. In a second iteration,
+ # terminate the individual `iperf3 -c ...` instances.
+ terminate_ns_processes_by_pattern ${source_ns} while
+ terminate_ns_processes_by_pattern ${source_ns} iperf3
+
+ # Repeat the same process for sink_ns
+ terminate_ns_processes_by_pattern ${sink_ns} while
+ terminate_ns_processes_by_pattern ${sink_ns} iperf3
+
+ # Check if any iperf3 instances are still running. This could happen
+ # if a core has entered an infinite loop and the timeout for detecting
+ # the soft lockup has not expired, but either the test interval has
+ # already elapsed or the test was terminated manually (e.g., with ^C)
+ for pid in $(ip netns pids ${source_ns}); do
+ if [ -e /proc/$pid/cmdline ] && grep -qe 'iperf3' /proc/$pid/cmdline; then
+ echo "FAIL: unable to terminate some iperf3 instances. Soft lockup is underway. A kernel panic is on the way!"
+ exit ${ksft_fail}
+ fi
+ done
+
+ if [ "$termination_signal" == "SIGINT" ]; then
+ echo "SKIP: Termination due to ^C (SIGINT)"
+ elif [ "$termination_signal" == "SIGALRM" ]; then
+ echo "PASS: No kernel soft lockup occurred during this ${TEST_DURATION} second test"
+ fi
+
+ cleanup_ns ${source_ns} ${sink_ns}
+
+ sysctl -qw kernel.softlockup_panic=${kernel_softlokup_panic_prev_val}
+}
+
+setup_prepare() {
+ setup_ns source_ns sink_ns
+
+ ip -n ${source_ns} link add name ${SOURCE_TEST_IFACE} type veth peer name ${SINK_TEST_IFACE} netns ${sink_ns}
+
+ # Setting up the Source namespace
+ ip -n ${source_ns} addr add ${SOURCE_TEST_IP_ADDR} dev ${SOURCE_TEST_IFACE}
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} qlen 10000
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} up
+ ip netns exec ${source_ns} sysctl -qw net.ipv6.fib_multipath_hash_policy=1
+
+ # Setting up the Sink namespace
+ ip -n ${sink_ns} addr add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} dev ${SINK_LOOPBACK_IFACE}
+ ip -n ${sink_ns} link set dev ${SINK_LOOPBACK_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_LOOPBACK_IFACE}.forwarding=1
+
+ ip -n ${sink_ns} link set ${SINK_TEST_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_TEST_IFACE}.forwarding=1
+
+
+ # Populate nexthop IPv6 addresses on the test interface in the sink_ns
+ echo "info: populating ${IPV6_NEXTHOP_ADDR_COUNT} IPv6 addresses on the ${SINK_TEST_IFACE} interface ..."
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ ip -n ${sink_ns} addr add ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" "${IP}")/${IPV6_NEXTHOP_ADDR_MASK} dev ${SINK_TEST_IFACE};
+ done
+
+ # Preparing list of nexthops
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ nexthop_ip_list=$nexthop_ip_list" nexthop via ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" $IP) dev ${SOURCE_TEST_IFACE} weight 1"
+ done
+}
+
+
+test_soft_lockup_during_routing_table_refresh() {
+ # Start num_of_iperf_servers iperf3 servers in the sink_ns namespace,
+ # each listening on ports starting at 5001 and incrementing
+ # sequentially. Since iperf3 instances may terminate unexpectedly, a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 servers in the sink_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 --bind ${SINK_LOOPBACK_IP_ADDR} -s -p $(printf '5%03d' ${i}) --rcv-timeout 200 &>/dev/null"
+ ip netns exec ${sink_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ # Wait for the iperf3 servers to be ready
+ for i in $(seq ${num_of_iperf_servers}); do
+ port=$(printf '5%03d' ${i});
+ wait_local_port_listen ${sink_ns} ${port} tcp
+ done
+
+ # Continuously refresh the routing table in the background within
+ # the source_ns namespace
+ ip netns exec ${source_ns} bash -c "
+ while \$(ip netns list | grep -q ${source_ns}); do
+ ip -6 route add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} ${nexthop_ip_list};
+ sleep ${ROUTING_TABLE_REFRESH_PERIOD};
+ ip -6 route delete ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK};
+ done &"
+
+ # Start num_of_iperf_servers iperf3 clients in the source_ns namespace,
+ # each sending TCP traffic on sequential ports starting at 5001.
+ # Since iperf3 instances may terminate unexpectedly (e.g., if the route
+ # to the server is deleted in the background during a route refresh), a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 clients in the source_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 -c ${SINK_LOOPBACK_IP_ADDR} -p $(printf '5%03d' ${i}) --length 64 --bitrate ${IPERF3_BITRATE} -t 0 --connect-timeout 150 &>/dev/null"
+ ip netns exec ${source_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ echo "info: IPv6 routing table is being updated at the rate of $(echo "1/${ROUTING_TABLE_REFRESH_PERIOD}" | bc)/s for ${TEST_DURATION} seconds ..."
+ echo "info: A kernel soft lockup, if detected, results in a kernel panic!"
+
+ wait
+}
+
+# Make sure 'iperf3' is installed, skip the test otherwise
+if [ ! -x "$(command -v "iperf3")" ]; then
+ echo "SKIP: 'iperf3' is not installed. Skipping the test."
+ exit ${ksft_skip}
+fi
+
+# Determine the number of cores on the machine
+num_of_iperf_servers=$(( $(nproc)/2 ))
+
+# Check if we are running on a multi-core machine, skip the test otherwise
+if [ "${num_of_iperf_servers}" -eq 0 ]; then
+ echo "SKIP: This test is not valid on a single core machine!"
+ exit ${ksft_skip}
+fi
+
+# Since the kernel soft lockup we're testing causes at least one core to enter
+# an infinite loop, destabilizing the host and likely affecting subsequent
+# tests, we trigger a kernel panic instead of reporting a failure and
+# continuing
+kernel_softlokup_panic_prev_val=$(sysctl -n kernel.softlockup_panic)
+sysctl -qw kernel.softlockup_panic=1
+
+handle_sigint() {
+ termination_signal="SIGINT"
+ cleanup
+ exit ${ksft_skip}
+}
+
+handle_sigalrm() {
+ termination_signal="SIGALRM"
+ cleanup
+ exit ${ksft_pass}
+}
+
+trap handle_sigint SIGINT
+trap handle_sigalrm SIGALRM
+
+(sleep ${TEST_DURATION} && kill -s SIGALRM $$)&
+
+setup_prepare
+test_soft_lockup_during_routing_table_refresh
--
2.35.3
From: Selvin Xavier <selvin.xavier(a)broadcom.com>
[ Upstream commit 8be3e5b0c96beeefe9d5486b96575d104d3e7d17 ]
Driver waits indefinitely for the fifo occupancy to go below a threshold
as soon as the pacing interrupt is received. This can cause soft lockup on
one of the processors, if the rate of DB is very high.
Add a loop count for FPGA and exit the __wait_for_fifo_occupancy_below_th
if the loop is taking more time. Pacing will be continuing until the
occupancy is below the threshold. This is ensured by the checks in
bnxt_re_pacing_timer_exp and further scheduling the work for pacing based
on the fifo occupancy.
Fixes: 2ad4e6303a6d ("RDMA/bnxt_re: Implement doorbell pacing algorithm")
Link: https://patch.msgid.link/r/1728373302-19530-7-git-send-email-selvin.xavier@…
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil(a)broadcom.com>
Reviewed-by: Chandramohan Akula <chandramohan.akula(a)broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier(a)broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
[ Add the declaration of variable pacing_data to make it work on 6.6.y ]
Signed-off-by: Alva Lan <alvalan9(a)foxmail.com>
---
drivers/infiniband/hw/bnxt_re/main.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index c7e51cc2ea26..082a383c4913 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -485,6 +485,8 @@ static void bnxt_re_set_default_pacing_data(struct bnxt_re_dev *rdev)
static void __wait_for_fifo_occupancy_below_th(struct bnxt_re_dev *rdev)
{
u32 read_val, fifo_occup;
+ struct bnxt_qplib_db_pacing_data *pacing_data = rdev->qplib_res.pacing_data;
+ u32 retry_fifo_check = 1000;
/* loop shouldn't run infintely as the occupancy usually goes
* below pacing algo threshold as soon as pacing kicks in.
@@ -500,6 +502,14 @@ static void __wait_for_fifo_occupancy_below_th(struct bnxt_re_dev *rdev)
if (fifo_occup < rdev->qplib_res.pacing_data->pacing_th)
break;
+ if (!retry_fifo_check--) {
+ dev_info_once(rdev_to_dev(rdev),
+ "%s: fifo_occup = 0x%xfifo_max_depth = 0x%x pacing_th = 0x%x\n",
+ __func__, fifo_occup, pacing_data->fifo_max_depth,
+ pacing_data->pacing_th);
+ break;
+ }
+
}
}
--
2.43.0
From: Al Viro <viro(a)zeniv.linux.org.uk>
[ Upstream commit 50e43a57334400668952f8e551c9d87d3ed2dfef ]
We get there when sigreturn has performed obscene acts on kernel stack;
in particular, the location of pt_regs has shifted. We are about to call
syscall_trace(), which might stop for tracer. If that happens, we'd better
have task_pt_regs() returning correct result...
Fucked-up-by: Al Viro <viro(a)zeniv.linux.org.uk>
Fixes: bd6f56a75bb2 ("m68k: Missing syscall_trace() on sigreturn")
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Tested-by: Michael Schmitz <schmitzmic(a)gmail.com>
Reviewed-by: Michael Schmitz <schmitzmic(a)gmail.com>
Tested-by: Finn Thain <fthain(a)linux-m68k.org>
Link: https://lore.kernel.org/r/YP2dMWeV1LkHiOpr@zeniv-ca.linux.org.uk
Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Signed-off-by: Finn Thain <fthain(a)linux-m68k.org>
---
arch/m68k/kernel/entry.S | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/m68k/kernel/entry.S b/arch/m68k/kernel/entry.S
index 417d8f0e8962..0d03b4f2077b 100644
--- a/arch/m68k/kernel/entry.S
+++ b/arch/m68k/kernel/entry.S
@@ -182,6 +182,8 @@ ENTRY(ret_from_signal)
movel %curptr@(TASK_STACK),%a1
tstb %a1@(TINFO_FLAGS+2)
jge 1f
+ lea %sp@(SWITCH_STACK_SIZE),%a1
+ movel %a1,%curptr@(TASK_THREAD+THREAD_ESP0)
jbsr syscall_trace
1: RESTORE_SWITCH_STACK
addql #4,%sp
From: Esben Haabendal <esben(a)geanix.com>
commit e533e4c62e9993e62e947ae9bbec34e4c7ae81c2 upstream.
By waiting at most 1 second for USR2_TXDC to be set, we avoid a potential
deadlock.
In case of the timeout, there is not much we can do, so we simply ignore
the transmitter state and optimistically try to continue.
Signed-off-by: Esben Haabendal <esben(a)geanix.com>
Acked-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
Link: https://lore.kernel.org/r/919647898c337a46604edcabaf13d42d80c0915d.17128376…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
[Denis: minor fix to resolve merge conflict.]
Signed-off-by: Denis Arefev <arefev(a)swemel.ru>
---
Backport fix for CVE-2024-40967
Link: https://nvd.nist.gov/vuln/detail/CVE-2024-40967
---
drivers/tty/serial/imx.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 6e49928bb864..5abf6685fe3c 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -27,6 +27,7 @@
#include <linux/of.h>
#include <linux/of_device.h>
#include <linux/io.h>
+#include <linux/iopoll.h>
#include <linux/dma-mapping.h>
#include <asm/irq.h>
@@ -2006,8 +2007,8 @@ imx_uart_console_write(struct console *co, const char *s, unsigned int count)
{
struct imx_port *sport = imx_uart_ports[co->index];
struct imx_port_ucrs old_ucr;
- unsigned int ucr1;
- unsigned long flags = 0;
+ unsigned long flags;
+ unsigned int ucr1, usr2;
int locked = 1;
if (sport->port.sysrq)
@@ -2038,8 +2039,8 @@ imx_uart_console_write(struct console *co, const char *s, unsigned int count)
* Finally, wait for transmitter to become empty
* and restore UCR1/2/3
*/
- while (!(imx_uart_readl(sport, USR2) & USR2_TXDC));
-
+ read_poll_timeout_atomic(imx_uart_readl, usr2, usr2 & USR2_TXDC,
+ 0, USEC_PER_SEC, false, sport, USR2);
imx_uart_ucrs_restore(sport, &old_ucr);
if (locked)
--
2.43.0
From: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
[ Upstream commit d9ccb18f83ea2bb654289b6ecf014fd267cc988b ]
Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.
Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.
Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.
Kernel log:
0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
[exception RIP: fib6_select_path+299]
RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287
RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000
RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618
RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830
R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel(a)aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Ido Schimmel <idosch(a)idosch.org>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Cc: Simon Horman <horms(a)kernel.org>
Reviewed-by: David Ahern <dsahern(a)kernel.org>
Link: https://patch.msgid.link/20241106010236.1239299-1-omid.ehtemamhaghighi@menl…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: Rajani Kantha <rajanikantha(a)engineer.com>
---
net/ipv6/ip6_fib.c | 8 +-
net/ipv6/route.c | 45 ++-
tools/testing/selftests/net/Makefile | 1 +
.../net/ipv6_route_update_soft_lockup.sh | 262 ++++++++++++++++++
4 files changed, 297 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index afa9073567dc..023ac39041a2 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1179,8 +1179,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
while (sibling) {
if (sibling->fib6_metric == rt->fib6_metric &&
rt6_qualify_for_ecmp(sibling)) {
- list_add_tail(&rt->fib6_siblings,
- &sibling->fib6_siblings);
+ list_add_tail_rcu(&rt->fib6_siblings,
+ &sibling->fib6_siblings);
break;
}
sibling = rcu_dereference_protected(sibling->fib6_next,
@@ -1241,7 +1241,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
return err;
}
@@ -1954,7 +1954,7 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
&rt->fib6_siblings, fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index fc5c53462025..c5cee40a658b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -418,8 +418,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
struct flowi6 *fl6, int oif, bool have_oif_match,
const struct sk_buff *skb, int strict)
{
- struct fib6_info *sibling, *next_sibling;
struct fib6_info *match = res->f6i;
+ struct fib6_info *sibling;
if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
goto out;
@@ -445,8 +445,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
goto out;
- list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings,
- fib6_siblings) {
+ list_for_each_entry_rcu(sibling, &match->fib6_siblings,
+ fib6_siblings) {
const struct fib6_nh *nh = sibling->fib6_nh;
int nh_upper_bound;
@@ -5186,14 +5186,18 @@ static void ip6_route_mpath_notify(struct fib6_info *rt,
* nexthop. Since sibling routes are always added at the end of
* the list, find the first sibling of the last route appended
*/
+ rcu_read_lock();
+
if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) {
- rt = list_first_entry(&rt_last->fib6_siblings,
- struct fib6_info,
- fib6_siblings);
+ rt = list_first_or_null_rcu(&rt_last->fib6_siblings,
+ struct fib6_info,
+ fib6_siblings);
}
if (rt)
inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
+
+ rcu_read_unlock();
}
static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
@@ -5538,17 +5542,21 @@ static size_t rt6_nlmsg_size(struct fib6_info *f6i)
nexthop_for_each_fib6_nh(f6i->nh, rt6_nh_nlmsg_size,
&nexthop_len);
} else {
- struct fib6_info *sibling, *next_sibling;
struct fib6_nh *nh = f6i->fib6_nh;
+ struct fib6_info *sibling;
nexthop_len = 0;
if (f6i->fib6_nsiblings) {
rt6_nh_nlmsg_size(nh, &nexthop_len);
- list_for_each_entry_safe(sibling, next_sibling,
- &f6i->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &f6i->fib6_siblings,
+ fib6_siblings) {
rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len);
}
+
+ rcu_read_unlock();
}
nexthop_len += lwtunnel_get_encap_size(nh->fib_nh_lws);
}
@@ -5712,7 +5720,7 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
lwtunnel_fill_encap(skb, dst->lwtstate, RTA_ENCAP, RTA_ENCAP_TYPE) < 0)
goto nla_put_failure;
} else if (rt->fib6_nsiblings) {
- struct fib6_info *sibling, *next_sibling;
+ struct fib6_info *sibling;
struct nlattr *mp;
mp = nla_nest_start_noflag(skb, RTA_MULTIPATH);
@@ -5724,14 +5732,21 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
0) < 0)
goto nla_put_failure;
- list_for_each_entry_safe(sibling, next_sibling,
- &rt->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &rt->fib6_siblings,
+ fib6_siblings) {
if (fib_add_nexthop(skb, &sibling->fib6_nh->nh_common,
sibling->fib6_nh->fib_nh_weight,
- AF_INET6, 0) < 0)
+ AF_INET6, 0) < 0) {
+ rcu_read_unlock();
+
goto nla_put_failure;
+ }
}
+ rcu_read_unlock();
+
nla_nest_end(skb, mp);
} else if (rt->nh) {
if (nla_put_u32(skb, RTA_NH_ID, rt->nh->id))
@@ -6168,7 +6183,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
err = -ENOBUFS;
seq = info->nlh ? info->nlh->nlmsg_seq : 0;
- skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any());
+ skb = nlmsg_new(rt6_nlmsg_size(rt), GFP_ATOMIC);
if (!skb)
goto errout;
@@ -6181,7 +6196,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
goto errout;
}
rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
- info->nlh, gfp_any());
+ info->nlh, GFP_ATOMIC);
return;
errout:
if (err < 0)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 91a48efb140b..efaf0e0bc459 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -91,6 +91,7 @@ TEST_PROGS += test_vxlan_mdb.sh
TEST_PROGS += test_bridge_neigh_suppress.sh
TEST_PROGS += test_vxlan_nolocalbypass.sh
TEST_PROGS += test_bridge_backup_port.sh
+TEST_PROGS += ipv6_route_update_soft_lockup.sh
TEST_FILES := settings
TEST_FILES += in_netns.sh lib.sh net_helper.sh setup_loopback.sh setup_veth.sh
diff --git a/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
new file mode 100755
index 000000000000..a6b2b1f9c641
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
@@ -0,0 +1,262 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing for potential kernel soft lockup during IPv6 routing table
+# refresh under heavy outgoing IPv6 traffic. If a kernel soft lockup
+# occurs, a kernel panic will be triggered to prevent associated issues.
+#
+#
+# Test Environment Layout
+#
+# ┌----------------┐ ┌----------------┐
+# | SOURCE_NS | | SINK_NS |
+# | NAMESPACE | | NAMESPACE |
+# |(iperf3 clients)| |(iperf3 servers)|
+# | | | |
+# | | | |
+# | ┌-----------| nexthops |---------┐ |
+# | |veth_source|<--------------------------------------->|veth_sink|<┐ |
+# | └-----------|2001:0DB8:1::0:1/96 2001:0DB8:1::1:1/96 |---------┘ | |
+# | | ^ 2001:0DB8:1::1:2/96 | | |
+# | | . . | fwd | |
+# | ┌---------┐ | . . | | |
+# | | IPv6 | | . . | V |
+# | | routing | | . 2001:0DB8:1::1:80/96| ┌-----┐ |
+# | | table | | . | | lo | |
+# | | nexthop | | . └--------┴-----┴-┘
+# | | update | | ............................> 2001:0DB8:2::1:1/128
+# | └-------- ┘ |
+# └----------------┘
+#
+# The test script sets up two network namespaces, source_ns and sink_ns,
+# connected via a veth link. Within source_ns, it continuously updates the
+# IPv6 routing table by flushing and inserting IPV6_NEXTHOP_ADDR_COUNT nexthop
+# IPs destined for SINK_LOOPBACK_IP_ADDR in sink_ns. This refresh occurs at a
+# rate of 1/ROUTING_TABLE_REFRESH_PERIOD per second for TEST_DURATION seconds.
+#
+# Simultaneously, multiple iperf3 clients within source_ns generate heavy
+# outgoing IPv6 traffic. Each client is assigned a unique port number starting
+# at 5000 and incrementing sequentially. Each client targets a unique iperf3
+# server running in sink_ns, connected to the SINK_LOOPBACK_IFACE interface
+# using the same port number.
+#
+# The number of iperf3 servers and clients is set to half of the total
+# available cores on each machine.
+#
+# NOTE: We have tested this script on machines with various CPU specifications,
+# ranging from lower to higher performance as listed below. The test script
+# effectively triggered a kernel soft lockup on machines running an unpatched
+# kernel in under a minute:
+#
+# - 1x Intel Xeon E-2278G 8-Core Processor @ 3.40GHz
+# - 1x Intel Xeon E-2378G Processor 8-Core @ 2.80GHz
+# - 1x AMD EPYC 7401P 24-Core Processor @ 2.00GHz
+# - 1x AMD EPYC 7402P 24-Core Processor @ 2.80GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 1x Ampere Altra Q80-30 80-Core Processor @ 3.00GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 2x Intel Xeon Silver 4214 24-Core Processor @ 2.20GHz
+# - 1x AMD EPYC 7502P 32-Core @ 2.50GHz
+# - 1x Intel Xeon Gold 6314U 32-Core Processor @ 2.30GHz
+# - 2x Intel Xeon Gold 6338 32-Core Processor @ 2.00GHz
+#
+# On less performant machines, you may need to increase the TEST_DURATION
+# parameter to enhance the likelihood of encountering a race condition leading
+# to a kernel soft lockup and avoid a false negative result.
+#
+# NOTE: The test may not produce the expected result in virtualized
+# environments (e.g., qemu) due to differences in timing and CPU handling,
+# which can affect the conditions needed to trigger a soft lockup.
+
+source lib.sh
+source net_helper.sh
+
+TEST_DURATION=300
+ROUTING_TABLE_REFRESH_PERIOD=0.01
+
+IPERF3_BITRATE="300m"
+
+
+IPV6_NEXTHOP_ADDR_COUNT="128"
+IPV6_NEXTHOP_ADDR_MASK="96"
+IPV6_NEXTHOP_PREFIX="2001:0DB8:1"
+
+
+SOURCE_TEST_IFACE="veth_source"
+SOURCE_TEST_IP_ADDR="2001:0DB8:1::0:1/96"
+
+SINK_TEST_IFACE="veth_sink"
+# ${SINK_TEST_IFACE} is populated with the following range of IPv6 addresses:
+# 2001:0DB8:1::1:1 to 2001:0DB8:1::1:${IPV6_NEXTHOP_ADDR_COUNT}
+SINK_LOOPBACK_IFACE="lo"
+SINK_LOOPBACK_IP_MASK="128"
+SINK_LOOPBACK_IP_ADDR="2001:0DB8:2::1:1"
+
+nexthop_ip_list=""
+termination_signal=""
+kernel_softlokup_panic_prev_val=""
+
+terminate_ns_processes_by_pattern() {
+ local ns=$1
+ local pattern=$2
+
+ for pid in $(ip netns pids ${ns}); do
+ [ -e /proc/$pid/cmdline ] && grep -qe "${pattern}" /proc/$pid/cmdline && kill -9 $pid
+ done
+}
+
+cleanup() {
+ echo "info: cleaning up namespaces and terminating all processes within them..."
+
+
+ # Terminate iperf3 instances running in the source_ns. To avoid race
+ # conditions, first iterate over the PIDs and terminate those
+ # associated with the bash shells running the
+ # `while true; do iperf3 -c ...; done` loops. In a second iteration,
+ # terminate the individual `iperf3 -c ...` instances.
+ terminate_ns_processes_by_pattern ${source_ns} while
+ terminate_ns_processes_by_pattern ${source_ns} iperf3
+
+ # Repeat the same process for sink_ns
+ terminate_ns_processes_by_pattern ${sink_ns} while
+ terminate_ns_processes_by_pattern ${sink_ns} iperf3
+
+ # Check if any iperf3 instances are still running. This could happen
+ # if a core has entered an infinite loop and the timeout for detecting
+ # the soft lockup has not expired, but either the test interval has
+ # already elapsed or the test was terminated manually (e.g., with ^C)
+ for pid in $(ip netns pids ${source_ns}); do
+ if [ -e /proc/$pid/cmdline ] && grep -qe 'iperf3' /proc/$pid/cmdline; then
+ echo "FAIL: unable to terminate some iperf3 instances. Soft lockup is underway. A kernel panic is on the way!"
+ exit ${ksft_fail}
+ fi
+ done
+
+ if [ "$termination_signal" == "SIGINT" ]; then
+ echo "SKIP: Termination due to ^C (SIGINT)"
+ elif [ "$termination_signal" == "SIGALRM" ]; then
+ echo "PASS: No kernel soft lockup occurred during this ${TEST_DURATION} second test"
+ fi
+
+ cleanup_ns ${source_ns} ${sink_ns}
+
+ sysctl -qw kernel.softlockup_panic=${kernel_softlokup_panic_prev_val}
+}
+
+setup_prepare() {
+ setup_ns source_ns sink_ns
+
+ ip -n ${source_ns} link add name ${SOURCE_TEST_IFACE} type veth peer name ${SINK_TEST_IFACE} netns ${sink_ns}
+
+ # Setting up the Source namespace
+ ip -n ${source_ns} addr add ${SOURCE_TEST_IP_ADDR} dev ${SOURCE_TEST_IFACE}
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} qlen 10000
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} up
+ ip netns exec ${source_ns} sysctl -qw net.ipv6.fib_multipath_hash_policy=1
+
+ # Setting up the Sink namespace
+ ip -n ${sink_ns} addr add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} dev ${SINK_LOOPBACK_IFACE}
+ ip -n ${sink_ns} link set dev ${SINK_LOOPBACK_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_LOOPBACK_IFACE}.forwarding=1
+
+ ip -n ${sink_ns} link set ${SINK_TEST_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_TEST_IFACE}.forwarding=1
+
+
+ # Populate nexthop IPv6 addresses on the test interface in the sink_ns
+ echo "info: populating ${IPV6_NEXTHOP_ADDR_COUNT} IPv6 addresses on the ${SINK_TEST_IFACE} interface ..."
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ ip -n ${sink_ns} addr add ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" "${IP}")/${IPV6_NEXTHOP_ADDR_MASK} dev ${SINK_TEST_IFACE};
+ done
+
+ # Preparing list of nexthops
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ nexthop_ip_list=$nexthop_ip_list" nexthop via ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" $IP) dev ${SOURCE_TEST_IFACE} weight 1"
+ done
+}
+
+
+test_soft_lockup_during_routing_table_refresh() {
+ # Start num_of_iperf_servers iperf3 servers in the sink_ns namespace,
+ # each listening on ports starting at 5001 and incrementing
+ # sequentially. Since iperf3 instances may terminate unexpectedly, a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 servers in the sink_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 --bind ${SINK_LOOPBACK_IP_ADDR} -s -p $(printf '5%03d' ${i}) --rcv-timeout 200 &>/dev/null"
+ ip netns exec ${sink_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ # Wait for the iperf3 servers to be ready
+ for i in $(seq ${num_of_iperf_servers}); do
+ port=$(printf '5%03d' ${i});
+ wait_local_port_listen ${sink_ns} ${port} tcp
+ done
+
+ # Continuously refresh the routing table in the background within
+ # the source_ns namespace
+ ip netns exec ${source_ns} bash -c "
+ while \$(ip netns list | grep -q ${source_ns}); do
+ ip -6 route add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} ${nexthop_ip_list};
+ sleep ${ROUTING_TABLE_REFRESH_PERIOD};
+ ip -6 route delete ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK};
+ done &"
+
+ # Start num_of_iperf_servers iperf3 clients in the source_ns namespace,
+ # each sending TCP traffic on sequential ports starting at 5001.
+ # Since iperf3 instances may terminate unexpectedly (e.g., if the route
+ # to the server is deleted in the background during a route refresh), a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 clients in the source_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 -c ${SINK_LOOPBACK_IP_ADDR} -p $(printf '5%03d' ${i}) --length 64 --bitrate ${IPERF3_BITRATE} -t 0 --connect-timeout 150 &>/dev/null"
+ ip netns exec ${source_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ echo "info: IPv6 routing table is being updated at the rate of $(echo "1/${ROUTING_TABLE_REFRESH_PERIOD}" | bc)/s for ${TEST_DURATION} seconds ..."
+ echo "info: A kernel soft lockup, if detected, results in a kernel panic!"
+
+ wait
+}
+
+# Make sure 'iperf3' is installed, skip the test otherwise
+if [ ! -x "$(command -v "iperf3")" ]; then
+ echo "SKIP: 'iperf3' is not installed. Skipping the test."
+ exit ${ksft_skip}
+fi
+
+# Determine the number of cores on the machine
+num_of_iperf_servers=$(( $(nproc)/2 ))
+
+# Check if we are running on a multi-core machine, skip the test otherwise
+if [ "${num_of_iperf_servers}" -eq 0 ]; then
+ echo "SKIP: This test is not valid on a single core machine!"
+ exit ${ksft_skip}
+fi
+
+# Since the kernel soft lockup we're testing causes at least one core to enter
+# an infinite loop, destabilizing the host and likely affecting subsequent
+# tests, we trigger a kernel panic instead of reporting a failure and
+# continuing
+kernel_softlokup_panic_prev_val=$(sysctl -n kernel.softlockup_panic)
+sysctl -qw kernel.softlockup_panic=1
+
+handle_sigint() {
+ termination_signal="SIGINT"
+ cleanup
+ exit ${ksft_skip}
+}
+
+handle_sigalrm() {
+ termination_signal="SIGALRM"
+ cleanup
+ exit ${ksft_pass}
+}
+
+trap handle_sigint SIGINT
+trap handle_sigalrm SIGALRM
+
+(sleep ${TEST_DURATION} && kill -s SIGALRM $$)&
+
+setup_prepare
+test_soft_lockup_during_routing_table_refresh
--
2.35.3
From: Al Viro <viro(a)zeniv.linux.org.uk>
[ Upstream commit 50e43a57334400668952f8e551c9d87d3ed2dfef ]
We get there when sigreturn has performed obscene acts on kernel stack;
in particular, the location of pt_regs has shifted. We are about to call
syscall_trace(), which might stop for tracer. If that happens, we'd better
have task_pt_regs() returning correct result...
Fucked-up-by: Al Viro <viro(a)zeniv.linux.org.uk>
Fixes: bd6f56a75bb2 ("m68k: Missing syscall_trace() on sigreturn")
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Tested-by: Michael Schmitz <schmitzmic(a)gmail.com>
Reviewed-by: Michael Schmitz <schmitzmic(a)gmail.com>
Tested-by: Finn Thain <fthain(a)linux-m68k.org>
Link: https://lore.kernel.org/r/YP2dMWeV1LkHiOpr@zeniv-ca.linux.org.uk
Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Signed-off-by: Finn Thain <fthain(a)linux-m68k.org>
---
arch/m68k/kernel/entry.S | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/m68k/kernel/entry.S b/arch/m68k/kernel/entry.S
index 417d8f0e8962..0d03b4f2077b 100644
--- a/arch/m68k/kernel/entry.S
+++ b/arch/m68k/kernel/entry.S
@@ -182,6 +182,8 @@ ENTRY(ret_from_signal)
movel %curptr@(TASK_STACK),%a1
tstb %a1@(TINFO_FLAGS+2)
jge 1f
+ lea %sp@(SWITCH_STACK_SIZE),%a1
+ movel %a1,%curptr@(TASK_THREAD+THREAD_ESP0)
jbsr syscall_trace
1: RESTORE_SWITCH_STACK
addql #4,%sp
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit a3616a3c02722d1edb95acc7fceade242f6553ba ]
In the fpsp040 code when copyin or copyout fails call
force_sigsegv(SIGSEGV) instead of do_exit(SIGSEGV).
This solves a couple of problems. Because do_exit embeds the ptrace
stop PTRACE_EVENT_EXIT a complete stack frame needs to be present for
that to work correctly. There is always the information needed for a
ptrace stop where get_signal is called. So exiting with a signal
solves the ptrace issue.
Further exiting with a signal ensures that all of the threads in a
process are killed not just the thread that malfunctioned. Which
avoids confusing userspace.
To make force_sigsegv(SIGSEGV) work in fpsp040_die modify the code to
save all of the registers and jump to ret_from_exception (which
ultimately calls get_signal) after fpsp040_die returns.
v2: Updated the branches to use gas's pseudo ops that automatically
calculate the best branch instruction to use for the purpose.
v1: https://lkml.kernel.org/r/87a6m8kgtx.fsf_-_@disp2133
Link: https://lkml.kernel.org/r/87tukghjfs.fsf_-_@disp2133
Acked-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Signed-off-by: Finn Thain <fthain(a)linux-m68k.org>
---
arch/m68k/fpsp040/skeleton.S | 3 ++-
arch/m68k/kernel/traps.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/m68k/fpsp040/skeleton.S b/arch/m68k/fpsp040/skeleton.S
index 31a9c634c81e..081922c72daa 100644
--- a/arch/m68k/fpsp040/skeleton.S
+++ b/arch/m68k/fpsp040/skeleton.S
@@ -502,7 +502,8 @@ in_ea:
.section .fixup,"ax"
.even
1:
- jbra fpsp040_die
+ jbsr fpsp040_die
+ jbra .Lnotkern
.section __ex_table,"a"
.align 4
diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c
index 35f706d836c5..c6f18dc5884b 100644
--- a/arch/m68k/kernel/traps.c
+++ b/arch/m68k/kernel/traps.c
@@ -1155,7 +1155,7 @@ asmlinkage void set_esp0(unsigned long ssp)
*/
asmlinkage void fpsp040_die(void)
{
- do_exit(SIGSEGV);
+ force_sigsegv(SIGSEGV);
}
#ifdef CONFIG_M68KFPU_EMU
Since commit 25f39d3dcb48 ("s390/pci: Ignore RID for isolated VFs") PFs
which are not initially configured but in standby are considered
isolated. That is they create only a single function PCI domain. Due to
the PCI domains being created on discovery, this means that even if they
are configured later on, sibling PFs and their child VFs will not be
added to their PCI domain breaking SR-IOV expectations.
The reason the referenced commit ignored standby PFs for the creation of
multi-function PCI subhierarchies, was to work around a PCI domain
renumbering scenario on reboot. The renumbering would occur after
removing a previously in standby PF, whose domain number is used for its
configured sibling PFs and their child VFs, but which itself remained in
standby. When this is followed by a reboot, the sibling PF is used
instead to determine the PCI domain number of it and its child VFs.
In principle it is not possible to know which standby PFs will be
configured later and which may be removed. The PCI domain and root bus
are pre-requisites for hotplug slots so the decision of which functions
belong to which domain can not be postponed. With the renumbering
occurring only in rare circumstances and being generally benign, accept
it as an oddity and fix SR-IOV for initially standby PFs simply by
allowing them to create PCI domains.
Cc: stable(a)vger.kernel.org
Reviewed-by: Gerd Bayer <gbayer(a)linux.ibm.com>
Fixes: 25f39d3dcb48 ("s390/pci: Ignore RID for isolated VFs")
Signed-off-by: Niklas Schnelle <schnelle(a)linux.ibm.com>
---
Changes in v3:
- Add R-b from Gerd
- Add Cc: stable…
- Add commas (Sandy)
Changes in v2:
- Reword commit message
---
arch/s390/pci/pci_bus.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/s390/pci/pci_bus.c b/arch/s390/pci/pci_bus.c
index d5ace00d10f04285f899284481f1e426187d4ff4..857afbc4828f0c677f88cc80dd4a5fff104a615a 100644
--- a/arch/s390/pci/pci_bus.c
+++ b/arch/s390/pci/pci_bus.c
@@ -171,7 +171,6 @@ void zpci_bus_scan_busses(void)
static bool zpci_bus_is_multifunction_root(struct zpci_dev *zdev)
{
return !s390_pci_no_rid && zdev->rid_available &&
- zpci_is_device_configured(zdev) &&
!zdev->vfn;
}
---
base-commit: 6b7afe1a2b6905e42fe45bd7015f20baa856e28e
change-id: 20250116-fix_standby_pf-e1d51394e9b3
Best regards,
--
Niklas Schnelle
On pSeries, when user attempts to use the same vfio container used by
different iommu group, the spapr_tce_set_window() returns -EPERM
and the subsequent cleanup leads to the below crash.
Kernel attempted to read user page (308) - exploit attempt?
BUG: Kernel NULL pointer dereference on read at 0x00000308
Faulting instruction address: 0xc0000000001ce358
Oops: Kernel access of bad area, sig: 11 [#1]
NIP: c0000000001ce358 LR: c0000000001ce05c CTR: c00000000005add0
<snip>
NIP [c0000000001ce358] spapr_tce_unset_window+0x3b8/0x510
LR [c0000000001ce05c] spapr_tce_unset_window+0xbc/0x510
Call Trace:
spapr_tce_unset_window+0xbc/0x510 (unreliable)
tce_iommu_attach_group+0x24c/0x340 [vfio_iommu_spapr_tce]
vfio_container_attach_group+0xec/0x240 [vfio]
vfio_group_fops_unl_ioctl+0x548/0xb00 [vfio]
sys_ioctl+0x754/0x1580
system_call_exception+0x13c/0x330
system_call_vectored_common+0x15c/0x2ec
<snip>
--- interrupt: 3000
Fix this by having null check for the tbl passed to the
spapr_tce_unset_window().
Fixes: f431a8cde7f1 ("powerpc/iommu: Reimplement the iommu_table_group_ops for pSeries")
Cc: stable(a)vger.kernel.org
Reported-by: Vaishnavi Bhat <vaish123(a)in.ibm.com>
Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.ibm.com>
---
arch/powerpc/platforms/pseries/iommu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 534cd159e9ab..78b895b568b3 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -2205,6 +2205,9 @@ static long spapr_tce_unset_window(struct iommu_table_group *table_group, int nu
const char *win_name;
int ret = -ENODEV;
+ if (!tbl) /* The table was never created OR window was never opened */
+ return 0;
+
mutex_lock(&dma_win_init_mutex);
if ((num == 0) && is_default_window_table(table_group, tbl))
The PE Reset State "0" returned by RTAS calls
"ibm_read_slot_reset_[state|state2]" indicates that the reset is
deactivated and the PE is in a state where MMIO and DMA are allowed.
However, the current implementation of "pseries_eeh_get_state()" does
not reflect this, causing drivers to incorrectly assume that MMIO and
DMA operations cannot be resumed.
The userspace drivers as a part of EEH recovery using VFIO ioctls fail
to detect when the recovery process is complete. The VFIO_EEH_PE_GET_STATE
ioctl does not report the expected EEH_PE_STATE_NORMAL state, preventing
userspace drivers from functioning properly on pseries systems.
The patch addresses this issue by updating 'pseries_eeh_get_state()'
to include "EEH_STATE_MMIO_ENABLED" and "EEH_STATE_DMA_ENABLED" in
the result mask for PE Reset State "0". This ensures correct state
reporting to the callers, aligning the behavior with the PAPR specification
and fixing the bug in EEH recovery for VFIO user workflows.
Fixes: 00ba05a12b3c ("powerpc/pseries: Cleanup on pseries_eeh_get_state()")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Signed-off-by: Narayana Murty N <nnmlinux(a)linux.ibm.com>
---
Changelog:
V1:https://lore.kernel.org/all/20241107042027.338065-1-nnmlinux@linux.ibm.c…
--added Fixes tag for "powerpc/pseries: Cleanup on
pseries_eeh_get_state()".
V2:https://lore.kernel.org/stable/20241212075044.10563-1-nnmlinux%40linux.i…
--Updated the patch description to include it in the stable kernel tree.
V3:https://lore.kernel.org/all/87v7vm8pwz.fsf@gmail.com/
--Updated commit description.
---
arch/powerpc/platforms/pseries/eeh_pseries.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 1893f66371fa..b12ef382fec7 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -580,8 +580,10 @@ static int pseries_eeh_get_state(struct eeh_pe *pe, int *delay)
switch(rets[0]) {
case 0:
- result = EEH_STATE_MMIO_ACTIVE |
- EEH_STATE_DMA_ACTIVE;
+ result = EEH_STATE_MMIO_ACTIVE |
+ EEH_STATE_DMA_ACTIVE |
+ EEH_STATE_MMIO_ENABLED |
+ EEH_STATE_DMA_ENABLED;
break;
case 1:
result = EEH_STATE_RESET_ACTIVE |
--
2.47.1
There is a period of time after returning from a KVM_RUN ioctl where
userspace may use SVE without trapping, but the kernel can unexpectedly
discard the live SVE state. Eric Auger has observed this causing QEMU
crashes where SVE is used by memmove():
https://issues.redhat.com/browse/RHEL-68997
The only state discarded is the user SVE state of the task which issued
the KVM_RUN ioctl. Other tasks are unaffected, plain FPSIMD state is
unaffected, and kernel state is unaffected.
This happens because fpsimd_kvm_prepare() incorrectly manipulates the
FPSIMD/SVE state. When the vCPU is loaded, fpsimd_kvm_prepare()
unconditionally clears TIF_SVE but does not reconfigure CPACR_EL1.ZEN to
trap userspace SVE usage. If the vCPU does not use FPSIMD/SVE and hyp
does not save the host's FPSIMD/SVE state, the kernel may return to
userspace with TIF_SVE clear while SVE is still enabled in
CPACR_EL1.ZEN. Subsequent userspace usage of SVE will not be trapped,
and the next save of userspace FPSIMD/SVE state will only store the
FPSIMD portion due to TIF_SVE being clear, discarding any SVE state.
The broken logic was originally introduced in commit:
93ae6b01bafee8fa ("KVM: arm64: Discard any SVE state when entering KVM guests")
... though at the time fp_user_discard() would reconfigure CPACR_EL1.ZEN
to trap subsequent SVE usage, masking the issue until that logic was
removed in commit:
8c845e2731041f0f ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Avoid this issue by reconfiguring CPACR_EL1.ZEN when clearing
TIF_SVE. At the same time, add a comment to explain why
current->thread.fp_type must be set even though the FPSIMD state is not
foreign. A similar issue exists when SME is enabled, and will require
further rework. As SME currently depends on BROKEN, a BUILD_BUG() and
comment are added for now, and this issue will need to be fixed properly
in a follow-up patch.
Commit 93ae6b01bafee8fa also introduced an unintended ptrace ABI change.
Unconditionally clearing TIF_SVE regardless of whether the state is
foreign discards saved SVE state created by ptrace after syscall entry.
Avoid this by only clearing TIF_SVE when the FPSIMD/SVE state is not
foreign. When the state is foreign, KVM hyp code does not need to save
any host state, and so this will not affect KVM.
There appear to be further issues with unintentional SVE state
discarding, largely impacting ptrace and signal handling, which will
need to be addressed in separate patches.
Reported-by: Eric Auger <eauger(a)redhat.com>
Reported-by: Wilco Dijkstra <wilco.dijkstra(a)arm.com>
Cc: stable(a)vger.kernel.org
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Jeremy Linton <jeremy.linton(a)arm.com>
Cc: Marc Zyngier <maz(a)kernel.org>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Oliver Upton <oliver.upton(a)linux.dev>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Will Deacon <will(a)kernel.org>
Signed-off-by: Mark Rutland <mark.rutland(a)arm.com>
---
arch/arm64/kernel/fpsimd.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
I believe there are some other issues in this area, but I'm sending this
out on its own because I beleive the other issues are more complex while
this is self-contained, and people are actively hitting this case in
production.
I intend to follow-up with fixes for the other cases I mention in the
commit message, and for the SME case with the BUILD_BUG_ON().
Mark.
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 8c4c1a2186cc5..e4053a90ed240 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -1711,8 +1711,24 @@ void fpsimd_kvm_prepare(void)
*/
get_cpu_fpsimd_context();
- if (test_and_clear_thread_flag(TIF_SVE)) {
- sve_to_fpsimd(current);
+ if (!test_thread_flag(TIF_FOREIGN_FPSTATE) &&
+ test_and_clear_thread_flag(TIF_SVE)) {
+ sve_user_disable();
+
+ /*
+ * The KVM hyp code doesn't set fp_type when saving the host's
+ * FPSIMD state. Set fp_type here in case the hyp code saves
+ * the host state.
+ *
+ * If hyp code does not save the host state, then the host
+ * state remains live on the CPU and saved fp_type is
+ * irrelevant until it is overwritten by a later call to
+ * fpsimd_save_user_state().
+ *
+ * This is *NOT* sufficient when CONFIG_ARM64_SME=y, where
+ * fp_type can be FP_STATE_SVE regardless of TIF_SVE.
+ */
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_ARM64_SME));
current->thread.fp_type = FP_STATE_FPSIMD;
}
--
2.30.2