Hello:
This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni(a)redhat.com>:
On Fri, 14 Mar 2025 21:11:30 +0100 you wrote:
> Here are 3 unrelated fixes for the net tree.
>
> - Patch 1: fix data stream corruption when ending up not sending an
> ADD_ADDR.
>
> - Patch 2: fix missing getsockopt(IPV6_V6ONLY) support -- the set part
> is supported.
>
> [...]
Here is the summary with links:
- [net,1/3] mptcp: Fix data stream corruption in the address announcement
(no matching commit)
- [net,2/3] mptcp: sockopt: fix getting IPV6_V6ONLY
https://git.kernel.org/netdev/net-next/c/8c3963375988
- [net,3/3] mptcp: sockopt: fix getting freebind & transparent
https://git.kernel.org/netdev/net-next/c/e2f4ac7bab22
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Hello,
The following change seems to cause kexec to sometimes fail (not every
time, but with about a 50% chance) for the stable 6.1 kernels, 6.1.129 and later:
6821918f4519 ("x86/kexec: Allocate PGD for x86_64 transition page tables
separately")
The commit message for that commit states that it depends on another
change, "x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating
userspace page tables", but that change does not seem to have been
applied to the 6.1 kernel series, which could explain why the
6821918f4519 change causes problems for 6.1.
This appears to be a problem only for 6.1; the 6.6 and later stable
kernels are unaffected.
I think the reason this problem is seen only for 6.1 and not for 6.6 and
later is that the change "x86/kexec: Allocate PGD for x86_64 transition
page tables separately" relies on things that are not available in 6.1.
In the tests I have done, kexec is called via u-root.
More details are available here:
https://git.glasklar.is/system-transparency/core/stboot/-/issues/227
Cheers,
Elias
Hello all,
I encountered an issue with the CPU affinity of tasks launched by
systemd in a slice, after updating from systemd 254 to systemd >= 256,
on the LTS 5.15.x branch (tested on v5.15.179).
Despite the slice file stipulating AllowedCPUs=2 (and confirming this
was set in /sys/fs/cgroup/test.slice/cpuset.cpus), tasks launched in
the slice would have the CPU affinity of system.slice (i.e. all CPUs
by default) rather than 2.
To reproduce:
* Check kernel version and systemd version (I used a debian testing
image for testing)
```
# uname -r
5.15.179
# systemctl --version
systemd 257 (257.4-3)
...
```
* Create a test.slice with AllowedCPUs=2
```
# cat <<EOF > /usr/lib/systemd/system/test.slice
[Unit]
Description=Test slice
Before=slices.target
[Slice]
AllowedCPUs=2
[Install]
WantedBy=slices.target
EOF
# systemctl daemon-reload && systemctl start test.slice
```
* Confirm cpuset
```
# cat /sys/fs/cgroup/test.slice/cpuset.cpus
2
```
* Launch task in slice
```
# systemd-run --slice test.slice yes
Running as unit: run-r9187b97c6958498aad5bba213289ac56.service; invocation ID: f470f74047ac43b7a60861d03a7ef6f9
# cat /sys/fs/cgroup/test.slice/run-r9187b97c6958498aad5bba213289ac56.service/cgroup.procs
317
```
* Check affinity
```
# taskset -pc 317
pid 317's current affinity list: 0-7
```
This issue is fixed by applying upstream commits:
18f9a4d47527772515ad6cbdac796422566e6440
cgroup/cpuset: Skip spread flags update on v2
and
42a11bf5c5436e91b040aeb04063be1710bb9f9c
cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
With these applied:
```
# systemd-run --slice test.slice yes
Running as unit: run-r442c444559ff49f48c6c2b8325b3b500.service; invocation ID: 5211167267154e9292cb6b854585cb91
# cat /sys/fs/cgroup/test.slice/run-r442c444559ff49f48c6c2b8325b3b500.service/cgroup.procs
291
# taskset -pc 291
pid 291's current affinity list: 2
```
Perhaps these are good candidates for backport onto the 5.15 LTS
branch?
Thanks
James
When get_block is called with a buffer_head allocated on the stack, as
in do_mpage_readpage(), stack corruption due to a buffer_head
use-after-free may occur in the following race:
<CPU 0>                                 <CPU 1>
mpage_read_folio
  <<bh on stack>>
  do_mpage_readpage
    exfat_get_block
      bh_read
        __bh_read
          get_bh(bh)
          submit_bh
          wait_on_buffer
                                        end_buffer_read_sync
                                          __end_buffer_read_notouch
                                            unlock_buffer
          <<keep going>>
  ...
<<bh is not valid out of mpage_read_folio>>
  .
  .
another_function
  <<variable A on stack>>
                                        put_bh(bh)
                                          atomic_dec(bh->b_count)
                                        * stack corruption here *
This patch returns -EAGAIN if a folio does not have buffers when bh_read
needs to be called. By doing this, the caller can fall back to functions
like block_read_full_folio(), create buffer_heads in the folio, and then
call get_block again.
Let's not call bh_read() with an on-stack buffer_head.
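For illustration, a minimal user-space analogue of the bug pattern (a
simplified, hypothetical sketch, not kernel code): a worker thread keeps
a pointer to a stack variable and writes through it after the owning
frame has returned, just as the deferred put_bh() above writes into a
dead stack frame.
```
/*
 * User-space analogue (simplified, hypothetical) of the on-stack
 * buffer_head use-after-free: 'worker' stands in for the I/O
 * completion path that still holds a reference to the stack bh.
 */
#include <pthread.h>
#include <unistd.h>

static int *stale;			/* pointer leaked to the "completion" */

static void *worker(void *arg)
{
	sleep(1);			/* completion runs after the frame is gone */
	(*stale)--;			/* stands in for put_bh(): writes a dead frame */
	return NULL;
}

static void read_folio(pthread_t *t)
{
	int bh_on_stack = 1;		/* stands in for the on-stack buffer_head */

	stale = &bh_on_stack;
	pthread_create(t, NULL, worker, NULL);
}					/* frame dies here; 'stale' dangles */

int main(void)
{
	pthread_t t;

	read_folio(&t);
	pthread_join(t, NULL);		/* the write above corrupted a dead frame */
	return 0;
}
```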
Fixes: 11a347fb6cef ("exfat: change to get file size from DataLength")
Cc: stable(a)vger.kernel.org
Tested-by: Yeongjin Gil <youngjin.gil(a)samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo(a)samsung.com>
---
v2:
- call clear_buffer_mapped() if there are any errors.
- remove unnecessary BUG_ON()
---
fs/exfat/inode.c | 39 +++++++++++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 6 deletions(-)
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index 96952d4acb50..f3fdba9f4d21 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -344,7 +344,8 @@ static int exfat_get_block(struct inode *inode, sector_t iblock,
* The block has been partially written,
* zero the unwritten part and map the block.
*/
- loff_t size, off, pos;
+ loff_t size, pos;
+ void *addr;
max_blocks = 1;
@@ -355,17 +356,41 @@ static int exfat_get_block(struct inode *inode, sector_t iblock,
if (!bh_result->b_folio)
goto done;
+ /*
+ * No buffer_head is allocated.
+ * (1) bmap: It's enough to fill bh_result without I/O.
+ * (2) read: The unwritten part should be filled with 0
+ * If a folio does not have any buffers,
+ * let's return -EAGAIN to fall back to
+ * per-bh IO like block_read_full_folio().
+ */
+ if (!folio_buffers(bh_result->b_folio)) {
+ err = -EAGAIN;
+ goto done;
+ }
+
pos = EXFAT_BLK_TO_B(iblock, sb);
size = ei->valid_size - pos;
- off = pos & (PAGE_SIZE - 1);
+ addr = folio_address(bh_result->b_folio) +
+ offset_in_folio(bh_result->b_folio, pos);
+
+ /* Check if bh->b_data points to proper addr in folio */
+ if (bh_result->b_data != addr) {
+ exfat_fs_error_ratelimit(sb,
+ "b_data(%p) != folio_addr(%p)",
+ bh_result->b_data, addr);
+ err = -EINVAL;
+ goto done;
+ }
- folio_set_bh(bh_result, bh_result->b_folio, off);
+ /* Read a block */
err = bh_read(bh_result, 0);
if (err < 0)
- goto unlock_ret;
+ goto done;
- folio_zero_segment(bh_result->b_folio, off + size,
- off + sb->s_blocksize);
+ /* Zero unwritten part of a block */
+ memset(bh_result->b_data + size, 0,
+ bh_result->b_size - size);
} else {
/*
* The range has not been written, clear the mapped flag
@@ -376,6 +401,8 @@ static int exfat_get_block(struct inode *inode, sector_t iblock,
}
done:
bh_result->b_size = EXFAT_BLK_TO_B(max_blocks, sb);
+ if (err < 0)
+ clear_buffer_mapped(bh_result);
unlock_ret:
mutex_unlock(&sbi->s_lock);
return err;
--
2.25.1
Hi again,
also for 6.1.132-rc1 the review hasn't started yet, but as it was
already available on [1], our CI has also tried to build it for ia64
in the morning. Unfortunately that failed, too - I assume due to the
following **missing** upstream commit:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm…
[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.…
Build failure (see [2]):
```
[...]
In file included from ./include/linux/string.h:5,
from ./include/linux/uuid.h:12,
from ./fs/xfs/xfs_linux.h:10,
from ./fs/xfs/xfs.h:22,
from fs/xfs/libxfs/xfs_alloc.c:6:
fs/xfs/libxfs/xfs_alloc.c: In function '__xfs_free_extent_later':
fs/xfs/libxfs/xfs_alloc.c:2551:51: error: 'mp' undeclared (first use in this function); did you mean 'tp'?
2551 | if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
| ^~
./include/linux/compiler.h:78:45: note: in definition of macro 'unlikely'
78 | # define unlikely(x) __builtin_expect(!!(x), 0)
| ^
fs/xfs/libxfs/xfs_alloc.c:2551:13: note: in expansion of macro 'XFS_IS_CORRUPT'
2551 | if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
| ^~~~~~~~~~~~~~
fs/xfs/libxfs/xfs_alloc.c:2551:51: note: each undeclared identifier is reported only once for each function it appears in
2551 | if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
| ^~
./include/linux/compiler.h:78:45: note: in definition of macro 'unlikely'
78 | # define unlikely(x) __builtin_expect(!!(x), 0)
| ^
fs/xfs/libxfs/xfs_alloc.c:2551:13: note: in expansion of macro 'XFS_IS_CORRUPT'
2551 | if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
| ^~~~~~~~~~~~~~
./fs/xfs/xfs_linux.h:225:63: warning: left-hand operand of comma expression has no effect [-Wunused-value]
225 | __this_address), \
| ^
fs/xfs/libxfs/xfs_alloc.c:2551:13: note: in expansion of macro 'XFS_IS_CORRUPT'
2551 | if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
| ^~~~~~~~~~~~~~
make[5]: *** [scripts/Makefile.build:250: fs/xfs/libxfs/xfs_alloc.o] Error 1
[...]
```
[2]: https://github.com/linux-ia64/linux-stable-rc/actions/runs/13914712427/job/…
[3] (7dfee17b13e5024c5c0ab1911859ded4182de3e5 upstream) introduced
the XFS_IS_CORRUPT() macro call now at `fs/xfs/libxfs/xfs_alloc.c:2551`,
but the variable "mp" is only declared when DEBUG is defined in
6.1.132-rc1. The above upstream commit (f6b3846) moves "mp" out of that
guard and hence should fix that specific build regression, IIUC. Again,
not build-tested yet, though.
[3]: https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.…
Cheers,
Frank
From: Ran Xiaokai <ran.xiaokai(a)zte.com.cn>
Lockdep reports this deadlock log:
osnoise: could not start sampling thread
============================================
WARNING: possible recursive locking detected
--------------------------------------------
CPU0
----
lock(cpu_hotplug_lock);
lock(cpu_hotplug_lock);
Call Trace:
<TASK>
print_deadlock_bug+0x282/0x3c0
__lock_acquire+0x1610/0x29a0
lock_acquire+0xcb/0x2d0
cpus_read_lock+0x49/0x120
stop_per_cpu_kthreads+0x7/0x60
start_kthread+0x103/0x120
osnoise_hotplug_workfn+0x5e/0x90
process_one_work+0x44f/0xb30
worker_thread+0x33e/0x5e0
kthread+0x206/0x3b0
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
This is the deadlock scenario:
osnoise_hotplug_workfn()
  guard(cpus_read_lock)();            // first lock call
  start_kthread(cpu)
    if (IS_ERR(kthread)) {
      stop_per_cpu_kthreads(); {
        cpus_read_lock();             // second lock call, causes the AA deadlock
      }
    }
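In miniature, the same AA pattern with a plain user-space mutex (an
analogue only; cpus_read_lock() is a percpu rwsem, and read-side
recursion in the kernel only deadlocks when a writer is pending):
```
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
	pthread_mutex_lock(&lock);	/* guard(cpus_read_lock)(): first acquisition */
	pthread_mutex_lock(&lock);	/* cpus_read_lock() again: hangs forever (AA) */
	return 0;
}
```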
It is not necessary to call stop_per_cpu_kthreads(), which stops the
osnoise kthreads for every other CPU in the system, if a failure occurs
during the hotplug of a certain CPU.
For start_per_cpu_kthreads(), if the start_kthread() call fails,
that function already calls stop_per_cpu_kthreads() to handle the error.
Therefore, similarly, there is no need to call stop_per_cpu_kthreads()
again within start_kthread().
So just remove stop_per_cpu_kthreads() from start_kthread() to solve this issue.
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Ran Xiaokai <ran.xiaokai(a)zte.com.cn>
---
kernel/trace/trace_osnoise.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 647516a73fa2..e732c9e37e14 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2006,7 +2006,6 @@ static int start_kthread(unsigned int cpu)
if (IS_ERR(kthread)) {
pr_err(BANNER "could not start sampling thread\n");
- stop_per_cpu_kthreads();
return -ENOMEM;
}
--
2.17.1
When get_block is called with a buffer_head allocated on the stack, as
in do_mpage_readpage(), stack corruption due to a buffer_head
use-after-free may occur in the following race:
<CPU 0>                                 <CPU 1>
mpage_read_folio
  <<bh on stack>>
  do_mpage_readpage
    exfat_get_block
      bh_read
        __bh_read
          get_bh(bh)
          submit_bh
          wait_on_buffer
                                        end_buffer_read_sync
                                          __end_buffer_read_notouch
                                            unlock_buffer
          <<keep going>>
  ...
<<bh is not valid out of mpage_read_folio>>
  .
  .
another_function
  <<variable A on stack>>
                                        put_bh(bh)
                                          atomic_dec(bh->b_count)
                                        * stack corruption here *
This patch returns -EAGAIN if a folio does not have buffers when bh_read
needs to be called. By doing this, the caller can fall back to functions
like block_read_full_folio(), create buffer_heads in the folio, and then
call get_block again.
Let's not call bh_read() with an on-stack buffer_head.
Fixes: 11a347fb6cef ("exfat: change to get file size from DataLength")
Cc: stable(a)vger.kernel.org
Tested-by: Yeongjin Gil <youngjin.gil(a)samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo(a)samsung.com>
---
fs/exfat/inode.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index 96952d4acb50..b8ea910586e4 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -344,7 +344,8 @@ static int exfat_get_block(struct inode *inode, sector_t iblock,
* The block has been partially written,
* zero the unwritten part and map the block.
*/
- loff_t size, off, pos;
+ loff_t size, pos;
+ void *addr;
max_blocks = 1;
@@ -355,17 +356,43 @@ static int exfat_get_block(struct inode *inode, sector_t iblock,
if (!bh_result->b_folio)
goto done;
+ /*
+ * No buffer_head is allocated.
+ * (1) bmap: It's enough to fill bh_result without I/O.
+ * (2) read: The unwritten part should be filled with 0
+ * If a folio does not have any buffers,
+ * let's return -EAGAIN to fall back to
+ * per-bh IO like block_read_full_folio().
+ */
+ if (!folio_buffers(bh_result->b_folio)) {
+ err = -EAGAIN;
+ goto done;
+ }
+
pos = EXFAT_BLK_TO_B(iblock, sb);
size = ei->valid_size - pos;
- off = pos & (PAGE_SIZE - 1);
+ addr = folio_address(bh_result->b_folio) +
+ offset_in_folio(bh_result->b_folio, pos);
+
+ BUG_ON(size > sb->s_blocksize);
+
+ /* Check if bh->b_data points to proper addr in folio */
+ if (bh_result->b_data != addr) {
+ exfat_fs_error_ratelimit(sb,
+ "b_data(%p) != folio_addr(%p)",
+ bh_result->b_data, addr);
+ err = -EINVAL;
+ goto done;
+ }
- folio_set_bh(bh_result, bh_result->b_folio, off);
+ /* Read a block */
err = bh_read(bh_result, 0);
if (err < 0)
goto unlock_ret;
- folio_zero_segment(bh_result->b_folio, off + size,
- off + sb->s_blocksize);
+ /* Zero unwritten part of a block */
+ memset(bh_result->b_data + size, 0,
+ bh_result->b_size - size);
} else {
/*
* The range has not been written, clear the mapped flag
--
2.25.1
Even when X86_FEATURE_PKU and X86_FEATURE_OSPKE are available,
XFEATURE_PKRU can be missing.
In such a case, pkeys have to be disabled to avoid a hang.
WARNING: CPU: 0 PID: 1 at arch/x86/kernel/fpu/xstate.c:1003 get_xsave_addr_user+0x28/0x40
(...)
Call Trace:
<TASK>
? get_xsave_addr_user+0x28/0x40
? __warn.cold+0x8e/0xea
? get_xsave_addr_user+0x28/0x40
? report_bug+0xff/0x140
? handle_bug+0x3b/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? get_xsave_addr_user+0x28/0x40
copy_fpstate_to_sigframe+0x1be/0x380
? __put_user_8+0x11/0x20
get_sigframe+0xf1/0x280
x64_setup_rt_frame+0x67/0x2c0
arch_do_signal_or_restart+0x1b3/0x240
syscall_exit_to_user_mode+0xb0/0x130
do_syscall_64+0xab/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
This fix is known to be needed on Apple Virtualization.
Tested with macOS 13.5.2 running on MacBook Pro 2020 with
Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz.
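For reference, a minimal user-space sketch (an assumption-laden
illustration, not part of this patch, and it presumes OSXSAVE is
enabled) of the condition cpu_has_xfeatures(XFEATURE_PKRU, NULL) tests:
whether XCR0 advertises the PKRU state component (bit 9).
```
/* Hypothetical user-space check: read XCR0 and test bit 9 (PKRU state). */
#include <stdio.h>
#include <stdint.h>

static uint64_t xgetbv0(void)
{
	uint32_t lo, hi;

	__asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	/* Bit 9 of XCR0 is the PKRU state component. */
	printf("XFEATURE_PKRU: %s\n",
	       (xgetbv0() & (1ULL << 9)) ? "present" : "missing");
	return 0;
}
```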
Fixes: 70044df250d0 ("x86/pkeys: Update PKRU to enable all pkeys before XSAVE")
Link: https://lore.kernel.org/regressions/CAG8fp8QvH71Wi_y7b7tgFp7knK38rfrF7rRHh-…
Link: https://github.com/lima-vm/lima/issues/3334
Signed-off-by: Akihiro Suda <akihiro.suda.cz(a)hco.ntt.co.jp>
---
arch/x86/kernel/cpu/common.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e9464fe411ac..4c2c268af214 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -517,7 +517,8 @@ static bool pku_disabled;
static __always_inline void setup_pku(struct cpuinfo_x86 *c)
{
if (c == &boot_cpu_data) {
- if (pku_disabled || !cpu_feature_enabled(X86_FEATURE_PKU))
+ if (pku_disabled || !cpu_feature_enabled(X86_FEATURE_PKU) ||
+ !cpu_has_xfeatures(XFEATURE_PKRU, NULL))
return;
/*
* Setting CR4.PKE will cause the X86_FEATURE_OSPKE cpuid
--
2.45.2
On 3/20/25 6:07 AM, James Thomas wrote:
> Hello all,
>
> I encountered an issue with the CPU affinity of tasks launched by systemd in a
> slice, after updating from systemd 254 to systemd >= 256, on the LTS 5.15.x
> branch (tested on v5.15.179).
>
> Despite the slice file stipulating AllowedCPUs=2 (and confirming this was set in
> /sys/fs/cgroup/test.slice/cpuset.cpus) tasks launched in the slice would have
> the CPU affinity of the system.slice (i.e all by default) rather than 2.
>
> To reproduce:
>
> * Check kernel version and systemd version (I used a debian testing image for
> testing)
>
> ```
> # uname -r
> 5.15.179
> # systemctl --version
> systemd 257 (257.4-3)
> ...
> ```
>
> * Create a test.slice with AllowedCPUs=2
>
> ```
> # cat <<EOF > /usr/lib/systemd/system/test.slice
> [Unit]
> Description=Test slice
> Before=slices.target
> [Slice]
> AllowedCPUs=2
> [Install]
> WantedBy=slices.target
> EOF
> # systemctl daemon-reload && systemctl start test.slice
> ```
>
> * Confirm cpuset
>
> ```
> # cat /sys/fs/cgroup/test.slice/cpuset.cpus
> 2
> ```
>
> * Launch task in slice
>
> ```
> # systemd-run --slice test.slice yes
> Running as unit: run-r9187b97c6958498aad5bba213289ac56.service; invocation ID:
> f470f74047ac43b7a60861d03a7ef6f9
> # cat
> /sys/fs/cgroup/test.slice/run-r9187b97c6958498aad5bba213289ac56.service/cgroup.procs
>
> 317
> ```
>
> * Check affinity
>
> ```
> # taskset -pc 317
> pid 317's current affinity list: 0-7
> ```
>
> This issue is fixed by applying upstream commits:
>
> 18f9a4d47527772515ad6cbdac796422566e6440
> cgroup/cpuset: Skip spread flags update on v2
> and
> 42a11bf5c5436e91b040aeb04063be1710bb9f9c
> cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
>
> With these applied:
>
> ```
> # systemd-run --slice test.slice yes
> Running as unit: run-r442c444559ff49f48c6c2b8325b3b500.service; invocation ID:
> 5211167267154e9292cb6b854585cb91
> # cat /sys/fs/cgroup/test.slice/run-r442c444559ff49f48c6c2b8325b3b500.service/cgroup.procs
> 291
> # taskset -pc 291
> pid 291's current affinity list: 2
> ```
>
> Perhaps these are good candidates for backport onto the 5.15 LTS branch?
>
> Thanks
> James
>
You should also send this email to stable(a)vger.kernel.org for
consideration for the 5.15 LTS branch.
Cheers,
Longman
Once a key's reference count has been reduced to 0, the garbage collector
thread may destroy it at any time and so key_put() is not allowed to touch
the key after that point. The most key_put() is normally allowed to do is
to touch key_gc_work as that's a static global variable.
However, in an effort to speed up the reclamation of quota, the quota
update is now done in key_put() once the key's usage count is reduced to
0 - but now the code is looking at the key after that point, which is
forbidden.
Fix this by using a flag to indicate that a key can be gc'd now rather than
looking at the key's refcount in the garbage collector.
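A user-space sketch of the publish pattern the patch relies on
(simplified and assumption-laden; C11 atomics stand in for
set_bit()/test_bit() and smp_mb()): all stores to the key must be
visible before the FINAL_PUT flag is set, and the collector must issue
the pairing barrier after seeing the flag and before touching the key.
```
#include <stdatomic.h>
#include <stdbool.h>

struct key_sketch {
	int qnbytes;			/* stands in for key->user quota state */
	atomic_bool final_put;		/* stands in for KEY_FLAG_FINAL_PUT */
};

/* key_put() side: publish only after the quota update is done. */
void final_put(struct key_sketch *k)
{
	k->qnbytes = 0;					/* quota update */
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb(): key state before flag */
	atomic_store_explicit(&k->final_put, true, memory_order_relaxed);
}

/* Garbage collector side: destroy only after the flag is seen. */
bool gc_may_destroy(struct key_sketch *k)
{
	if (!atomic_load_explicit(&k->final_put, memory_order_relaxed))
		return false;
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb(): clobber after flag seen */
	return true;
}

int main(void)
{
	struct key_sketch k = { .qnbytes = 100 };

	final_put(&k);
	return gc_may_destroy(&k) ? 0 : 1;
}
```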
Fixes: 9578e327b2b4 ("keys: update key quotas in key_put()")
Reported-by: syzbot+6105ffc1ded71d194d6d(a)syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells(a)redhat.com>
Tested-by: syzbot+6105ffc1ded71d194d6d(a)syzkaller.appspotmail.com
cc: Jarkko Sakkinen <jarkko(a)kernel.org>
cc: Oleg Nesterov <oleg(a)redhat.com>
cc: Kees Cook <kees(a)kernel.org>
cc: Hillf Danton <hdanton(a)sina.com>
cc: keyrings(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v6.10+
---
include/linux/key.h | 1 +
security/keys/gc.c | 4 +++-
security/keys/key.c | 2 ++
3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/key.h b/include/linux/key.h
index 074dca3222b9..ba05de8579ec 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -236,6 +236,7 @@ struct key {
#define KEY_FLAG_ROOT_CAN_INVAL 7 /* set if key can be invalidated by root without permission */
#define KEY_FLAG_KEEP 8 /* set if key should not be removed */
#define KEY_FLAG_UID_KEYRING 9 /* set if key is a user or user session keyring */
+#define KEY_FLAG_FINAL_PUT 10 /* set if final put has happened on key */
/* the key type and key description string
* - the desc is used to match a key against search criteria
diff --git a/security/keys/gc.c b/security/keys/gc.c
index 7d687b0962b1..f27223ea4578 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -218,8 +218,10 @@ static void key_garbage_collector(struct work_struct *work)
key = rb_entry(cursor, struct key, serial_node);
cursor = rb_next(cursor);
- if (refcount_read(&key->usage) == 0)
+ if (test_bit(KEY_FLAG_FINAL_PUT, &key->flags)) {
+ smp_mb(); /* Clobber key->user after FINAL_PUT seen. */
goto found_unreferenced_key;
+ }
if (unlikely(gc_state & KEY_GC_REAPING_DEAD_1)) {
if (key->type == key_gc_dead_keytype) {
diff --git a/security/keys/key.c b/security/keys/key.c
index 3d7d185019d3..7198cd2ac3a3 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -658,6 +658,8 @@ void key_put(struct key *key)
key->user->qnbytes -= key->quotalen;
spin_unlock_irqrestore(&key->user->lock, flags);
}
+ smp_mb(); /* key->user before FINAL_PUT set. */
+ set_bit(KEY_FLAG_FINAL_PUT, &key->flags);
schedule_work(&key_gc_work);
}
}
From: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
commit 4a1d3acd6ea86075e77fcc1188c3fc372833ba73 upstream.
The nft_counter uses two s64 counters for statistics. Those two are
protected by a seqcount to ensure that the 64bit variable is always
properly seen during updates even on 32bit architectures where the store
is performed by two writes. A side effect is that the two counters (bytes
and packets) are written and read together in the same window.
This can be replaced with u64_stats_t. write_seqcount_begin()/ end() is
replaced with u64_stats_update_begin()/ end() and behaves the same way
as with seqcount_t on 32bit architectures. Additionally there is a
preempt_disable on PREEMPT_RT to ensure that a reader does not preempt a
writer.
On 64bit architectures the macros are removed and the reads happen
without any retries. This also means that the reader can observe one
counter (bytes) from before the update and the other counter (packets)
from after it, but that is okay since there is no requirement to have
both counters from the same update window.
Convert the statistics to u64_stats_t. There is one optimisation:
nft_counter_do_init() and nft_counter_clone() allocate a new per-CPU
counter and assign a value to it. During this assignment preemption is
disabled, which is not needed because the counter is not yet exposed to
the system, so there cannot be another writer or reader. Therefore
disabling preemption is omitted and raw_cpu_ptr() is used to obtain a
pointer to a counter for the assignment.
Cc: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
Signed-off-by: Felix Moessbauer <felix.moessbauer(a)siemens.com>
---
I propose the backport, as this is a performance improvement. Note
that this is a bugfix on RT kernels.
net/netfilter/nft_counter.c | 90 +++++++++++++++++++------------------
1 file changed, 46 insertions(+), 44 deletions(-)
diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index 781d3a26f5df..8d19bd001277 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -8,7 +8,7 @@
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
-#include <linux/seqlock.h>
+#include <linux/u64_stats_sync.h>
#include <linux/netlink.h>
#include <linux/netfilter.h>
#include <linux/netfilter/nf_tables.h>
@@ -17,6 +17,11 @@
#include <net/netfilter/nf_tables_offload.h>
struct nft_counter {
+ u64_stats_t bytes;
+ u64_stats_t packets;
+};
+
+struct nft_counter_tot {
s64 bytes;
s64 packets;
};
@@ -25,25 +30,24 @@ struct nft_counter_percpu_priv {
struct nft_counter __percpu *counter;
};
-static DEFINE_PER_CPU(seqcount_t, nft_counter_seq);
+static DEFINE_PER_CPU(struct u64_stats_sync, nft_counter_sync);
static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
struct nft_regs *regs,
const struct nft_pktinfo *pkt)
{
+ struct u64_stats_sync *nft_sync;
struct nft_counter *this_cpu;
- seqcount_t *myseq;
local_bh_disable();
this_cpu = this_cpu_ptr(priv->counter);
- myseq = this_cpu_ptr(&nft_counter_seq);
-
- write_seqcount_begin(myseq);
+ nft_sync = this_cpu_ptr(&nft_counter_sync);
- this_cpu->bytes += pkt->skb->len;
- this_cpu->packets++;
+ u64_stats_update_begin(nft_sync);
+ u64_stats_add(&this_cpu->bytes, pkt->skb->len);
+ u64_stats_inc(&this_cpu->packets);
+ u64_stats_update_end(nft_sync);
- write_seqcount_end(myseq);
local_bh_enable();
}
@@ -66,17 +70,16 @@ static int nft_counter_do_init(const struct nlattr * const tb[],
if (cpu_stats == NULL)
return -ENOMEM;
- preempt_disable();
- this_cpu = this_cpu_ptr(cpu_stats);
+ this_cpu = raw_cpu_ptr(cpu_stats);
if (tb[NFTA_COUNTER_PACKETS]) {
- this_cpu->packets =
- be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_PACKETS]));
+ u64_stats_set(&this_cpu->packets,
+ be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_PACKETS])));
}
if (tb[NFTA_COUNTER_BYTES]) {
- this_cpu->bytes =
- be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_BYTES]));
+ u64_stats_set(&this_cpu->bytes,
+ be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_BYTES])));
}
- preempt_enable();
+
priv->counter = cpu_stats;
return 0;
}
@@ -104,40 +107,41 @@ static void nft_counter_obj_destroy(const struct nft_ctx *ctx,
}
static void nft_counter_reset(struct nft_counter_percpu_priv *priv,
- struct nft_counter *total)
+ struct nft_counter_tot *total)
{
+ struct u64_stats_sync *nft_sync;
struct nft_counter *this_cpu;
- seqcount_t *myseq;
local_bh_disable();
this_cpu = this_cpu_ptr(priv->counter);
- myseq = this_cpu_ptr(&nft_counter_seq);
+ nft_sync = this_cpu_ptr(&nft_counter_sync);
+
+ u64_stats_update_begin(nft_sync);
+ u64_stats_add(&this_cpu->packets, -total->packets);
+ u64_stats_add(&this_cpu->bytes, -total->bytes);
+ u64_stats_update_end(nft_sync);
- write_seqcount_begin(myseq);
- this_cpu->packets -= total->packets;
- this_cpu->bytes -= total->bytes;
- write_seqcount_end(myseq);
local_bh_enable();
}
static void nft_counter_fetch(struct nft_counter_percpu_priv *priv,
- struct nft_counter *total)
+ struct nft_counter_tot *total)
{
struct nft_counter *this_cpu;
- const seqcount_t *myseq;
u64 bytes, packets;
unsigned int seq;
int cpu;
memset(total, 0, sizeof(*total));
for_each_possible_cpu(cpu) {
- myseq = per_cpu_ptr(&nft_counter_seq, cpu);
+ struct u64_stats_sync *nft_sync = per_cpu_ptr(&nft_counter_sync, cpu);
+
this_cpu = per_cpu_ptr(priv->counter, cpu);
do {
- seq = read_seqcount_begin(myseq);
- bytes = this_cpu->bytes;
- packets = this_cpu->packets;
- } while (read_seqcount_retry(myseq, seq));
+ seq = u64_stats_fetch_begin(nft_sync);
+ bytes = u64_stats_read(&this_cpu->bytes);
+ packets = u64_stats_read(&this_cpu->packets);
+ } while (u64_stats_fetch_retry(nft_sync, seq));
total->bytes += bytes;
total->packets += packets;
@@ -148,7 +152,7 @@ static int nft_counter_do_dump(struct sk_buff *skb,
struct nft_counter_percpu_priv *priv,
bool reset)
{
- struct nft_counter total;
+ struct nft_counter_tot total;
nft_counter_fetch(priv, &total);
@@ -236,7 +240,7 @@ static int nft_counter_clone(struct nft_expr *dst, const struct nft_expr *src, g
struct nft_counter_percpu_priv *priv_clone = nft_expr_priv(dst);
struct nft_counter __percpu *cpu_stats;
struct nft_counter *this_cpu;
- struct nft_counter total;
+ struct nft_counter_tot total;
nft_counter_fetch(priv, &total);
@@ -244,11 +248,9 @@ static int nft_counter_clone(struct nft_expr *dst, const struct nft_expr *src, g
if (cpu_stats == NULL)
return -ENOMEM;
- preempt_disable();
- this_cpu = this_cpu_ptr(cpu_stats);
- this_cpu->packets = total.packets;
- this_cpu->bytes = total.bytes;
- preempt_enable();
+ this_cpu = raw_cpu_ptr(cpu_stats);
+ u64_stats_set(&this_cpu->packets, total.packets);
+ u64_stats_set(&this_cpu->bytes, total.bytes);
priv_clone->counter = cpu_stats;
return 0;
@@ -266,17 +268,17 @@ static void nft_counter_offload_stats(struct nft_expr *expr,
const struct flow_stats *stats)
{
struct nft_counter_percpu_priv *priv = nft_expr_priv(expr);
+ struct u64_stats_sync *nft_sync;
struct nft_counter *this_cpu;
- seqcount_t *myseq;
local_bh_disable();
this_cpu = this_cpu_ptr(priv->counter);
- myseq = this_cpu_ptr(&nft_counter_seq);
+ nft_sync = this_cpu_ptr(&nft_counter_sync);
- write_seqcount_begin(myseq);
- this_cpu->packets += stats->pkts;
- this_cpu->bytes += stats->bytes;
- write_seqcount_end(myseq);
+ u64_stats_update_begin(nft_sync);
+ u64_stats_add(&this_cpu->packets, stats->pkts);
+ u64_stats_add(&this_cpu->bytes, stats->bytes);
+ u64_stats_update_end(nft_sync);
local_bh_enable();
}
@@ -285,7 +287,7 @@ void nft_counter_init_seqcount(void)
int cpu;
for_each_possible_cpu(cpu)
- seqcount_init(per_cpu_ptr(&nft_counter_seq, cpu));
+ u64_stats_init(per_cpu_ptr(&nft_counter_sync, cpu));
}
struct nft_expr_type nft_counter_type;
--
2.49.0
From: Kang Yang <quic_kangyang(a)quicinc.com>
[ Upstream commit 95c38953cb1ecf40399a676a1f85dfe2b5780a9a ]
When running 'rmmod ath10k', ath10k_sdio_remove() will free the sdio
workqueue with destroy_workqueue(). But if CONFIG_INIT_ON_FREE_DEFAULT_ON
is set, a kernel panic will happen:
Call trace:
destroy_workqueue+0x1c/0x258
ath10k_sdio_remove+0x84/0x94
sdio_bus_remove+0x50/0x16c
device_release_driver_internal+0x188/0x25c
device_driver_detach+0x20/0x2c
This is because during 'rmmod ath10k', ath10k_sdio_remove() will call
ath10k_core_destroy() before destroy_workqueue(). wiphy_dev_release()
will finally be called in ath10k_core_destroy(). This function will free
struct cfg80211_registered_device *rdev and all its members, including
wiphy, dev and the pointer to the sdio workqueue. The pointer to the sdio
workqueue is then set to NULL due to CONFIG_INIT_ON_FREE_DEFAULT_ON.
After the device release, destroy_workqueue() uses the NULL pointer and
the kernel panic happens.
Call trace:
ath10k_sdio_remove
->ath10k_core_unregister
……
->ath10k_core_stop
->ath10k_hif_stop
->ath10k_sdio_irq_disable
->ath10k_hif_power_down
->del_timer_sync(&ar_sdio->sleep_timer)
->ath10k_core_destroy
->ath10k_mac_destroy
->ieee80211_free_hw
->wiphy_free
……
->wiphy_dev_release
->destroy_workqueue
Need to call destroy_workqueue() before ath10k_core_destroy(): free the
workqueue first, and then free the pointer to the workqueue via
ath10k_core_destroy(). This order matches the error path order in
ath10k_sdio_probe().
No work will be queued on the sdio workqueue between the time it is
destroyed and the time ath10k_core_destroy() is called. Based on the
call stack above, the reason is:
Only ath10k_sdio_sleep_timer_handler(), ath10k_sdio_hif_tx_sg() and
ath10k_sdio_irq_disable() queue work on the sdio workqueue.
The sleep timer is deleted before ath10k_core_destroy() in
ath10k_hif_power_down().
ath10k_sdio_irq_disable() is only called from ath10k_hif_stop().
ath10k_core_unregister() will call ath10k_hif_power_down() to stop the
hif bus, so ath10k_sdio_hif_tx_sg() won't be called anymore.
Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00189
Signed-off-by: Kang Yang <quic_kangyang(a)quicinc.com>
Tested-by: David Ruth <druth(a)chromium.org>
Reviewed-by: David Ruth <druth(a)chromium.org>
Link: https://patch.msgid.link/20241008022246.1010-1-quic_kangyang@quicinc.com
Signed-off-by: Jeff Johnson <quic_jjohnson(a)quicinc.com>
Signed-off-by: Alva Lan <alvalan9(a)foxmail.com>
---
drivers/net/wireless/ath/ath10k/sdio.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index 9d1b0890f310..418e40560f59 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -3,6 +3,7 @@
* Copyright (c) 2004-2011 Atheros Communications Inc.
* Copyright (c) 2011-2012,2017 Qualcomm Atheros, Inc.
* Copyright (c) 2016-2017 Erik Stromdahl <erik.stromdahl(a)gmail.com>
+ * Copyright (c) 2022-2024 Qualcomm Innovation Center, Inc. All rights reserved.
*/
#include <linux/module.h>
@@ -2649,9 +2650,9 @@ static void ath10k_sdio_remove(struct sdio_func *func)
netif_napi_del(&ar->napi);
- ath10k_core_destroy(ar);
-
destroy_workqueue(ar_sdio->workqueue);
+
+ ath10k_core_destroy(ar);
}
static const struct sdio_device_id ath10k_sdio_devices[] = {
--
2.34.1
Hi,
Do we have any rule regarding whether a patch that adds a new test case
in tools/testing/selftests/ can be considered for backport?
For example, consider commit 0a5d2efa3827 ("selftests/bpf: Add test case
for the freeing of bpf_timer"), it adds a test case for the issue
addressed in the same series -- commit 58f038e6d209 ("bpf: Cancel the
running bpf_timer through kworker for PREEMPT_RT"). The latter has been
backported to 6.12.y.
Would commit 0a5d2efa3827 be a worthwhile addition to 6.12.y as well?
IMO having such a test case added would be helpful to check whether the
backported fix really works (assuming someone is willing to do the extra
work of finding, testing, and sending such tests); yet it does not seem
to fit into the current stable kernel rule set of:
- It or an equivalent fix must already exist in Linux mainline (upstream).
- It must be obviously correct and tested.
- It cannot be bigger than 100 lines, with context.
- It must follow the Documentation/process/submitting-patches.rst rules.
- It must either fix a real bug that bothers people or just add a device ID.
Appreciate any clarification and/or feedback on this matter.
Thanks,
Shung-Hsi Yu
From: Sven Eckelmann <sven(a)narfation.org>
OGMv1 and OGMv2 packet receive processing was not only limited by the
number of bytes in the received packet but also by the node's maximum
aggregation packet size limit. But this limit is relevant for TX and not
for RX. It must not be enforced by batadv_(i)v_ogm_aggr_packet() to avoid
loss of information in case of a different limit for sender and receiver.
This has a minor side effect for B.A.T.M.A.N. IV because
batadv_iv_ogm_aggr_packet() is also used in the preprocessing for TX.
But since the aggregation code itself will not allow more than
BATADV_MAX_AGGREGATION_BYTES bytes, this check never triggered (in
this context) prior to its removal.
Cc: stable(a)vger.kernel.org
Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
Fixes: 9323158ef9f4 ("batman-adv: OGMv2 - implement originators logic")
Signed-off-by: Sven Eckelmann <sven(a)narfation.org>
Signed-off-by: Simon Wunderlich <sw(a)simonwunderlich.de>
---
net/batman-adv/bat_iv_ogm.c | 3 +--
net/batman-adv/bat_v_ogm.c | 3 +--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c
index 07ae5dd1f150..b12645949ae5 100644
--- a/net/batman-adv/bat_iv_ogm.c
+++ b/net/batman-adv/bat_iv_ogm.c
@@ -325,8 +325,7 @@ batadv_iv_ogm_aggr_packet(int buff_pos, int packet_len,
/* check if there is enough space for the optional TVLV */
next_buff_pos += ntohs(ogm_packet->tvlv_len);
- return (next_buff_pos <= packet_len) &&
- (next_buff_pos <= BATADV_MAX_AGGREGATION_BYTES);
+ return next_buff_pos <= packet_len;
}
/* send a batman ogm to a given interface */
diff --git a/net/batman-adv/bat_v_ogm.c b/net/batman-adv/bat_v_ogm.c
index e503ee0d896b..8f89ffe6020c 100644
--- a/net/batman-adv/bat_v_ogm.c
+++ b/net/batman-adv/bat_v_ogm.c
@@ -839,8 +839,7 @@ batadv_v_ogm_aggr_packet(int buff_pos, int packet_len,
/* check if there is enough space for the optional TVLV */
next_buff_pos += ntohs(ogm2_packet->tvlv_len);
- return (next_buff_pos <= packet_len) &&
- (next_buff_pos <= BATADV_MAX_AGGREGATION_BYTES);
+ return next_buff_pos <= packet_len;
}
/**
--
2.39.5
Hello:
This series was applied to netdev/net.git (main)
by Paolo Abeni <pabeni(a)redhat.com>:
On Fri, 14 Mar 2025 21:11:30 +0100 you wrote:
> Here are 3 unrelated fixes for the net tree.
>
> - Patch 1: fix data stream corruption when ending up not sending an
> ADD_ADDR.
>
> - Patch 2: fix missing getsockopt(IPV6_V6ONLY) support -- the set part
> is supported.
>
> [...]
Here is the summary with links:
- [net,1/3] mptcp: Fix data stream corruption in the address announcement
https://git.kernel.org/netdev/net/c/2c1f97a52cb8
- [net,2/3] mptcp: sockopt: fix getting IPV6_V6ONLY
(no matching commit)
- [net,3/3] mptcp: sockopt: fix getting freebind & transparent
(no matching commit)
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
From: Kuniyuki Iwashima <kuniyu(a)amazon.com>
[ Upstream commit e8c526f2bdf1845bedaf6a478816a3d06fa78b8f ]
Martin KaFai Lau reported use-after-free [0] in reqsk_timer_handler().
"""
We are seeing a use-after-free from a bpf prog attached to
trace_tcp_retransmit_synack. The program passes the req->sk to the
bpf_sk_storage_get_tracing kernel helper which does check for null
before using it.
"""
The commit 83fccfc3940c ("inet: fix potential deadlock in
reqsk_queue_unlink()") added timer_pending() to reqsk_queue_unlink() so
as not to call del_timer_sync() from reqsk_timer_handler(), but it
introduced a small race window.
Before the timer is called, expire_timers() calls detach_timer(timer, true)
to clear timer->entry.pprev and marks it as not pending.
If reqsk_queue_unlink() checks timer_pending() just after expire_timers()
calls detach_timer(), TCP will miss del_timer_sync(); the reqsk timer will
continue running and send multiple SYN+ACKs until it expires.
The reported UAF could happen if req->sk is close()d earlier than the timer
expiration, which is 63s by default.
The scenario would be
1. inet_csk_complete_hashdance() calls inet_csk_reqsk_queue_drop(),
but del_timer_sync() is missed
2. reqsk timer is executed and scheduled again
3. req->sk is accept()ed and reqsk_put() decrements rsk_refcnt, but
reqsk timer still has another one, and inet_csk_accept() does not
clear req->sk for non-TFO sockets
4. sk is close()d
5. reqsk timer is executed again, and BPF touches req->sk
Let's not use timer_pending() by passing the caller context to
__inet_csk_reqsk_queue_drop().
Note that reqsk timer is pinned, so the issue does not happen in most
use cases. [1]
[0]
BUG: KFENCE: use-after-free read in bpf_sk_storage_get_tracing+0x2e/0x1b0
Use-after-free read at 0x00000000a891fb3a (in kfence-#1):
bpf_sk_storage_get_tracing+0x2e/0x1b0
bpf_prog_5ea3e95db6da0438_tcp_retransmit_synack+0x1d20/0x1dda
bpf_trace_run2+0x4c/0xc0
tcp_rtx_synack+0xf9/0x100
reqsk_timer_handler+0xda/0x3d0
run_timer_softirq+0x292/0x8a0
irq_exit_rcu+0xf5/0x320
sysvec_apic_timer_interrupt+0x6d/0x80
asm_sysvec_apic_timer_interrupt+0x16/0x20
intel_idle_irq+0x5a/0xa0
cpuidle_enter_state+0x94/0x273
cpu_startup_entry+0x15e/0x260
start_secondary+0x8a/0x90
secondary_startup_64_no_verify+0xfa/0xfb
kfence-#1: 0x00000000a72cc7b6-0x00000000d97616d9, size=2376, cache=TCPv6
allocated by task 0 on cpu 9 at 260507.901592s:
sk_prot_alloc+0x35/0x140
sk_clone_lock+0x1f/0x3f0
inet_csk_clone_lock+0x15/0x160
tcp_create_openreq_child+0x1f/0x410
tcp_v6_syn_recv_sock+0x1da/0x700
tcp_check_req+0x1fb/0x510
tcp_v6_rcv+0x98b/0x1420
ipv6_list_rcv+0x2258/0x26e0
napi_complete_done+0x5b1/0x2990
mlx5e_napi_poll+0x2ae/0x8d0
net_rx_action+0x13e/0x590
irq_exit_rcu+0xf5/0x320
common_interrupt+0x80/0x90
asm_common_interrupt+0x22/0x40
cpuidle_enter_state+0xfb/0x273
cpu_startup_entry+0x15e/0x260
start_secondary+0x8a/0x90
secondary_startup_64_no_verify+0xfa/0xfb
freed by task 0 on cpu 9 at 260507.927527s:
rcu_core_si+0x4ff/0xf10
irq_exit_rcu+0xf5/0x320
sysvec_apic_timer_interrupt+0x6d/0x80
asm_sysvec_apic_timer_interrupt+0x16/0x20
cpuidle_enter_state+0xfb/0x273
cpu_startup_entry+0x15e/0x260
start_secondary+0x8a/0x90
secondary_startup_64_no_verify+0xfa/0xfb
Fixes: 83fccfc3940c ("inet: fix potential deadlock in reqsk_queue_unlink()")
Reported-by: Martin KaFai Lau <martin.lau(a)kernel.org>
Closes: https://lore.kernel.org/netdev/eb6684d0-ffd9-4bdc-9196-33f690c25824@linux.d…
Link: https://lore.kernel.org/netdev/b55e2ca0-42f2-4b7c-b445-6ffd87ca74a0@linux.d… [1]
Signed-off-by: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Reviewed-by: Martin KaFai Lau <martin.lau(a)kernel.org>
Link: https://patch.msgid.link/20241014223312.4254-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Bin Lan <bin.lan.cn(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Build test passed.
---
net/ipv4/inet_connection_sock.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 6ebe43b4d28f..723aab818b52 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -722,21 +722,31 @@ static bool reqsk_queue_unlink(struct request_sock *req)
found = __sk_nulls_del_node_init_rcu(req_to_sk(req));
spin_unlock(lock);
}
- if (timer_pending(&req->rsk_timer) && del_timer_sync(&req->rsk_timer))
- reqsk_put(req);
+
return found;
}
-bool inet_csk_reqsk_queue_drop(struct sock *sk, struct request_sock *req)
+static bool __inet_csk_reqsk_queue_drop(struct sock *sk,
+ struct request_sock *req,
+ bool from_timer)
{
bool unlinked = reqsk_queue_unlink(req);
+ if (!from_timer && timer_delete_sync(&req->rsk_timer))
+ reqsk_put(req);
+
if (unlinked) {
reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
reqsk_put(req);
}
+
return unlinked;
}
+
+bool inet_csk_reqsk_queue_drop(struct sock *sk, struct request_sock *req)
+{
+ return __inet_csk_reqsk_queue_drop(sk, req, false);
+}
EXPORT_SYMBOL(inet_csk_reqsk_queue_drop);
void inet_csk_reqsk_queue_drop_and_put(struct sock *sk, struct request_sock *req)
@@ -804,7 +814,8 @@ static void reqsk_timer_handler(struct timer_list *t)
return;
}
drop:
- inet_csk_reqsk_queue_drop_and_put(sk_listener, req);
+ __inet_csk_reqsk_queue_drop(sk_listener, req, true);
+ reqsk_put(req);
}
static void reqsk_queue_hash_req(struct request_sock *req,
--
2.34.1
Hi,
I prepared this series about 6 months ago, but never got around to
sending it in. In mainline, we got rid of remap_pfn_range() on the
io_uring side, and this just backports that change to 6.1-stable.
It eliminates issues with fragmented memory, and hence it'd be
nice to have in 6.1-stable as well.
--
Jens Axboe
When adding socket option support in MPTCP, both the get and set parts
are supposed to be implemented.
IPV6_V6ONLY support for the setsockopt part was added a while ago,
but it looks like the get part got forgotten. It should have been
present as a way to verify a setting has been set as expected, and not
to act differently from TCP or any other socket types.
Not supporting getsockopt(IPV6_V6ONLY) blocks some apps which want
to check the default value before doing extra actions. On Linux, the
default value is 0, but this can be changed with the net.ipv6.bindv6only
sysctl knob. On Windows, it is set to 1 by default. So supporting the
get part, like for all other socket options, is important.
Everything was in place to expose it; just the last step was missing.
Only new code is added to cover this specific getsockopt(), so that seems
safe.
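A small user-space sketch of the call this enables (a hypothetical test
program, assuming a kernel with MPTCP enabled and this patch applied):
```
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262		/* from linux/in.h */
#endif

int main(void)
{
	int v6only = -1;
	socklen_t len = sizeof(v6only);
	int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_MPTCP);

	if (fd < 0) {
		perror("socket(AF_INET6, SOCK_STREAM, IPPROTO_MPTCP)");
		return 1;
	}
	if (getsockopt(fd, SOL_IPV6, IPV6_V6ONLY, &v6only, &len) < 0) {
		perror("getsockopt(IPV6_V6ONLY)");	/* -EOPNOTSUPP without the patch */
		return 1;
	}
	/* 0 by default on Linux, unless net.ipv6.bindv6only is set */
	printf("IPV6_V6ONLY = %d\n", v6only);
	return 0;
}
```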
Fixes: c9b95a135987 ("mptcp: support IPV6_V6ONLY setsockopt")
Cc: stable(a)vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/550
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
net/mptcp/sockopt.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 505445a9598fafa1cde6d8f4a1a8635c3e778051..4b99eb796855e4578d14df90f9d1cc3f1cd5b8c7 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1430,6 +1430,20 @@ static int mptcp_getsockopt_v4(struct mptcp_sock *msk, int optname,
return -EOPNOTSUPP;
}
+static int mptcp_getsockopt_v6(struct mptcp_sock *msk, int optname,
+ char __user *optval, int __user *optlen)
+{
+ struct sock *sk = (void *)msk;
+
+ switch (optname) {
+ case IPV6_V6ONLY:
+ return mptcp_put_int_option(msk, optval, optlen,
+ sk->sk_ipv6only);
+ }
+
+ return -EOPNOTSUPP;
+}
+
static int mptcp_getsockopt_sol_mptcp(struct mptcp_sock *msk, int optname,
char __user *optval, int __user *optlen)
{
@@ -1469,6 +1483,8 @@ int mptcp_getsockopt(struct sock *sk, int level, int optname,
if (level == SOL_IP)
return mptcp_getsockopt_v4(msk, optname, optval, option);
+ if (level == SOL_IPV6)
+ return mptcp_getsockopt_v6(msk, optname, optval, option);
if (level == SOL_TCP)
return mptcp_getsockopt_sol_tcp(msk, optname, optval, option);
if (level == SOL_MPTCP)
--
2.48.1
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Turns out LNL+ and BMG+ no longer have the weird extra scanline
offset for HDMI outputs. Fix intel_crtc_scanline_offset()
accordingly so that scanline evasion/etc. works correctly on
HDMI outputs on these new platforms.
Cc: stable(a)vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
---
drivers/gpu/drm/i915/display/intel_vblank.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/display/intel_vblank.c b/drivers/gpu/drm/i915/display/intel_vblank.c
index 4efd4f7d497a..7b240ce681a0 100644
--- a/drivers/gpu/drm/i915/display/intel_vblank.c
+++ b/drivers/gpu/drm/i915/display/intel_vblank.c
@@ -222,7 +222,9 @@ int intel_crtc_scanline_offset(const struct intel_crtc_state *crtc_state)
* However if queried just before the start of vblank we'll get an
* answer that's slightly in the future.
*/
- if (DISPLAY_VER(display) == 2)
+ if (DISPLAY_VER(display) >= 20 || display->platform.battlemage)
+ return 1;
+ else if (DISPLAY_VER(display) == 2)
return -1;
else if (HAS_DDI(display) && intel_crtc_has_type(crtc_state, INTEL_OUTPUT_HDMI))
return 2;
--
2.45.3
This series has a bug fix for the early bootconsole as well as
some related efficiency improvements and cleanup.
The relevant code is subject to CONSOLE_DEBUG, which is only used
on Macs at present.
---
Changed since v1:
- Solved problem with line wrap while scrolling.
- Added two additional patches.
Finn Thain (3):
m68k: Fix lost column on framebuffer debug console
m68k: Avoid pointless recursion in debug console rendering
m68k: Remove unused "cursor home" code from debug console
arch/m68k/kernel/head.S | 73 +++++++++++++++++++++--------------------
1 file changed, 37 insertions(+), 36 deletions(-)
--
2.45.3
The dp_perform_128b_132b_channel_eq_done_sequence() function calls
dp_get_lane_status_and_lane_adjust() but lacks a dp_decide_lane_settings() call.
The dp_get_lane_status_and_lane_adjust() and dp_decide_lane_settings()
functions are essential for DisplayPort link training in the Linux kernel,
with the former retrieving lane status and adjustment needs, and the
latter determining optimal lane configurations. This omission risks
compatibility issues, particularly with lower-quality cables or displays,
as the system cannot dynamically adjust to hardware limitations,
potentially leading to failed connections. Add dp_decide_lane_settings()
to enable adaptive lane configuration.
Fixes: 630168a97314 ("drm/amd/display: move dp link training logic to link_dp_training")
Cc: stable(a)vger.kernel.org # 6.3+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
.../amd/display/dc/link/protocols/link_dp_training_128b_132b.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_training_128b_132b.c b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_training_128b_132b.c
index db87cfe37b5c..99aae3e43106 100644
--- a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_training_128b_132b.c
+++ b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_training_128b_132b.c
@@ -176,6 +176,8 @@ static enum link_training_result dp_perform_128b_132b_cds_done_sequence(
wait_time += lt_settings->cds_pattern_time;
status = dp_get_lane_status_and_lane_adjust(link, lt_settings, dpcd_lane_status,
&dpcd_lane_status_updated, dpcd_lane_adjust, DPRX);
+ dp_decide_lane_settings(lt_settings, dpcd_lane_adjust,
+ lt_settings->hw_lane_settings, lt_settings->dpcd_lane_settings);
if (status != DC_OK) {
result = LINK_TRAINING_ABORT;
} else if (dp_is_symbol_locked(lt_settings->link_settings.lane_count, dpcd_lane_status) &&
--
2.42.0.windows.2
From: Paul Aurich <paul(a)darkrain42.org>
If open_cached_dir() encounters an error parsing the lease from the
server, the error handling may race with receiving a lease break,
resulting in open_cached_dir() freeing the cfid while the queued work is
pending.
Update open_cached_dir() to drop refs rather than directly freeing the
cfid.
Have cached_dir_lease_break(), cfids_laundromat_worker(), and
invalidate_all_cached_dirs() clear has_lease immediately while still
holding cfids->cfid_list_lock, and then use this to also simplify the
reference counting in cfids_laundromat_worker() and
invalidate_all_cached_dirs().
Fixes this KASAN splat (which manually injects an error and lease break
in open_cached_dir()):
==================================================================
BUG: KASAN: slab-use-after-free in smb2_cached_lease_break+0x27/0xb0
Read of size 8 at addr ffff88811cc24c10 by task kworker/3:1/65
CPU: 3 UID: 0 PID: 65 Comm: kworker/3:1 Not tainted 6.12.0-rc6-g255cf264e6e5-dirty #87
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Workqueue: cifsiod smb2_cached_lease_break
Call Trace:
<TASK>
dump_stack_lvl+0x77/0xb0
print_report+0xce/0x660
kasan_report+0xd3/0x110
smb2_cached_lease_break+0x27/0xb0
process_one_work+0x50a/0xc50
worker_thread+0x2ba/0x530
kthread+0x17c/0x1c0
ret_from_fork+0x34/0x60
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 2464:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
open_cached_dir+0xa7d/0x1fb0
smb2_query_path_info+0x43c/0x6e0
cifs_get_fattr+0x346/0xf10
cifs_get_inode_info+0x157/0x210
cifs_revalidate_dentry_attr+0x2d1/0x460
cifs_getattr+0x173/0x470
vfs_statx_path+0x10f/0x160
vfs_statx+0xe9/0x150
vfs_fstatat+0x5e/0xc0
__do_sys_newfstatat+0x91/0xf0
do_syscall_64+0x95/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 2464:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x51/0x70
kfree+0x174/0x520
open_cached_dir+0x97f/0x1fb0
smb2_query_path_info+0x43c/0x6e0
cifs_get_fattr+0x346/0xf10
cifs_get_inode_info+0x157/0x210
cifs_revalidate_dentry_attr+0x2d1/0x460
cifs_getattr+0x173/0x470
vfs_statx_path+0x10f/0x160
vfs_statx+0xe9/0x150
vfs_fstatat+0x5e/0xc0
__do_sys_newfstatat+0x91/0xf0
do_syscall_64+0x95/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Last potentially related work creation:
kasan_save_stack+0x33/0x60
__kasan_record_aux_stack+0xad/0xc0
insert_work+0x32/0x100
__queue_work+0x5c9/0x870
queue_work_on+0x82/0x90
open_cached_dir+0x1369/0x1fb0
smb2_query_path_info+0x43c/0x6e0
cifs_get_fattr+0x346/0xf10
cifs_get_inode_info+0x157/0x210
cifs_revalidate_dentry_attr+0x2d1/0x460
cifs_getattr+0x173/0x470
vfs_statx_path+0x10f/0x160
vfs_statx+0xe9/0x150
vfs_fstatat+0x5e/0xc0
__do_sys_newfstatat+0x91/0xf0
do_syscall_64+0x95/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The buggy address belongs to the object at ffff88811cc24c00
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 16 bytes inside of
freed 1024-byte region [ffff88811cc24c00, ffff88811cc25000)
Cc: stable(a)vger.kernel.org
Signed-off-by: Paul Aurich <paul(a)darkrain42.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
[ Do not apply the change for cfids_laundromat_worker() since this
function and the related feature do not exist on 6.1.y. Update
open_cached_dir() according to the method of the upstream patch. ]
Signed-off-by: Cliff Liu <donghua.liu(a)windriver.com>
Signed-off-by: He Zhe <Zhe.He(a)windriver.com>
---
Verified the build test.
---
fs/smb/client/cached_dir.c | 39 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 23 deletions(-)
diff --git a/fs/smb/client/cached_dir.c b/fs/smb/client/cached_dir.c
index d09226c1ac90..d65d5fe5b8fe 100644
--- a/fs/smb/client/cached_dir.c
+++ b/fs/smb/client/cached_dir.c
@@ -320,17 +320,13 @@ int open_cached_dir(unsigned int xid, struct cifs_tcon *tcon,
/*
* We are guaranteed to have two references at this point.
* One for the caller and one for a potential lease.
- * Release the Lease-ref so that the directory will be closed
- * when the caller closes the cached handle.
+ * Release one here, and the second below.
*/
kref_put(&cfid->refcount, smb2_close_cached_fid);
}
if (rc) {
- if (cfid->is_open)
- SMB2_close(0, cfid->tcon, cfid->fid.persistent_fid,
- cfid->fid.volatile_fid);
- free_cached_dir(cfid);
- cfid = NULL;
+ cfid->has_lease = false;
+ kref_put(&cfid->refcount, smb2_close_cached_fid);
}
if (rc == 0) {
@@ -462,25 +458,24 @@ void invalidate_all_cached_dirs(struct cifs_tcon *tcon)
cfids->num_entries--;
cfid->is_open = false;
cfid->on_list = false;
- /* To prevent race with smb2_cached_lease_break() */
- kref_get(&cfid->refcount);
+ if (cfid->has_lease) {
+ /*
+ * The lease was never cancelled from the server,
+ * so steal that reference.
+ */
+ cfid->has_lease = false;
+ } else
+ kref_get(&cfid->refcount);
}
spin_unlock(&cfids->cfid_list_lock);
list_for_each_entry_safe(cfid, q, &entry, entry) {
list_del(&cfid->entry);
cancel_work_sync(&cfid->lease_break);
- if (cfid->has_lease) {
- /*
- * We lease was never cancelled from the server so we
- * need to drop the reference.
- */
- spin_lock(&cfids->cfid_list_lock);
- cfid->has_lease = false;
- spin_unlock(&cfids->cfid_list_lock);
- kref_put(&cfid->refcount, smb2_close_cached_fid);
- }
- /* Drop the extra reference opened above*/
+ /*
+ * Drop the ref-count from above, either the lease-ref (if there
+ * was one) or the extra one acquired.
+ */
kref_put(&cfid->refcount, smb2_close_cached_fid);
}
}
@@ -491,9 +486,6 @@ smb2_cached_lease_break(struct work_struct *work)
struct cached_fid *cfid = container_of(work,
struct cached_fid, lease_break);
- spin_lock(&cfid->cfids->cfid_list_lock);
- cfid->has_lease = false;
- spin_unlock(&cfid->cfids->cfid_list_lock);
kref_put(&cfid->refcount, smb2_close_cached_fid);
}
@@ -511,6 +503,7 @@ int cached_dir_lease_break(struct cifs_tcon *tcon, __u8 lease_key[16])
!memcmp(lease_key,
cfid->fid.lease_key,
SMB2_LEASE_KEY_SIZE)) {
+ cfid->has_lease = false;
cfid->time = 0;
/*
* We found a lease remove it from the list
--
2.43.0
Since the i and pool->chunk_size variables are of type 'u32',
their product is computed in 32 bits and can wrap around before being
cast to 'u64'.
This can lead to two different XDP buffers pointing to the same
memory area.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
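A minimal stand-alone demonstration of the wrap (hypothetical values;
assumes a 64-bit build):
```
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t i = 0x200000, chunk_size = 0x1000;	/* hypothetical values */

	/* The product is computed in 32 bits and wraps to 0 ... */
	uint64_t wrapped = i * chunk_size;
	/* ... unless one operand is widened first, as in the fix. */
	uint64_t correct = (uint64_t)i * chunk_size;

	printf("wrapped=%#llx correct=%#llx\n",
	       (unsigned long long)wrapped, (unsigned long long)correct);
	return 0;
}
```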
Fixes: 94033cd8e73b ("xsk: Optimize for aligned case")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov(a)infotecs.ru>
---
net/xdp/xsk_buff_pool.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 1f7975b49657..d158cb6dd391 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -105,7 +105,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
if (pool->unaligned)
pool->free_heads[i] = xskb;
else
- xp_init_xskb_addr(xskb, pool, i * pool->chunk_size);
+ xp_init_xskb_addr(xskb, pool, (u64)i * pool->chunk_size);
}
return pool;
--
2.39.5
On 3/11/25 4:00 PM, Jens Axboe wrote:
> On 3/11/25 8:59 AM, Milan Broz wrote:
>> On 3/11/25 4:03 AM, Shinichiro Kawasaki wrote:
>>>
>>> I created fix candidate patches to address the blktests nvme/039 failure [1].
>>> This may work for the failures Ondrej and Milan observe too, hopefully.
>>
>> Hi,
>>
>> I quickly tried to run the test with todays' mainline git and mentioned two patches:
>> https://lkml.kernel.org/linux-block/20250311024144.1762333-2-shinichiro.kaw…
>> https://lkml.kernel.org/linux-block/20250311024144.1762333-3-shinichiro.kaw…
>> and it looks like our SED Opal tests are fixed, no errors or warnings, thanks!
>>
>>> Jens, Alan, could you take a look in the patches and see if they make sense?
>>
>> Please, fix it before the 6.14 final, this could cause serious data corruption, at least
>> on systems using SED Opal drives.
>
> Fix is already queued up.
Hi Jens,
it seems the bug was "successfully" backported to stable 6.13.4+ and I do not see any fix
in the stable queue yet.
We were hit by it today, as all locked ranges for Opal devices now return
empty data instead of failing the read.
Please could you check what needs to be backported to 6.13 stable?
IMO these patches should be backported:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm…
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm…
Thanks,
Milan
Mounting a corrupted filesystem with a directory that contains a '.' dir
entry whose rec_len == block size results in an out-of-bounds read (later
on, when the corrupted directory is removed).
ext4_empty_dir() assumes every ext4 directory contains at least '.'
and '..' as directory entries in the first data block. It first loads
the '.' dir entry, performs sanity checks by calling ext4_check_dir_entry()
and then uses its rec_len member to compute the location of the '..' dir
entry (in ext4_next_entry). It assumes the '..' dir entry fits into the
same data block.
If the rec_len of '.' is precisely one block (4KB), it slips through the
sanity checks (it is considered the last directory entry in the data
block) and leaves "struct ext4_dir_entry_2 *de" pointing exactly past the
memory slot allocated to the data block. The following call to
ext4_check_dir_entry() on the new value of de then dereferences this
pointer, which results in an out-of-bounds memory access.
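For illustration, a simplified sketch of the rec_len walk (hypothetical
types and names; the kernel's ext4_next_entry() does the equivalent
arithmetic):
```c
struct dir_entry {            /* stand-in for struct ext4_dir_entry_2 */
	unsigned short rec_len;
	/* inode, name_len, name, ... omitted */
};

static struct dir_entry *next_entry(struct dir_entry *de)
{
	/* advance by the on-disk record length */
	return (struct dir_entry *)((char *)de + de->rec_len);
}

/*
 * With de == start of a 4096-byte block and de->rec_len == 4096,
 * next_entry(de) points one byte past the allocation; the subsequent
 * read of de->rec_len is the out-of-bounds access described above.
 */
```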
Fix this by extending __ext4_check_dir_entry() to check for '.' dir
entries that reach the end of the data block. Make sure to ignore the
phony dir entries for checksum (by checking name_len for non-zero).
Note: This is reported by KASAN as use-after-free in case another
structure was recently freed from the slot past the bound, but it is
really an OOB read.
This issue was found by syzkaller tool.
Call Trace:
[ 38.594108] BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x67e/0x710
[ 38.594649] Read of size 2 at addr ffff88802b41a004 by task syz-executor/5375
[ 38.595158]
[ 38.595288] CPU: 0 UID: 0 PID: 5375 Comm: syz-executor Not tainted 6.14.0-rc7 #1
[ 38.595298] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 38.595304] Call Trace:
[ 38.595308] <TASK>
[ 38.595311] dump_stack_lvl+0xa7/0xd0
[ 38.595325] print_address_description.constprop.0+0x2c/0x3f0
[ 38.595339] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595349] print_report+0xaa/0x250
[ 38.595359] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595368] ? kasan_addr_to_slab+0x9/0x90
[ 38.595378] kasan_report+0xab/0xe0
[ 38.595389] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595400] __ext4_check_dir_entry+0x67e/0x710
[ 38.595410] ext4_empty_dir+0x465/0x990
[ 38.595421] ? __pfx_ext4_empty_dir+0x10/0x10
[ 38.595432] ext4_rmdir.part.0+0x29a/0xd10
[ 38.595441] ? __dquot_initialize+0x2a7/0xbf0
[ 38.595455] ? __pfx_ext4_rmdir.part.0+0x10/0x10
[ 38.595464] ? __pfx___dquot_initialize+0x10/0x10
[ 38.595478] ? down_write+0xdb/0x140
[ 38.595487] ? __pfx_down_write+0x10/0x10
[ 38.595497] ext4_rmdir+0xee/0x140
[ 38.595506] vfs_rmdir+0x209/0x670
[ 38.595517] ? lookup_one_qstr_excl+0x3b/0x190
[ 38.595529] do_rmdir+0x363/0x3c0
[ 38.595537] ? __pfx_do_rmdir+0x10/0x10
[ 38.595544] ? strncpy_from_user+0x1ff/0x2e0
[ 38.595561] __x64_sys_unlinkat+0xf0/0x130
[ 38.595570] do_syscall_64+0x5b/0x180
[ 38.595583] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: ac27a0ec112a0 ("[PATCH] ext4: initial copy of files from ext3")
Signed-off-by: Jakub Acs <acsjakub(a)amazon.com>
Cc: "Theodore Ts'o" <tytso(a)mit.edu>
Cc: Andreas Dilger <adilger.kernel(a)dilger.ca>
Cc: linux-ext4(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: Mahmoud Adam <mngyadam(a)amazon.com>
Cc: stable(a)vger.kernel.org
Cc: security(a)kernel.org
---
If not fixed, this potentially leaks information from kernel data
structures allocated after the data block.
I based the check on the assumption that every ext4 directory has '.'
followed by at least one entry ('..') in the first data block.
(the code in ext4_empty_dir seems to operate on this assumption)
...and it is also supported by the claim in
https://www.kernel.org/doc/html/latest/filesystems/ext4/directory.html:
"By ext2 custom, the '.' and '..' entries must appear at the beginning of
this first block"
Please confirm that this is correct and there are no valid ext4
directories that have '.' as the last directory entry. If this
assumption is wrong, I would fix the caller rather than the callee.
Testing:
I booted the kernel in AL2023 EC2 Instance and ran ext4 tests from
git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git:
<setup with loop devices>
./check ext4/*
[..]
ext4/002 [not run] This test requires a valid $SCRATCH_LOGDEV
ext4/004 [not run] dump utility required, skipped this test
ext4/006 [not run] Couldn't find e2fuzz
ext4/029 [not run] This test requires a valid $SCRATCH_LOGDEV
ext4/030 [not run] mount /dev/loop1 with dax failed
ext4/031 [not run] mount /dev/loop1 with dax failed
ext4/047 [not run] mount /dev/loop1 with dax=always failed
ext4/055 [not run] fsgqa user not defined.
ext4/057 [not run] UUID ioctls are not supported by kernel.
Not run: ext4/002 ext4/004 ext4/006 ext4/029 ext4/030 ext4/031 ext4/047
ext4/055 ext4/057
Passed all 69 tests
(please let me know if any of the skipped tests need to be run)
Thanks,
Jakub
---
fs/ext4/dir.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 02d47a64e8d1..d157a6c0eff6 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -104,6 +104,9 @@ int __ext4_check_dir_entry(const char *function, unsigned int line,
else if (unlikely(le32_to_cpu(de->inode) >
le32_to_cpu(EXT4_SB(dir->i_sb)->s_es->s_inodes_count)))
error_msg = "inode out of bounds";
+ else if (unlikely(de->name_len > 0 && strcmp(".", de->name) == 0 &&
+ next_offset == size))
+ error_msg = "'.' directory cannot be the last in data block";
else
return 0;
--
2.47.1
From: Arthur Mongodin <amongodin(a)randorisec.fr>
Because of the size restriction in the TCP options space, the MPTCP
ADD_ADDR option is exclusive and cannot be sent with other MPTCP ones.
For this reason, in the linked mptcp_out_options structure, groups of
fields linked to different options are part of the same union.
There is a case where the mptcp_pm_add_addr_signal() function can modify
opts->addr but not end up sending an ADD_ADDR. Later on, back in
mptcp_established_options(), other options will be sent, but with
unexpected data written in other fields due to the union, e.g. in
opts->ext_copy. This could lead to data stream corruption in the next
packet.
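A hypothetical reduction of the aliasing problem (field names are
illustrative, not the real mptcp_out_options layout):
```c
#include <stdio.h>

struct out_options {
	union {                 /* fields for different options overlap */
		struct { unsigned int family, port; } addr;       /* ADD_ADDR */
		struct { unsigned long long data_seq; } ext_copy; /* DSS */
	};
};

int main(void)
{
	struct out_options opts;

	opts.ext_copy.data_seq = 0x1122334455667788ULL; /* DSS prepared */
	opts.addr.family = 2;   /* ADD_ADDR helper writes addr... */
	opts.addr.port = 0;     /* ...but the option is never sent */
	printf("data_seq now: 0x%llx\n", opts.ext_copy.data_seq);
	return 0;
}
```
Writing addr clobbers the low bytes of data_seq, which is exactly the
kind of corruption the intermediate variable below avoids.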
Using an intermediate variable prevents corrupting the previously
established DSS option. The assignment of the ADD_ADDR option
parameters is now done once we are sure this ADD_ADDR option can be set
in the packet, e.g. after having dropped other suboptions.
Fixes: 1bff1e43a30e ("mptcp: optimize out option generation")
Cc: stable(a)vger.kernel.org
Suggested-by: Paolo Abeni <pabeni(a)redhat.com>
Signed-off-by: Arthur Mongodin <amongodin(a)randorisec.fr>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
[ Matt: the commit message has been updated: long lines splits and some
clarifications. ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
net/mptcp/options.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index fd2de185bc939f8730e87a63ac02a015e610e99c..23949ae2a3a8db19d05c5c8373f45c885c3523ad 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -651,6 +651,7 @@ static bool mptcp_established_options_add_addr(struct sock *sk, struct sk_buff *
struct mptcp_sock *msk = mptcp_sk(subflow->conn);
bool drop_other_suboptions = false;
unsigned int opt_size = *size;
+ struct mptcp_addr_info addr;
bool echo;
int len;
@@ -659,7 +660,7 @@ static bool mptcp_established_options_add_addr(struct sock *sk, struct sk_buff *
*/
if (!mptcp_pm_should_add_signal(msk) ||
(opts->suboptions & (OPTION_MPTCP_MPJ_ACK | OPTION_MPTCP_MPC_ACK)) ||
- !mptcp_pm_add_addr_signal(msk, skb, opt_size, remaining, &opts->addr,
+ !mptcp_pm_add_addr_signal(msk, skb, opt_size, remaining, &addr,
&echo, &drop_other_suboptions))
return false;
@@ -672,7 +673,7 @@ static bool mptcp_established_options_add_addr(struct sock *sk, struct sk_buff *
else if (opts->suboptions & OPTION_MPTCP_DSS)
return false;
- len = mptcp_add_addr_len(opts->addr.family, echo, !!opts->addr.port);
+ len = mptcp_add_addr_len(addr.family, echo, !!addr.port);
if (remaining < len)
return false;
@@ -689,6 +690,7 @@ static bool mptcp_established_options_add_addr(struct sock *sk, struct sk_buff *
opts->ahmac = 0;
*size -= opt_size;
}
+ opts->addr = addr;
opts->suboptions |= OPTION_MPTCP_ADD_ADDR;
if (!echo) {
MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_ADDADDRTX);
--
2.48.1
When adding socket option support in MPTCP, both the get and set parts
are supposed to be implemented.
IP(V6)_FREEBIND and IP(V6)_TRANSPARENT support for the setsockopt part
was added a while ago, but it looks like the get part got forgotten. It
should have been there as a way to verify that a setting has been applied
as expected, and to avoid behaving differently from TCP or any other
socket type.
Everything was in place to expose it; only the last step was missing.
Only new code is added to cover these specific getsockopt() options,
which seems safe.
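A minimal user-space sketch of the set/get round-trip this enables
(error handling trimmed; the IPPROTO_MPTCP fallback define is an
assumption for older headers):
```c
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262
#endif

int main(void)
{
	int one = 1, val = 0;
	socklen_t len = sizeof(val);
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

	setsockopt(fd, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one));
	/* Before this patch, the get side returned EOPNOTSUPP. */
	if (getsockopt(fd, IPPROTO_IP, IP_FREEBIND, &val, &len) == 0)
		printf("IP_FREEBIND = %d\n", val);
	return 0;
}
```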
Fixes: c9406a23c116 ("mptcp: sockopt: add SOL_IP freebind & transparent options")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
net/mptcp/sockopt.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 4b99eb796855e4578d14df90f9d1cc3f1cd5b8c7..3caa0a9d3b3885ce6399570f2d98a2e8f103638d 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1419,6 +1419,12 @@ static int mptcp_getsockopt_v4(struct mptcp_sock *msk, int optname,
switch (optname) {
case IP_TOS:
return mptcp_put_int_option(msk, optval, optlen, READ_ONCE(inet_sk(sk)->tos));
+ case IP_FREEBIND:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet_test_bit(FREEBIND, sk));
+ case IP_TRANSPARENT:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet_test_bit(TRANSPARENT, sk));
case IP_BIND_ADDRESS_NO_PORT:
return mptcp_put_int_option(msk, optval, optlen,
inet_test_bit(BIND_ADDRESS_NO_PORT, sk));
@@ -1439,6 +1445,12 @@ static int mptcp_getsockopt_v6(struct mptcp_sock *msk, int optname,
case IPV6_V6ONLY:
return mptcp_put_int_option(msk, optval, optlen,
sk->sk_ipv6only);
+ case IPV6_TRANSPARENT:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet_test_bit(TRANSPARENT, sk));
+ case IPV6_FREEBIND:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet_test_bit(FREEBIND, sk));
}
return -EOPNOTSUPP;
--
2.48.1
Dear Greg,
I see that the review for 6.6.84-rc1 hasn't started yet, but as it was
already available on [1], our CI has already tried to build it for ia64
in the morning. Unfortunately that failed - I assume due to the
following **missing** upstream commit:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm…
[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.…
Build failure (see [2]):
```
[...]
CC drivers/video/fbdev/core/fbcon.o
drivers/video/fbdev/core/fbcon.c: In function 'fb_console_setup':
drivers/video/fbdev/core/fbcon.c:478:33: error: 'fb_center_logo' undeclared (first use in this function); did you mean 'fb_prepare_logo'?
478 | fb_center_logo = true;
| ^~~~~~~~~~~~~~
| fb_prepare_logo
drivers/video/fbdev/core/fbcon.c:478:33: note: each undeclared identifier is reported only once for each function it appears in
drivers/video/fbdev/core/fbcon.c:485:33: error: 'fb_logo_count' undeclared (first use in this function); did you mean 'file_count'?
485 | fb_logo_count = simple_strtol(options, &options, 0);
| ^~~~~~~~~~~~~
| file_count
make[8]: *** [scripts/Makefile.build:243: drivers/video/fbdev/core/fbcon.o] Error 1
[...]
```
[2]: https://github.com/linux-ia64/linux-stable-rc/actions/runs/13914712427/job/…
[3] (fa671e4f1556e2c18e5443f777a75ae041290068 upstream) includes
definitions for these variables, but they are guarded by CONFIG_LOGO,
while in `drivers/video/fbdev/core/fbcon.c` those variables are used
unguarded in 6.6.84-rc1. The above upstream commit (8887086) fixes
that, IIUC. Not build-tested yet, though.
[3]: https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.…
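For illustration, the shape of the breakage (a simplified sketch, not
the actual fbmem/fbcon sources):
```c
/* The backported commit declares these only when CONFIG_LOGO=y ... */
#ifdef CONFIG_LOGO
bool fb_center_logo;
int fb_logo_count;
#endif

/* ...while fbcon.c references them unconditionally, so a
 * CONFIG_LOGO=n build fails with "undeclared identifier". */
```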
Cheers,
Frank
Hi Sasha,
On Sat, Mar 15, 2025 at 09:34:40AM -0400, Sasha Levin wrote:
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
This fix is unfortunately buggy (schedule in atomic context). There is a
fixup patch which would be necessary alongside this patch:
f13409bb3f91 ("nvme-fc: rely on state transitions to handle connectivity
loss")
Thanks,
Daniel
> commit e0cdcd023334a757ae78a43d2fa8909dcc72ec56
> Author: Daniel Wagner <wagi(a)kernel.org>
> Date: Thu Jan 9 14:30:49 2025 +0100
>
> nvme-fc: do not ignore connectivity loss during connecting
We have received reports that cards can become corrupted due to the
aggressive PM support. Let's make a partial revert of the change that
enabled the feature.
Reported-by: David Owens <daowens01(a)gmail.com>
Reported-by: Romain Naour <romain.naour(a)smile.fr>
Reported-by: Robert Nelson <robertcnelson(a)gmail.com>
Tested-by: Robert Nelson <robertcnelson(a)gmail.com>
Fixes: 3edf588e7fe0 ("mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
---
drivers/mmc/host/sdhci-omap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/mmc/host/sdhci-omap.c b/drivers/mmc/host/sdhci-omap.c
index 54d795205fb4..26a9a8b5682a 100644
--- a/drivers/mmc/host/sdhci-omap.c
+++ b/drivers/mmc/host/sdhci-omap.c
@@ -1339,8 +1339,8 @@ static int sdhci_omap_probe(struct platform_device *pdev)
/* R1B responses is required to properly manage HW busy detection. */
mmc->caps |= MMC_CAP_NEED_RSP_BUSY;
- /* Allow card power off and runtime PM for eMMC/SD card devices */
- mmc->caps |= MMC_CAP_POWER_OFF_CARD | MMC_CAP_AGGRESSIVE_PM;
+ /* Enable SDIO card power off. */
+ mmc->caps |= MMC_CAP_POWER_OFF_CARD;
ret = sdhci_setup_host(host);
if (ret)
--
2.43.0
In mii_nway_restart() during the line:
bmcr = mii->mdio_read(mii->dev, mii->phy_id, MII_BMCR);
The code attempts to call mii->mdio_read which is ch9200_mdio_read().
ch9200_mdio_read() utilises a local buffer, which is initialised
with control_read():
unsigned char buff[2];
However buff is conditionally initialised inside control_read():
if (err == size) {
memcpy(data, buf, size);
}
If the condition of "err == size" is not met, then buff remains
uninitialised. Once this happens, the uninitialised buff is accessed
and returned during ch9200_mdio_read():
return (buff[0] | buff[1] << 8);
The problem stems from the fact that ch9200_mdio_read() ignores the
return value of control_read(), leading to an uninit access of buff.
To fix this, we should check the return value of control_read()
and return early on error.
Reported-by: syzbot <syzbot+3361c2d6f78a3e0892f9(a)syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=3361c2d6f78a3e0892f9
Tested-by: syzbot <syzbot+3361c2d6f78a3e0892f9(a)syzkaller.appspotmail.com>
Fixes: 4a476bd6d1d9 ("usbnet: New driver for QinHeng CH9200 devices")
Cc: stable(a)vger.kernel.org
Signed-off-by: Qasim Ijaz <qasdev00(a)gmail.com>
---
drivers/net/mii.c | 2 ++
drivers/net/usb/ch9200.c | 7 +++++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/net/mii.c b/drivers/net/mii.c
index 37bc3131d31a..e305bf0f1d04 100644
--- a/drivers/net/mii.c
+++ b/drivers/net/mii.c
@@ -464,6 +464,8 @@ int mii_nway_restart (struct mii_if_info *mii)
/* if autoneg is off, it's an error */
bmcr = mii->mdio_read(mii->dev, mii->phy_id, MII_BMCR);
+ if (bmcr < 0)
+ return bmcr;
if (bmcr & BMCR_ANENABLE) {
bmcr |= BMCR_ANRESTART;
diff --git a/drivers/net/usb/ch9200.c b/drivers/net/usb/ch9200.c
index f69d9b902da0..a206ffa76f1b 100644
--- a/drivers/net/usb/ch9200.c
+++ b/drivers/net/usb/ch9200.c
@@ -178,6 +178,7 @@ static int ch9200_mdio_read(struct net_device *netdev, int phy_id, int loc)
{
struct usbnet *dev = netdev_priv(netdev);
unsigned char buff[2];
+ int ret;
netdev_dbg(netdev, "%s phy_id:%02x loc:%02x\n",
__func__, phy_id, loc);
@@ -185,8 +186,10 @@ static int ch9200_mdio_read(struct net_device *netdev, int phy_id, int loc)
if (phy_id != 0)
return -ENODEV;
- control_read(dev, REQUEST_READ, 0, loc * 2, buff, 0x02,
- CONTROL_TIMEOUT_MS);
+ ret = control_read(dev, REQUEST_READ, 0, loc * 2, buff, 0x02,
+ CONTROL_TIMEOUT_MS);
+ if (ret < 0)
+ return ret;
return (buff[0] | buff[1] << 8);
}
--
2.39.5
v2: fix incorrect patch submission.
-o-
Hi Greg, Sasha,
This batch contains a backport fix for 6.6-stable.
The following list shows the backported patches, I am using original commit
IDs for reference:
1) 82cfd785c7b3 ("netfilter: nf_tables: bail out if stateful expression provides no .clone")
This is a stable dependency for the next patch.
2) 56fac3c36c8f ("netfilter: nf_tables: allow clone callbacks to sleep")
Without this series, ENOMEM is reported when loading a set with 100k elements
with counters.
Please, apply,
Thanks
Florian Westphal (1):
netfilter: nf_tables: allow clone callbacks to sleep
Pablo Neira Ayuso (1):
netfilter: nf_tables: bail out if stateful expression provides no
.clone
include/net/netfilter/nf_tables.h | 4 ++--
net/netfilter/nf_tables_api.c | 21 ++++++++++-----------
net/netfilter/nft_connlimit.c | 4 ++--
net/netfilter/nft_counter.c | 4 ++--
net/netfilter/nft_dynset.c | 2 +-
net/netfilter/nft_last.c | 4 ++--
net/netfilter/nft_limit.c | 14 ++++++++------
net/netfilter/nft_quota.c | 4 ++--
8 files changed, 29 insertions(+), 28 deletions(-)
--
2.30.2
Hi Greg, Sasha,
This batch contains a backport fix for 6.6-stable.
The following list shows the backported patches, I am using original commit
IDs for reference:
1) 82cfd785c7b3 ("netfilter: nf_tables: bail out if stateful expression provides no .clone")
This is a stable dependency for the next patch.
2) 56fac3c36c8f ("netfilter: nf_tables: allow clone callbacks to sleep")
Please, apply,
Thanks
without this fix, the default set expression is silently ignored when
used from dynamic sets.
Florian Westphal (1):
netfilter: nf_tables: allow clone callbacks to sleep
Pablo Neira Ayuso (1):
netfilter: nf_tables: use timestamp to check for set element timeout
include/net/netfilter/nf_tables.h | 20 ++++++++++++++++----
net/netfilter/nf_tables_api.c | 12 +++++++-----
net/netfilter/nft_connlimit.c | 4 ++--
net/netfilter/nft_counter.c | 4 ++--
net/netfilter/nft_dynset.c | 2 +-
net/netfilter/nft_last.c | 4 ++--
net/netfilter/nft_limit.c | 14 ++++++++------
net/netfilter/nft_quota.c | 4 ++--
net/netfilter/nft_set_hash.c | 8 +++++++-
net/netfilter/nft_set_pipapo.c | 18 +++++++++++-------
net/netfilter/nft_set_rbtree.c | 11 +++++++----
11 files changed, 65 insertions(+), 36 deletions(-)
--
2.30.2
Commit 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") introduced
a way to set the nid on all reserved regions.
But there is a corner case in which it will leave some regions with an
invalid nid. When memblock_set_node() doubles the array of
memblock.reserved, it may create a new reserved region before the current
position. The new region will be left with an invalid node id.
Repeat the process when this is detected.
Fixes: 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
CC: Mike Rapoport <rppt(a)kernel.org>
CC: Yajun Deng <yajun.deng(a)linux.dev>
CC: <stable(a)vger.kernel.org>
---
v2: move check out side of the loop
---
mm/memblock.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/mm/memblock.c b/mm/memblock.c
index 85442f1b7f14..0bae7547d2db 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2179,11 +2179,14 @@ static void __init memmap_init_reserved_pages(void)
struct memblock_region *region;
phys_addr_t start, end;
int nid;
+ unsigned long max_reserved;
/*
* set nid on all reserved pages and also treat struct
* pages for the NOMAP regions as PageReserved
*/
+repeat:
+ max_reserved = memblock.reserved.max;
for_each_mem_region(region) {
nid = memblock_get_region_node(region);
start = region->base;
@@ -2194,6 +2197,13 @@ static void __init memmap_init_reserved_pages(void)
memblock_set_node(start, region->size, &memblock.reserved, nid);
}
+ /*
+ * A changed 'max' means memblock.reserved has doubled its array,
+ * which may result in a new reserved region before the current
+ * 'start'. Now we should repeat the procedure to set its node id.
+ */
+ if (max_reserved != memblock.reserved.max)
+ goto repeat;
/*
* initialize struct pages for reserved regions that don't have
--
2.34.1
The patch titled
Subject: mm/hugetlb: move hugetlb_sysctl_init() to the __init section
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Marc Herbert <Marc.Herbert(a)linux.intel.com>
Subject: mm/hugetlb: move hugetlb_sysctl_init() to the __init section
Date: Wed, 19 Mar 2025 06:00:30 +0000
hugetlb_sysctl_init() is only invoked once by an __init function and is
merely a wrapper around another __init function, so there is no reason
to keep it.
Fixes the following warning when toning down some GCC inline options:
WARNING: modpost: vmlinux: section mismatch in reference:
hugetlb_sysctl_init+0x1b (section: .text) ->
__register_sysctl_init (section: .init.text)
Link: https://lkml.kernel.org/r/20250319060041.2737320-1-marc.herbert@linux.intel…
Signed-off-by: Marc Herbert <Marc.Herbert(a)linux.intel.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/hugetlb.c~mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section
+++ a/mm/hugetlb.c
@@ -4912,7 +4912,7 @@ static const struct ctl_table hugetlb_ta
},
};
-static void hugetlb_sysctl_init(void)
+static void __init hugetlb_sysctl_init(void)
{
register_sysctl_init("vm", hugetlb_table);
}
_
Patches currently in -mm which might be from Marc.Herbert(a)linux.intel.com are
mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch
The cdns3 driver has the same NCM deadlock as fixed in cdnsp by commit
58f2fcb3a845 ("usb: cdnsp: Fix deadlock issue during using NCM gadget").
Under PREEMPT_RT the deadlock can be readily triggered by heavy network
traffic, for example using "iperf --bidir" over an NCM ethernet link.
The deadlock occurs because the threaded interrupt handler gets
preempted by a softirq, but both are protected by the same spinlock.
Prevent the deadlock by disabling softirqs during the threaded irq handler.
cc: stable(a)vger.kernel.org
Fixes: 7733f6c32e36 ("usb: cdns3: Add Cadence USB3 DRD Driver")
Signed-off-by: Ralph Siemsen <ralph.siemsen(a)linaro.org>
---
v2 changes:
- move the fix up the call stack, as per discussion at
https://lore.kernel.org/linux-rt-devel/20250226082931.-XRIDa6D@linutronix.d…
---
drivers/usb/cdns3/cdns3-gadget.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/usb/cdns3/cdns3-gadget.c b/drivers/usb/cdns3/cdns3-gadget.c
index fd1beb10bba72..19101ff1cf1bd 100644
--- a/drivers/usb/cdns3/cdns3-gadget.c
+++ b/drivers/usb/cdns3/cdns3-gadget.c
@@ -1963,6 +1963,7 @@ static irqreturn_t cdns3_device_thread_irq_handler(int irq, void *data)
unsigned int bit;
unsigned long reg;
+ local_bh_disable();
spin_lock_irqsave(&priv_dev->lock, flags);
reg = readl(&priv_dev->regs->usb_ists);
@@ -2004,6 +2005,7 @@ static irqreturn_t cdns3_device_thread_irq_handler(int irq, void *data)
irqend:
writel(~0, &priv_dev->regs->ep_ien);
spin_unlock_irqrestore(&priv_dev->lock, flags);
+ local_bh_enable();
return ret;
}
---
base-commit: 4701f33a10702d5fc577c32434eb62adde0a1ae1
change-id: 20250312-rfs-cdns3-deadlock-697df5cad3ce
Best regards,
--
Ralph Siemsen <ralph.siemsen(a)linaro.org>
Backport this series to 6.1 & 6.6 because we get build errors with GCC 14
and OpenSSL 3 (or later):
certs/extract-cert.c: In function 'main':
certs/extract-cert.c:124:17: error: implicit declaration of function 'ENGINE_load_builtin_engines' [-Wimplicit-function-declaration]
124 | ENGINE_load_builtin_engines();
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
certs/extract-cert.c:126:21: error: implicit declaration of function 'ENGINE_by_id' [-Wimplicit-function-declaration]
126 | e = ENGINE_by_id("pkcs11");
| ^~~~~~~~~~~~
certs/extract-cert.c:126:19: error: assignment to 'ENGINE *' {aka 'struct engine_st *'} from 'int' makes pointer from integer without a cast [-Wint-conversion]
126 | e = ENGINE_by_id("pkcs11");
| ^
certs/extract-cert.c:128:21: error: implicit declaration of function 'ENGINE_init' [-Wimplicit-function-declaration]
128 | if (ENGINE_init(e))
| ^~~~~~~~~~~
certs/extract-cert.c:133:30: error: implicit declaration of function 'ENGINE_ctrl_cmd_string' [-Wimplicit-function-declaration]
133 | ERR(!ENGINE_ctrl_cmd_string(e, "PIN", key_pass, 0), "Set PKCS#11 PIN");
| ^~~~~~~~~~~~~~~~~~~~~~
certs/extract-cert.c:64:32: note: in definition of macro 'ERR'
64 | bool __cond = (cond); \
| ^~~~
certs/extract-cert.c:134:17: error: implicit declaration of function 'ENGINE_ctrl_cmd' [-Wimplicit-function-declaration]
134 | ENGINE_ctrl_cmd(e, "LOAD_CERT_CTRL", 0, &parms, NULL, 1);
| ^~~~~~~~~~~~~~~
In theory 5.4 & 5.10 & 5.15 also need this, but they need more effort
because the file paths are different.
The ENGINE interface has its limitations and has been superseded by the
PROVIDER API; it is deprecated since OpenSSL 3.0.
Some distros have started removing it from header files.
Update sign-file and extract-cert to use the PROVIDER API for OpenSSL
major version >= 3.
Tested on F39 with openssl-3.1.1, pkcs11-provider-0.5-2, openssl-pkcs11-0.4.12-4
and softhsm-2.6.1-5 by using same key/cert as PEM and PKCS11 and comparing that
the result is identical.
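A sketch of the provider-based key loading this switches to (an assumed
helper shape, not the exact backported code):
```c
#include <openssl/evp.h>
#include <openssl/provider.h>
#include <openssl/store.h>

/* Fetch a private key through the pkcs11 provider instead of the
 * deprecated ENGINE API; 'uri' is a "pkcs11:" URI. */
static EVP_PKEY *load_key_via_provider(const char *uri)
{
	OSSL_STORE_CTX *store;
	OSSL_STORE_INFO *info;
	EVP_PKEY *key = NULL;

	if (!OSSL_PROVIDER_try_load(NULL, "pkcs11", 1))
		return NULL;

	store = OSSL_STORE_open(uri, NULL, NULL, NULL, NULL);
	if (!store)
		return NULL;

	while (!key && (info = OSSL_STORE_load(store))) {
		if (OSSL_STORE_INFO_get_type(info) == OSSL_STORE_INFO_PKEY)
			key = OSSL_STORE_INFO_get1_PKEY(info);
		OSSL_STORE_INFO_free(info);
	}
	OSSL_STORE_close(store);
	return key;
}
```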
V1 -> V2:
Add upstream commit id.
Jan Stancek (3):
sign-file,extract-cert: move common SSL helper functions to a header
sign-file,extract-cert: avoid using deprecated ERR_get_error_line()
sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3
Signed-off-by: Jan Stancek <jstancek(a)redhat.com>
Signed-off-by: Huacai Chen <chenhuacai(a)loongson.cn>
---
MAINTAINERS | 1 +
certs/Makefile | 2 +-
certs/extract-cert.c | 138 +++++++++++++++++++++++--------------------
scripts/sign-file.c | 134 +++++++++++++++++++++--------------------
scripts/ssl-common.h | 32 ++++++++++
5 files changed, 178 insertions(+), 129 deletions(-)
create mode 100644 scripts/ssl-common.h
---
2.27.0
The call to read_word_at_a_time() in sized_strscpy() is problematic
with MTE because it may trigger a tag check fault when reading
across a tag granule (16 bytes) boundary. To make this code
MTE compatible, let's start using load_unaligned_zeropad()
on architectures where it is available (i.e. architectures that
define CONFIG_DCACHE_WORD_ACCESS). Because load_unaligned_zeropad()
takes care of page boundaries as well as tag granule boundaries, also
disable the code that prevents crossing page boundaries when
load_unaligned_zeropad() is used.
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Link: https://linux-review.googlesource.com/id/If4b22e43b5a4ca49726b4bf98ada827fd…
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Cc: stable(a)vger.kernel.org
---
v2:
- new approach
lib/string.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/lib/string.c b/lib/string.c
index eb4486ed40d25..b632c71df1a50 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -119,6 +119,7 @@ ssize_t sized_strscpy(char *dest, const char *src, size_t count)
if (count == 0 || WARN_ON_ONCE(count > INT_MAX))
return -E2BIG;
+#ifndef CONFIG_DCACHE_WORD_ACCESS
#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
/*
* If src is unaligned, don't cross a page boundary,
@@ -133,12 +134,14 @@ ssize_t sized_strscpy(char *dest, const char *src, size_t count)
/* If src or dest is unaligned, don't do word-at-a-time. */
if (((long) dest | (long) src) & (sizeof(long) - 1))
max = 0;
+#endif
#endif
/*
- * read_word_at_a_time() below may read uninitialized bytes after the
- * trailing zero and use them in comparisons. Disable this optimization
- * under KMSAN to prevent false positive reports.
+ * load_unaligned_zeropad() or read_word_at_a_time() below may read
+ * uninitialized bytes after the trailing zero and use them in
+ * comparisons. Disable this optimization under KMSAN to prevent
+ * false positive reports.
*/
if (IS_ENABLED(CONFIG_KMSAN))
max = 0;
@@ -146,7 +149,11 @@ ssize_t sized_strscpy(char *dest, const char *src, size_t count)
while (max >= sizeof(unsigned long)) {
unsigned long c, data;
+#ifdef CONFIG_DCACHE_WORD_ACCESS
+ c = load_unaligned_zeropad(src+res);
+#else
c = read_word_at_a_time(src+res);
+#endif
if (has_zero(c, &data, &constants)) {
data = prep_zero_mask(c, data, &constants);
data = create_zero_mask(data);
--
2.49.0.395.g12beb8f557-goog
We have recently seen report of lockdep circular lock dependency warnings
on platforms like skykale and kabylake:
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
------------------------------------------------------
swapper/0/1 is trying to acquire lock:
ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3},
at: iommu_probe_device+0x1d/0x70
but task is already holding lock:
ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3},
at: intel_iommu_init+0xe75/0x11f0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&device->physical_node_lock){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
intel_iommu_init+0xe75/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #5 (dmar_global_lock){++++}-{3:3}:
down_read+0x43/0x1d0
enable_drhd_fault_handling+0x21/0x110
cpuhp_invoke_callback+0x4c6/0x870
cpuhp_issue_call+0xbf/0x1f0
__cpuhp_setup_state_cpuslocked+0x111/0x320
__cpuhp_setup_state+0xb0/0x220
irq_remap_enable_fault_handling+0x3f/0xa0
apic_intr_mode_init+0x5c/0x110
x86_late_time_init+0x24/0x40
start_kernel+0x895/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__cpuhp_setup_state_cpuslocked+0x67/0x320
__cpuhp_setup_state+0xb0/0x220
page_alloc_init_cpuhp+0x2d/0x60
mm_core_init+0x18/0x2c0
start_kernel+0x576/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
__cpuhp_state_add_instance+0x4f/0x220
iova_domain_init_rcaches+0x214/0x280
iommu_setup_dma_ops+0x1a4/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_setup_dma_ops+0x16b/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (&group->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__iommu_probe_device+0x24c/0x4e0
probe_iommu_group+0x2b/0x50
bus_for_each_dev+0x7d/0xe0
iommu_device_register+0xe1/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
__lock_acquire+0x1637/0x2810
lock_acquire+0xc9/0x300
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_probe_device+0x1d/0x70
intel_iommu_init+0xe90/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
other info that might help us debug this:
Chain exists of:
iommu_probe_device_lock --> dmar_global_lock -->
&device->physical_node_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&device->physical_node_lock);
lock(dmar_global_lock);
lock(&device->physical_node_lock);
lock(iommu_probe_device_lock);
*** DEADLOCK ***
This driver uses a global lock to protect the list of enumerated DMA
remapping units. It is necessary due to the driver's support for dynamic
addition and removal of remapping units at runtime.
Two distinct code paths require iteration over this remapping unit list:
- Device registration and probing: the driver iterates the list to
register each remapping unit with the upper layer IOMMU framework
and subsequently probe the devices managed by that unit.
- Global configuration: Upper layer components may also iterate the list
to apply configuration changes.
The lock acquisition order between these two code paths was reversed. This
caused lockdep warnings, indicating a risk of deadlock. Fix this warning
by releasing the global lock before invoking upper layer interfaces for
device registration.
Fixes: b150654f74bf ("iommu/vt-d: Fix suspicious RCU usage")
Closes: https://lore.kernel.org/linux-iommu/SJ1PR11MB612953431F94F18C954C4A9CB9D32@…
Cc: stable(a)vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com>
---
drivers/iommu/intel/iommu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 85aa66ef4d61..ec2f385ae25b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3049,6 +3049,7 @@ static int __init probe_acpi_namespace_devices(void)
if (dev->bus != &acpi_bus_type)
continue;
+ up_read(&dmar_global_lock);
adev = to_acpi_device(dev);
mutex_lock(&adev->physical_node_lock);
list_for_each_entry(pn,
@@ -3058,6 +3059,7 @@ static int __init probe_acpi_namespace_devices(void)
break;
}
mutex_unlock(&adev->physical_node_lock);
+ down_read(&dmar_global_lock);
if (ret)
return ret;
--
2.43.0
The optimized strscpy() and dentry_string_cmp() routines will read 8
unaligned bytes at a time via the function read_word_at_a_time(), but
this is incompatible with MTE which will fault on a partially invalid
read. The attributes on read_word_at_a_time() that disable KASAN are
invisible to the CPU so they have no effect on MTE. Let's fix the
bug for now by disabling the optimizations if the kernel is built
with HW tag-based KASAN and consider improvements for followup changes.
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Link: https://linux-review.googlesource.com/id/If4b22e43b5a4ca49726b4bf98ada827fd…
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Cc: stable(a)vger.kernel.org
---
fs/dcache.c | 2 +-
lib/string.c | 3 ++-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index e3634916ffb93..71f0830ac5e69 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -223,7 +223,7 @@ fs_initcall(init_fs_dcache_sysctls);
* Compare 2 name strings, return 0 if they match, otherwise non-zero.
* The strings are both count bytes long, and count is non-zero.
*/
-#ifdef CONFIG_DCACHE_WORD_ACCESS
+#if defined(CONFIG_DCACHE_WORD_ACCESS) && !defined(CONFIG_KASAN_HW_TAGS)
#include <asm/word-at-a-time.h>
/*
diff --git a/lib/string.c b/lib/string.c
index eb4486ed40d25..9a43a3824d0d7 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -119,7 +119,8 @@ ssize_t sized_strscpy(char *dest, const char *src, size_t count)
if (count == 0 || WARN_ON_ONCE(count > INT_MAX))
return -E2BIG;
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && \
+ !defined(CONFIG_KASAN_HW_TAGS)
/*
* If src is unaligned, don't cross a page boundary,
* since we don't know if the next page is mapped.
--
2.49.0.rc0.332.g42c0ae87b1-goog
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 966944f3711665db13e214fef6d02982c49bb972
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031632-divorcee-duly-868e@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 966944f3711665db13e214fef6d02982c49bb972 Mon Sep 17 00:00:00 2001
From: Mitchell Levy <levymitchell0(a)gmail.com>
Date: Fri, 7 Mar 2025 15:27:00 -0800
Subject: [PATCH] rust: lockdep: Remove support for dynamically allocated
LockClassKeys
Currently, dynamically allocated LockClassKeys can be used from the Rust
side without having them registered. This is a soundness issue, so
remove them.
Fixes: 6ea5aa08857a ("rust: sync: introduce `LockClassKey`")
Suggested-by: Alice Ryhl <aliceryhl(a)google.com>
Signed-off-by: Mitchell Levy <levymitchell0(a)gmail.com>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Benno Lossin <benno.lossin(a)proton.me>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20250307232717.1759087-11-boqun.feng@gmail.com
diff --git a/rust/kernel/sync.rs b/rust/kernel/sync.rs
index 3498fb344dc9..16eab9138b2b 100644
--- a/rust/kernel/sync.rs
+++ b/rust/kernel/sync.rs
@@ -30,28 +30,20 @@
unsafe impl Sync for LockClassKey {}
impl LockClassKey {
- /// Creates a new lock class key.
- pub const fn new() -> Self {
- Self(Opaque::uninit())
- }
-
pub(crate) fn as_ptr(&self) -> *mut bindings::lock_class_key {
self.0.get()
}
}
-impl Default for LockClassKey {
- fn default() -> Self {
- Self::new()
- }
-}
-
/// Defines a new static lock class and returns a pointer to it.
#[doc(hidden)]
#[macro_export]
macro_rules! static_lock_class {
() => {{
- static CLASS: $crate::sync::LockClassKey = $crate::sync::LockClassKey::new();
+ static CLASS: $crate::sync::LockClassKey =
+ // SAFETY: lockdep expects uninitialized memory when it's handed a statically allocated
+ // lock_class_key
+ unsafe { ::core::mem::MaybeUninit::uninit().assume_init() };
&CLASS
}};
}
Hi folks,
This series fixes support for correctly saving and restoring the fltcon0
and fltcon1 registers on gs101 for non-alive banks, where the fltcon
register is not at a fixed offset (unlike on previous SoCs).
This is done by adding an eint_fltcon_offset and providing GS101
specific pin macros that take an additional parameter (similar to how
exynosautov920 handles its eint_con_offset).
Additionally, the SoC-specific suspend and resume callbacks are
refactored so that each SoC variant has its own callback containing
the peculiarities for that SoC.
Finally, support for filter selection on alive banks is added; this is
currently only enabled for gs101. The code path can be exercised using
`echo mem > /sys/power/state`.
regards,
Peter
To: Krzysztof Kozlowski <krzk(a)kernel.org>
To: Sylwester Nawrocki <s.nawrocki(a)samsung.com>
To: Alim Akhtar <alim.akhtar(a)samsung.com>
To: Linus Walleij <linus.walleij(a)linaro.org>
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-samsung-soc(a)vger.kernel.org
Cc: linux-gpio(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: andre.draszik(a)linaro.org
Cc: tudor.ambarus(a)linaro.org
Cc: willmcvicker(a)google.com
Cc: semen.protsenko(a)linaro.org
Cc: kernel-team(a)android.com
Cc: jaewon02.kim(a)samsung.com
Signed-off-by: Peter Griffin <peter.griffin(a)linaro.org>
---
Changes in v5:
- Split drvdata suspend & resume callbacks into a dedicated patch (Krzysztof)
- Add comment about stable dependency (Krzysztof)
- Add back in {} braces (Krzysztof)
- Link to v4: https://lore.kernel.org/r/20250307-pinctrl-fltcon-suspend-v4-0-2d775e486036…
Changes in v4:
- save->eint_fltcon1 is an argument to pr_debug(), not readl() change alignment accordingly (Andre)
- Link to v3: https://lore.kernel.org/r/20250306-pinctrl-fltcon-suspend-v3-0-f9ab4ff6a24e…
Changes in v3:
- Ensure EXYNOS_FLTCON_DIGITAL bit is cleared (Andre)
- Make it obvious that exynos_eint_set_filter() is conditional on bank type (Andre)
- Make it obvious exynos_set_wakeup() is conditional on bank type (Andre)
- Align style where the '+' is placed first (Andre)
- Remove unnecessary braces (Andre)
- Link to v2: https://lore.kernel.org/r/20250301-pinctrl-fltcon-suspend-v2-0-a7eef9bb443b…
Changes in v2:
- Remove eint_flt_selectable bool as it can be deduced from EINT_TYPE_WKUP (Peter)
- Move filter config register comment to header file (Andre)
- Rename EXYNOS_FLTCON_DELAY to EXYNOS_FLTCON_ANALOG (Andre)
- Remove misleading old comment (Andre)
- Refactor exynos_eint_update_flt_reg() into a loop (Andre)
- Split refactor of suspend/resume callbacks & gs101 parts into separate patches (Andre)
- Link to v1: https://lore.kernel.org/r/20250120-pinctrl-fltcon-suspend-v1-0-e77900b2a854…
---
Peter Griffin (5):
pinctrl: samsung: add support for eint_fltcon_offset
pinctrl: samsung: refactor drvdata suspend & resume callbacks
pinctrl: samsung: add dedicated SoC eint suspend/resume callbacks
pinctrl: samsung: add gs101 specific eint suspend/resume callbacks
pinctrl: samsung: Add filter selection support for alive bank on gs101
drivers/pinctrl/samsung/pinctrl-exynos-arm64.c | 150 ++++++-------
drivers/pinctrl/samsung/pinctrl-exynos.c | 294 +++++++++++++++----------
drivers/pinctrl/samsung/pinctrl-exynos.h | 50 ++++-
drivers/pinctrl/samsung/pinctrl-samsung.c | 12 +-
drivers/pinctrl/samsung/pinctrl-samsung.h | 12 +-
5 files changed, 319 insertions(+), 199 deletions(-)
---
base-commit: 0761652a3b3b607787aebc386d412b1d0ae8008c
change-id: 20250120-pinctrl-fltcon-suspend-2333a137c4d4
Best regards,
--
Peter Griffin <peter.griffin(a)linaro.org>
Hello,
New build issue found on stable-rc/linux-6.13.y:
---
format specifies type 'unsigned long' but the argument has type
'size_t' (aka 'unsigned int') [-Werror,-Wformat] in
drivers/gpu/drm/xe/xe_guc_ct.o (drivers/gpu/drm/xe/xe_guc_ct.c)
[logspec:kbuild,kbuild.compiler.error]
---
- dashboard: https://d.kernelci.org/i/maestro:654a7499ce48b09c6748b92c5ad90055057a67a1
- giturl: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
- commit HEAD: c025445840beab62deab3af5afa8429f63e8f186
Log excerpt:
=====================================================
drivers/gpu/drm/xe/xe_guc_ct.c:1703:43: error: format specifies type
'unsigned long' but the argument has type 'size_t' (aka 'unsigned
int') [-Werror,-Wformat]
1703 | drm_printf(p, "[CTB].length: 0x%lx\n",
snapshot->ctb_size);
| ~~~
^~~~~~~~~~~~~~~~~~
| %zx
1 error generated.
=====================================================
# Builds where the incident occurred:
## i386_defconfig+allmodconfig+CONFIG_FRAME_WARN=2048 on (i386):
- compiler: clang-17
- dashboard: https://d.kernelci.org/build/maestro:67d98d1028b1441c081c5d19
#kernelci issue maestro:654a7499ce48b09c6748b92c5ad90055057a67a1
Reported-by: kernelci.org bot <bot(a)kernelci.org>
--
This is an experimental report format. Please send feedback in!
Talk to us at kernelci(a)lists.linux.dev
Made with love by the KernelCI team - https://kernelci.org
Hello,
New build issue found on stable-rc/linux-5.4.y:
---
clang: error: linker command failed with exit code 1 (use -v to see
invocation) in samples/seccomp/bpf-fancy (scripts/Makefile.host:116)
[logspec:kbuild,kbuild.other]
---
- dashboard: https://d.kernelci.org/i/maestro:9b282409ffe9399386349927812ed439dcc91837
- giturl: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
- commit HEAD: bbf51067f14852464d75398076a963edbe1e5cc0
Log excerpt:
=====================================================
.o
Generating X.509 key generation config
/usr/bin/ld: cannot find crtbeginS.o: No such file or directory
/usr/bin/ld: cannot find -lgcc: No such file or directory
/usr/bin/ld: cannot find -lgcc_s: No such file or directory
clang: error: linker command failed with exit code 1 (use -v to see invocation)
=====================================================
# Builds where the incident occurred:
## i386_defconfig+allmodconfig+CONFIG_FRAME_WARN=2048 on (i386):
- compiler: clang-17
- dashboard: https://d.kernelci.org/build/maestro:67d9892328b1441c081c5044
#kernelci issue maestro:9b282409ffe9399386349927812ed439dcc91837
Reported-by: kernelci.org bot <bot(a)kernelci.org>
--
This is an experimental report format. Please send feedback in!
Talk to us at kernelci(a)lists.linux.dev
Made with love by the KernelCI team - https://kernelci.org
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x a18a69bbec083093c3bfebaec13ce0b4c6b2af7e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031846-septic-french-4efb@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a18a69bbec083093c3bfebaec13ce0b4c6b2af7e Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch(a)lst.de>
Date: Fri, 30 Aug 2024 15:37:00 -0700
Subject: [PATCH] xfs: use the recalculated transaction reservation in
xfs_growfs_rt_bmblock
After going great length to calculate the transaction reservation for
the new geometry, we should also use it to allocate the transaction it
was calculated for.
Fixes: 578bd4ce7100 ("xfs: recompute growfsrtfree transaction reservation while growing rt volume")
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Darrick J. Wong <djwong(a)kernel.org>
Signed-off-by: Darrick J. Wong <djwong(a)kernel.org>
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index d290749b0304..a9f08d96f1fe 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -730,10 +730,12 @@ xfs_growfs_rt_bmblock(
xfs_rtsummary_blockcount(mp, nmp->m_rsumlevels,
nmp->m_sb.sb_rbmblocks));
- /* recompute growfsrt reservation from new rsumsize */
+ /*
+ * Recompute the growfsrt reservation from the new rsumsize, so that the
+ * transaction below use the new, potentially larger value.
+ * */
xfs_trans_resv_calc(nmp, &nmp->m_resv);
-
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtfree, 0, 0, 0,
+ error = xfs_trans_alloc(mp, &M_RES(nmp)->tr_growrtfree, 0, 0, 0,
&args.tp);
if (error)
goto out_free;
Hello,
New build issue found on stable-rc/linux-6.1.y:
---
implicit declaration of function ‘vunmap’; did you mean ‘kunmap’?
[-Werror=implicit-function-declaration] in io_uring/io_uring.o
(io_uring/io_uring.c) [logspec:kbuild,kbuild.compiler.error]
---
- dashboard: https://d.kernelci.org/i/maestro:cbcc52388974e489070975f640bce79475aeb50a
- giturl: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
- commit HEAD: 463226a52e45030747cdf2c689bf719e1ab21055
Log excerpt:
=====================================================
io_uring/io_uring.c:2540:17: error: implicit declaration of function
‘vunmap’; did you mean ‘kunmap’?
[-Werror=implicit-function-declaration]
2540 | vunmap(ptr);
| ^~~~~~
| kunmap
io_uring/io_uring.c: In function ‘io_mem_alloc_single’:
io_uring/io_uring.c:2588:15: error: implicit declaration of function
‘vmap’; did you mean ‘kmap’? [-Werror=implicit-function-declaration]
2588 | ret = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
| ^~~~
| kmap
io_uring/io_uring.c:2588:37: error: ‘VM_MAP’ undeclared (first use in
this function); did you mean ‘VM_MTE’?
2588 | ret = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
| ^~~~~~
| VM_MTE
io_uring/io_uring.c:2588:37: note: each undeclared identifier is
reported only once for each function it appears in
CC fs/read_write.o
CC kernel/bpf/core.o
cc1: some warnings being treated as errors
=====================================================
# Builds where the incident occurred:
## 32r2el_defconfig on (mips):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98b2228b1441c081c54ba
#kernelci issue maestro:cbcc52388974e489070975f640bce79475aeb50a
Reported-by: kernelci.org bot <bot(a)kernelci.org>
--
This is an experimental report format. Please send feedback in!
Talk to us at kernelci(a)lists.linux.dev
Made with love by the KernelCI team - https://kernelci.org
Hello,
New build issue found on stable-rc/linux-6.6.y:
---
‘fb_center_logo’ undeclared (first use in this function); did you
mean ‘fb_prepare_logo’? in drivers/video/fbdev/core/fbcon.o
(drivers/video/fbdev/core/fbcon.c)
[logspec:kbuild,kbuild.compiler.error]
---
- dashboard: https://d.kernelci.org/i/maestro:e495b4ec34845e228a5f1639d0ec805fdbb6e7c0
- giturl: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
- commit HEAD: 52baa369b052eae3278dda3062d63a3058eb9cfe
Log excerpt:
=====================================================
drivers/video/fbdev/core/fbcon.c:478:33: error: ‘fb_center_logo’
undeclared (first use in this function); did you mean
‘fb_prepare_logo’?
478 | fb_center_logo = true;
| ^~~~~~~~~~~~~~
| fb_prepare_logo
drivers/video/fbdev/core/fbcon.c:478:33: note: each undeclared
identifier is reported only once for each function it appears in
CC fs/ubifs/budget.o
drivers/video/fbdev/core/fbcon.c:485:33: error: ‘fb_logo_count’
undeclared (first use in this function); did you mean ‘file_count’?
485 | fb_logo_count =
simple_strtol(options, &options, 0);
| ^~~~~~~~~~~~~
| file_count
=====================================================
# Builds where the incident occurred:
## cros://chromeos-6.6/arm64/chromiumos-mediatek.flavour.config+lab-setup+arm64-chromebook+CONFIG_MODULE_COMPRESS=n+CONFIG_MODULE_COMPRESS_NONE=y
on (arm64):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98b4f28b1441c081c5566
## cros://chromeos-6.6/x86_64/chromeos-amd-stoneyridge.flavour.config+lab-setup+x86-board+CONFIG_MODULE_COMPRESS=n+CONFIG_MODULE_COMPRESS_NONE=y
on (x86_64):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98b5428b1441c081c558c
## cros://chromeos-6.6/x86_64/chromeos-intel-pineview.flavour.config+lab-setup+x86-board+CONFIG_MODULE_COMPRESS=n+CONFIG_MODULE_COMPRESS_NONE=y
on (x86_64):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98b5928b1441c081c558f
## multi_v5_defconfig on (arm):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98b8f28b1441c081c5622
## multi_v7_defconfig on (arm):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98b8028b1441c081c55fe
## multi_v7_defconfig+kselftest on (arm):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98b8a28b1441c081c561f
#kernelci issue maestro:e495b4ec34845e228a5f1639d0ec805fdbb6e7c0
Reported-by: kernelci.org bot <bot(a)kernelci.org>
--
This is an experimental report format. Please send feedback in!
Talk to us at kernelci(a)lists.linux.dev
Made with love by the KernelCI team - https://kernelci.org
Hello,
New build issue found on stable-rc/linux-6.6.y:
---
implicit declaration of function ‘vunmap’; did you mean ‘kunmap’?
[-Werror=implicit-function-declaration] in io_uring/io_uring.o
(io_uring/io_uring.c) [logspec:kbuild,kbuild.compiler.error]
---
- dashboard: https://d.kernelci.org/i/maestro:160797c1391e9c7479eace7259b46a47c35c7db7
- giturl: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
- commit HEAD: 52baa369b052eae3278dda3062d63a3058eb9cfe
Log excerpt:
=====================================================
io_uring/io_uring.c:2708:17: error: implicit declaration of function
‘vunmap’; did you mean ‘kunmap’?
[-Werror=implicit-function-declaration]
2708 | vunmap(ptr);
| ^~~~~~
| kunmap
io_uring/io_uring.c: In function ‘__io_uaddr_map’:
io_uring/io_uring.c:2784:21: error: implicit declaration of function
‘vmap’; did you mean ‘kmap’? [-Werror=implicit-function-declaration]
2784 | page_addr = vmap(page_array, nr_pages, VM_MAP, PAGE_KERNEL);
| ^~~~
| kmap
io_uring/io_uring.c:2784:48: error: ‘VM_MAP’ undeclared (first use in
this function); did you mean ‘VM_MTE’?
2784 | page_addr = vmap(page_array, nr_pages, VM_MAP, PAGE_KERNEL);
| ^~~~~~
| VM_MTE
io_uring/io_uring.c:2784:48: note: each undeclared identifier is
reported only once for each function it appears in
io_uring/io_uring.c: In function ‘io_mem_alloc_single’:
io_uring/io_uring.c:2863:37: error: ‘VM_MAP’ undeclared (first use in
this function); did you mean ‘VM_MTE’?
2863 | ret = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
| ^~~~~~
| VM_MTE
CC crypto/sha256_generic.o
cc1: some warnings being treated as errors
=====================================================
# Builds where the incident occurred:
## 32r2el_defconfig on (mips):
- compiler: gcc-12
- dashboard: https://d.kernelci.org/build/maestro:67d98bca28b1441c081c56f5
#kernelci issue maestro:160797c1391e9c7479eace7259b46a47c35c7db7
Reported-by: kernelci.org bot <bot(a)kernelci.org>
--
This is an experimental report format. Please send feedback in!
Talk to us at kernelci(a)lists.linux.dev
Made with love by the KernelCI team - https://kernelci.org
The max_bad_eraseblocks_per_lun member of nand_device obviously
describes the *maximum* number of bad eraseblocks per LUN.
Fix this obvious typo in the kernel-doc comment.
Fixes: 377e517b5fa5 ("mtd: nand: Add max_bad_eraseblocks_per_lun info to memorg")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
---
include/linux/mtd/nand.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mtd/nand.h b/include/linux/mtd/nand.h
index 0e2f228e8b4a..07486168d104 100644
--- a/include/linux/mtd/nand.h
+++ b/include/linux/mtd/nand.h
@@ -21,7 +21,7 @@ struct nand_device;
* @oobsize: OOB area size
* @pages_per_eraseblock: number of pages per eraseblock
* @eraseblocks_per_lun: number of eraseblocks per LUN (Logical Unit Number)
- * @max_bad_eraseblocks_per_lun: maximum number of eraseblocks per LUN
+ * @max_bad_eraseblocks_per_lun: maximum number of bad eraseblocks per LUN
* @planes_per_lun: number of planes per LUN
* @luns_per_target: number of LUN per target (target is a synonym for die)
* @ntargets: total number of targets exposed by the NAND device
--
2.48.1
Potentially include errata for Landlock ABI v5 (Linux 6.10) and v6
(Linux 6.12). That will be useful for the following signal scoping
erratum.
As explained in errata.h, this commit should be backportable without
conflict down to ABI v5. When backported that far, it must then not
include the errata/abi-6.h file.
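For illustration only (a hypothetical sketch, not part of this patch):
an erratum entry file such as errata/abi-6.h is expected to contain
little more than an erratum declaration, with the ABI number coming
from the LANDLOCK_ERRATA_ABI value in force at inclusion time:

  /* errata/abi-6.h (hypothetical content) */
  LANDLOCK_ERRATUM(3)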
Fixes: 54a6e6bbf3be ("landlock: Add signal scoping")
Cc: Günther Noack <gnoack(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Mickaël Salaün <mic(a)digikod.net>
Link: https://lore.kernel.org/r/20250318161443.279194-5-mic@digikod.net
---
Changes since v1:
- New patch.
---
security/landlock/errata.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/security/landlock/errata.h b/security/landlock/errata.h
index f26b28b9873d..8e626accac10 100644
--- a/security/landlock/errata.h
+++ b/security/landlock/errata.h
@@ -63,6 +63,18 @@ static const struct landlock_erratum landlock_errata_init[] __initconst = {
#endif
#undef LANDLOCK_ERRATA_ABI
+#define LANDLOCK_ERRATA_ABI 5
+#if __has_include("errata/abi-5.h")
+#include "errata/abi-5.h"
+#endif
+#undef LANDLOCK_ERRATA_ABI
+
+#define LANDLOCK_ERRATA_ABI 6
+#if __has_include("errata/abi-6.h")
+#include "errata/abi-6.h"
+#endif
+#undef LANDLOCK_ERRATA_ABI
+
/*
* For each new erratum, we need to include all the ABI files up to the impacted
* ABI to make all potential future intermediate errata easy to backport.
--
2.48.1
Backport this series to 6.1 & 6.6 because we get build errors with GCC 14
and OpenSSL 3 (or later):
certs/extract-cert.c: In function 'main':
certs/extract-cert.c:124:17: error: implicit declaration of function 'ENGINE_load_builtin_engines' [-Wimplicit-function-declaration]
124 | ENGINE_load_builtin_engines();
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
certs/extract-cert.c:126:21: error: implicit declaration of function 'ENGINE_by_id' [-Wimplicit-function-declaration]
126 | e = ENGINE_by_id("pkcs11");
| ^~~~~~~~~~~~
certs/extract-cert.c:126:19: error: assignment to 'ENGINE *' {aka 'struct engine_st *'} from 'int' makes pointer from integer without a cast [-Wint-conversion]
126 | e = ENGINE_by_id("pkcs11");
| ^
certs/extract-cert.c:128:21: error: implicit declaration of function 'ENGINE_init' [-Wimplicit-function-declaration]
128 | if (ENGINE_init(e))
| ^~~~~~~~~~~
certs/extract-cert.c:133:30: error: implicit declaration of function 'ENGINE_ctrl_cmd_string' [-Wimplicit-function-declaration]
133 | ERR(!ENGINE_ctrl_cmd_string(e, "PIN", key_pass, 0), "Set PKCS#11 PIN");
| ^~~~~~~~~~~~~~~~~~~~~~
certs/extract-cert.c:64:32: note: in definition of macro 'ERR'
64 | bool __cond = (cond); \
| ^~~~
certs/extract-cert.c:134:17: error: implicit declaration of function 'ENGINE_ctrl_cmd' [-Wimplicit-function-declaration]
134 | ENGINE_ctrl_cmd(e, "LOAD_CERT_CTRL", 0, &parms, NULL, 1);
| ^~~~~~~~~~~~~~~
In theory 5.4, 5.10 & 5.15 also need this, but they need more effort
because the file paths are different.
The ENGINE interface has its limitations and has been superseded by the
PROVIDER API; it is deprecated since OpenSSL version 3.0.
Some distros have started removing it from header files.
Update sign-file and extract-cert to use the PROVIDER API for OpenSSL
major version >= 3.
Tested on F39 with openssl-3.1.1, pkcs11-provider-0.5-2, openssl-pkcs11-0.4.12-4
and softhsm-2.6.1-5 by using the same key/cert as PEM and PKCS11 and
comparing that the result is identical.
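To give a rough idea of the PROVIDER-based approach, here is a minimal
sketch of loading a certificate from a PKCS#11 URI via the OpenSSL 3
STORE API instead of ENGINE_by_id("pkcs11"). This is only an
illustration with error reporting elided, not the actual code from
certs/extract-cert.c:

  #include <openssl/provider.h>
  #include <openssl/store.h>
  #include <openssl/x509.h>

  static X509 *load_cert(const char *uri)
  {
          X509 *cert = NULL;
          OSSL_STORE_CTX *store;

          /* Make the pkcs11 provider available, keeping the default
           * provider as a fallback for regular PEM files. */
          if (!OSSL_PROVIDER_try_load(NULL, "pkcs11", 1))
                  return NULL;

          store = OSSL_STORE_open(uri, NULL, NULL, NULL, NULL);
          if (!store)
                  return NULL;

          while (!cert && !OSSL_STORE_eof(store) && !OSSL_STORE_error(store)) {
                  OSSL_STORE_INFO *info = OSSL_STORE_load(store);

                  if (info && OSSL_STORE_INFO_get_type(info) == OSSL_STORE_INFO_CERT)
                          cert = OSSL_STORE_INFO_get1_CERT(info);
                  OSSL_STORE_INFO_free(info);
          }
          OSSL_STORE_close(store);
          return cert;
  }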
Jan Stancek (3):
sign-file,extract-cert: move common SSL helper functions to a header
sign-file,extract-cert: avoid using deprecated ERR_get_error_line()
sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3
Signed-off-by: Jan Stancek <jstancek(a)redhat.com>
Signed-off-by: Huacai Chen <chenhuacai(a)loongson.cn>
---
MAINTAINERS | 1 +
certs/Makefile | 2 +-
certs/extract-cert.c | 138 +++++++++++++++++++++++--------------------
scripts/sign-file.c | 134 +++++++++++++++++++++--------------------
scripts/ssl-common.h | 32 ++++++++++
5 files changed, 178 insertions(+), 129 deletions(-)
create mode 100644 scripts/ssl-common.h
---
2.27.0
From: Kang Yang <quic_kangyang(a)quicinc.com>
[ Upstream commit 95c38953cb1ecf40399a676a1f85dfe2b5780a9a ]
When running 'rmmod ath10k', ath10k_sdio_remove() frees the sdio
workqueue with destroy_workqueue(). But if CONFIG_INIT_ON_FREE_DEFAULT_ON
is set to yes, a kernel panic happens:
Call trace:
destroy_workqueue+0x1c/0x258
ath10k_sdio_remove+0x84/0x94
sdio_bus_remove+0x50/0x16c
device_release_driver_internal+0x188/0x25c
device_driver_detach+0x20/0x2c
This is because during 'rmmod ath10k', ath10k_sdio_remove() calls
ath10k_core_destroy() before destroy_workqueue(). wiphy_dev_release()
is eventually called from ath10k_core_destroy(). This function frees
struct cfg80211_registered_device *rdev and all its members, including
the wiphy, the dev and the pointer to the sdio workqueue. The pointer to
the sdio workqueue is then zeroed due to CONFIG_INIT_ON_FREE_DEFAULT_ON.
After the device release, destroy_workqueue() dereferences that NULL
pointer and the kernel panic happens.
Call trace:
ath10k_sdio_remove
->ath10k_core_unregister
...
->ath10k_core_stop
->ath10k_hif_stop
->ath10k_sdio_irq_disable
->ath10k_hif_power_down
->del_timer_sync(&ar_sdio->sleep_timer)
->ath10k_core_destroy
->ath10k_mac_destroy
->ieee80211_free_hw
->wiphy_free
...
->wiphy_dev_release
->destroy_workqueue
Call destroy_workqueue() before ath10k_core_destroy(): free the
workqueue buffer first and then let ath10k_core_destroy() free the
pointer to the workqueue. This order matches the error path order in
ath10k_sdio_probe().
No work will be queued on the sdio workqueue between the point where it
is destroyed and the point where ath10k_core_destroy() is called. Based
on the call stack above, the reason is:
Only ath10k_sdio_sleep_timer_handler(), ath10k_sdio_hif_tx_sg() and
ath10k_sdio_irq_disable() queue work on the sdio workqueue.
The sleep timer is deleted before ath10k_core_destroy() in
ath10k_hif_power_down().
ath10k_sdio_irq_disable() is only called from ath10k_hif_stop().
ath10k_core_unregister() calls ath10k_hif_power_down() to stop the hif
bus, so ath10k_sdio_hif_tx_sg() won't be called anymore.
Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00189
Signed-off-by: Kang Yang <quic_kangyang(a)quicinc.com>
Tested-by: David Ruth <druth(a)chromium.org>
Reviewed-by: David Ruth <druth(a)chromium.org>
Link: https://patch.msgid.link/20241008022246.1010-1-quic_kangyang@quicinc.com
Signed-off-by: Jeff Johnson <quic_jjohnson(a)quicinc.com>
Signed-off-by: Alva Lan <alvalan9(a)foxmail.com>
---
drivers/net/wireless/ath/ath10k/sdio.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index 63e1c2d783c5..b2e0abb5b2b5 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -3,6 +3,7 @@
* Copyright (c) 2004-2011 Atheros Communications Inc.
* Copyright (c) 2011-2012,2017 Qualcomm Atheros, Inc.
* Copyright (c) 2016-2017 Erik Stromdahl <erik.stromdahl(a)gmail.com>
+ * Copyright (c) 2022-2024 Qualcomm Innovation Center, Inc. All rights reserved.
*/
#include <linux/module.h>
@@ -2648,9 +2649,9 @@ static void ath10k_sdio_remove(struct sdio_func *func)
netif_napi_del(&ar->napi);
- ath10k_core_destroy(ar);
-
destroy_workqueue(ar_sdio->workqueue);
+
+ ath10k_core_destroy(ar);
}
static const struct sdio_device_id ath10k_sdio_devices[] = {
--
2.34.1
From: Hillf Danton <hdanton(a)sina.com>
Once a key's reference count has been reduced to 0, the garbage collector
thread may destroy it at any time and so key_put() is not allowed to touch
the key after that point. The most key_put() is normally allowed to do is
to touch key_gc_work as that's a static global variable.
However, in an effort to speed up the reclamation of quota, this is now
done in key_put() once the key's usage is reduced to 0 - but now the code
is looking at the key after the deadline, which is forbidden.
Fix this on an expedited basis[*] by taking a ref on the key->user struct
and caching the key length before dropping the refcount so that we can
reduce the quota afterwards if we dropped the last ref.
[*] This is going to hurt key_put() performance, so a better way is
probably necessary, such as sticking the dead key onto a queue for the
garbage collector to pick up rather than having it scan the serial number
registry.
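A minimal sketch of that suggested alternative (hypothetical, not an
existing implementation; "graveyard_link" is an assumed new llist_node
member of struct key):

  /* Hypothetical: a lock-free graveyard the GC drains, instead of
   * scanning the serial number registry for dead keys. */
  static LLIST_HEAD(key_graveyard);

  void key_put(struct key *key)
  {
          if (key && refcount_dec_and_test(&key->usage)) {
                  llist_add(&key->graveyard_link, &key_graveyard);
                  schedule_work(&key_gc_work);
          }
  }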
Fixes: 9578e327b2b4 ("keys: update key quotas in key_put()")
Reported-by: syzbot+6105ffc1ded71d194d6d(a)syzkaller.appspotmail.com
Tested-by: syzbot+6105ffc1ded71d194d6d(a)syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton(a)sina.com>
Signed-off-by: David Howells <dhowells(a)redhat.com>
cc: Jarkko Sakkinen <jarkko(a)kernel.org>
cc: Kees Cook <kees(a)kernel.org>
cc: keyrings(a)vger.kernel.org
cc: stable(a)vger.kernel.org # v6.10+
---
security/keys/key.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/security/keys/key.c b/security/keys/key.c
index 3d7d185019d3..1e6028492355 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -645,21 +645,30 @@ EXPORT_SYMBOL(key_reject_and_link);
*/
void key_put(struct key *key)
{
+ int quota_flag;
+ unsigned short len;
+ struct key_user *user;
+
if (key) {
key_check(key);
+ quota_flag = test_bit(KEY_FLAG_IN_QUOTA, &key->flags);
+ len = key->quotalen;
+ user = key->user;
+ refcount_inc(&user->usage);
if (refcount_dec_and_test(&key->usage)) {
unsigned long flags;
/* deal with the user's key tracking and quota */
- if (test_bit(KEY_FLAG_IN_QUOTA, &key->flags)) {
- spin_lock_irqsave(&key->user->lock, flags);
- key->user->qnkeys--;
- key->user->qnbytes -= key->quotalen;
- spin_unlock_irqrestore(&key->user->lock, flags);
+ if (quota_flag) {
+ spin_lock_irqsave(&user->lock, flags);
+ user->qnkeys--;
+ user->qnbytes -= len;
+ spin_unlock_irqrestore(&user->lock, flags);
}
schedule_work(&key_gc_work);
}
+ key_user_put(user);
}
}
EXPORT_SYMBOL(key_put);
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x df27cef153603b18a7d094b53cc3d5264ff32797
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031635-resent-sniff-676f@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From df27cef153603b18a7d094b53cc3d5264ff32797 Mon Sep 17 00:00:00 2001
From: Benno Lossin <benno.lossin(a)proton.me>
Date: Wed, 5 Mar 2025 13:29:01 +0000
Subject: [PATCH] rust: init: fix `Zeroable` implementation for
`Option<NonNull<T>>` and `Option<KBox<T>>`
According to [1], `NonNull<T>` and `#[repr(transparent)]` wrapper types
such as our custom `KBox<T>` have the null pointer optimization only if
`T: Sized`. Thus remove the `Zeroable` implementation for the unsized
case.
Link: https://doc.rust-lang.org/stable/std/option/index.html#representation [1]
Reported-by: Alice Ryhl <aliceryhl(a)google.com>
Closes: https://lore.kernel.org/rust-for-linux/CAH5fLghL+qzrD8KiCF1V3vf2YcC6aWySzkm…
Cc: stable(a)vger.kernel.org # v6.12+ (a custom patch will be needed for 6.6.y)
Fixes: 38cde0bd7b67 ("rust: init: add `Zeroable` trait and `init::zeroed` function")
Signed-off-by: Benno Lossin <benno.lossin(a)proton.me>
Reviewed-by: Alice Ryhl <aliceryhl(a)google.com>
Reviewed-by: Andreas Hindborg <a.hindborg(a)kernel.org>
Link: https://lore.kernel.org/r/20250305132836.2145476-1-benno.lossin@proton.me
[ Added Closes tag and moved up the Reported-by one. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 7fd1ea8265a5..8bbd5e3398fc 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -1418,17 +1418,14 @@ macro_rules! impl_zeroable {
// SAFETY: `T: Zeroable` and `UnsafeCell` is `repr(transparent)`.
{<T: ?Sized + Zeroable>} UnsafeCell<T>,
- // SAFETY: All zeros is equivalent to `None` (option layout optimization guarantee).
+ // SAFETY: All zeros is equivalent to `None` (option layout optimization guarantee:
+ // https://doc.rust-lang.org/stable/std/option/index.html#representation).
Option<NonZeroU8>, Option<NonZeroU16>, Option<NonZeroU32>, Option<NonZeroU64>,
Option<NonZeroU128>, Option<NonZeroUsize>,
Option<NonZeroI8>, Option<NonZeroI16>, Option<NonZeroI32>, Option<NonZeroI64>,
Option<NonZeroI128>, Option<NonZeroIsize>,
-
- // SAFETY: All zeros is equivalent to `None` (option layout optimization guarantee).
- //
- // In this case we are allowed to use `T: ?Sized`, since all zeros is the `None` variant.
- {<T: ?Sized>} Option<NonNull<T>>,
- {<T: ?Sized>} Option<KBox<T>>,
+ {<T>} Option<NonNull<T>>,
+ {<T>} Option<KBox<T>>,
// SAFETY: `null` pointer is valid.
//
Once device_register() has failed, we should call put_device() to
decrement the reference count for cleanup; otherwise it could cause a
memory leak. Also assign the release callback before put_device() can be
called, so that dropping the last reference actually releases the device.
As comment of device_register() says, 'NOTE: _Never_ directly free
@dev after calling this function, even if it returned an error! Always
use put_device() to give up the reference initialized in this function
instead.'
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: baa057e29b58 ("media: v4l2-dev: use pr_foo() for printing messages")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- modified the patch as no callback function before put_device().
---
drivers/media/v4l2-core/v4l2-dev.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/media/v4l2-core/v4l2-dev.c b/drivers/media/v4l2-core/v4l2-dev.c
index 5bcaeeba4d09..4a8fdf8115c0 100644
--- a/drivers/media/v4l2-core/v4l2-dev.c
+++ b/drivers/media/v4l2-core/v4l2-dev.c
@@ -1054,17 +1054,16 @@ int __video_register_device(struct video_device *vdev,
vdev->dev.class = &video_class;
vdev->dev.devt = MKDEV(VIDEO_MAJOR, vdev->minor);
vdev->dev.parent = vdev->dev_parent;
+ vdev->dev.release = v4l2_device_release;
dev_set_name(&vdev->dev, "%s%d", name_base, vdev->num);
mutex_lock(&videodev_lock);
ret = device_register(&vdev->dev);
if (ret < 0) {
mutex_unlock(&videodev_lock);
pr_err("%s: device_register failed\n", __func__);
- goto cleanup;
+ put_device(&vdev->dev);
+ return ret;
}
- /* Register the release callback that will be called when the last
- reference to the device goes away. */
- vdev->dev.release = v4l2_device_release;
if (nr != -1 && nr != vdev->num && warn_if_nr_in_use)
pr_warn("%s: requested %s%d, got %s\n", __func__,
--
2.25.1
From: Vincent Mailhol <mailhol.vincent(a)wanadoo.fr>
Commit 7fdaf8966aae ("can: ucan: use strscpy() to instead of strncpy()")
unintentionally introduced a one-byte out-of-bounds read on strscpy()'s
source argument (which is kind of ironic knowing that strscpy() is meant
to be a more secure alternative :)).
Let's consider the buffers below:
dest[len + 1]; /* will be NUL terminated */
src[len]; /* may not be NUL terminated */
When doing:
strncpy(dest, src, len);
dest[len] = '\0';
strncpy() will read up to len bytes from src.
On the other hand:
strscpy(dest, src, len + 1);
will read up to len + 1 bytes from src; that is to say, an out-of-bounds
read of one byte will occur on src if it is not NUL terminated. Note
that the src[len] byte is never copied, but strscpy() still needs to
read it to check whether a truncation occurred or not.
This exact pattern happened in ucan.
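Condensed into a self-contained example (hypothetical buffers, not the
driver code):

  char src[4] = { 'a', 'b', 'c', 'd' };	/* NOT NUL terminated */
  char dest[5];

  strncpy(dest, src, 4);	/* reads src[0..3] only: in bounds */
  dest[4] = '\0';

  strscpy(dest, src, 5);	/* may also read src[4]: out of bounds */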
The root cause is that the source is not NUL terminated. Instead of
doing a copy in a local buffer, directly NUL terminate it as soon as
usb_control_msg() returns. With this, the local firmware_str[] variable
can be removed.
On top of this, do a couple of refactors:
- ucan_ctl_payload->raw is only used for the firmware string, so
rename it to ucan_ctl_payload->fw_str and change its type from u8 to
char.
- ucan_device_request_in() is only used to retrieve the firmware
string, so rename it to ucan_get_fw_str() and refactor it to make it
directly handle all the string termination logic.
Reported-by: syzbot+d7d8c418e8317899e88c(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-can/67b323a4.050a0220.173698.002b.GAE@google.…
Fixes: 7fdaf8966aae ("can: ucan: use strscpy() to instead of strncpy()")
Signed-off-by: Vincent Mailhol <mailhol.vincent(a)wanadoo.fr>
Link: https://patch.msgid.link/20250218143515.627682-2-mailhol.vincent@wanadoo.fr
Cc: stable(a)vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
---
drivers/net/can/usb/ucan.c | 43 ++++++++++++++++----------------------
1 file changed, 18 insertions(+), 25 deletions(-)
diff --git a/drivers/net/can/usb/ucan.c b/drivers/net/can/usb/ucan.c
index 39a63b7313a4..07406daf7c88 100644
--- a/drivers/net/can/usb/ucan.c
+++ b/drivers/net/can/usb/ucan.c
@@ -186,7 +186,7 @@ union ucan_ctl_payload {
*/
struct ucan_ctl_cmd_get_protocol_version cmd_get_protocol_version;
- u8 raw[128];
+ u8 fw_str[128];
} __packed;
enum {
@@ -424,18 +424,20 @@ static int ucan_ctrl_command_out(struct ucan_priv *up,
UCAN_USB_CTL_PIPE_TIMEOUT);
}
-static int ucan_device_request_in(struct ucan_priv *up,
- u8 cmd, u16 subcmd, u16 datalen)
+static void ucan_get_fw_str(struct ucan_priv *up, char *fw_str, size_t size)
{
- return usb_control_msg(up->udev,
- usb_rcvctrlpipe(up->udev, 0),
- cmd,
- USB_DIR_IN | USB_TYPE_VENDOR | USB_RECIP_DEVICE,
- subcmd,
- 0,
- up->ctl_msg_buffer,
- datalen,
- UCAN_USB_CTL_PIPE_TIMEOUT);
+ int ret;
+
+ ret = usb_control_msg(up->udev, usb_rcvctrlpipe(up->udev, 0),
+ UCAN_DEVICE_GET_FW_STRING,
+ USB_DIR_IN | USB_TYPE_VENDOR |
+ USB_RECIP_DEVICE,
+ 0, 0, fw_str, size - 1,
+ UCAN_USB_CTL_PIPE_TIMEOUT);
+ if (ret > 0)
+ fw_str[ret] = '\0';
+ else
+ strscpy(fw_str, "unknown", size);
}
/* Parse the device information structure reported by the device and
@@ -1314,7 +1316,6 @@ static int ucan_probe(struct usb_interface *intf,
u8 in_ep_addr;
u8 out_ep_addr;
union ucan_ctl_payload *ctl_msg_buffer;
- char firmware_str[sizeof(union ucan_ctl_payload) + 1];
udev = interface_to_usbdev(intf);
@@ -1527,17 +1528,6 @@ static int ucan_probe(struct usb_interface *intf,
*/
ucan_parse_device_info(up, &ctl_msg_buffer->cmd_get_device_info);
- /* just print some device information - if available */
- ret = ucan_device_request_in(up, UCAN_DEVICE_GET_FW_STRING, 0,
- sizeof(union ucan_ctl_payload));
- if (ret > 0) {
- /* copy string while ensuring zero termination */
- strscpy(firmware_str, up->ctl_msg_buffer->raw,
- sizeof(union ucan_ctl_payload) + 1);
- } else {
- strcpy(firmware_str, "unknown");
- }
-
/* device is compatible, reset it */
ret = ucan_ctrl_command_out(up, UCAN_COMMAND_RESET, 0, 0);
if (ret < 0)
@@ -1555,7 +1545,10 @@ static int ucan_probe(struct usb_interface *intf,
/* initialisation complete, log device info */
netdev_info(up->netdev, "registered device\n");
- netdev_info(up->netdev, "firmware string: %s\n", firmware_str);
+ ucan_get_fw_str(up, up->ctl_msg_buffer->fw_str,
+ sizeof(up->ctl_msg_buffer->fw_str));
+ netdev_info(up->netdev, "firmware string: %s\n",
+ up->ctl_msg_buffer->fw_str);
/* success */
return 0;
base-commit: 4003c9e78778e93188a09d6043a74f7154449d43
--
2.47.2
The second parameter of memblock_set_node() is a size, not an end
address. Since the loop iterates from lower to higher addresses, the
node ids end up correct once it finishes, but during the process some
regions are transiently assigned the wrong nid.
Pass the size instead of the end address.
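For reference, the mismatch is between these two call shapes (a sketch
based on the memblock_set_node() prototype in include/linux/memblock.h):

  int memblock_set_node(phys_addr_t base, phys_addr_t size,
                        struct memblock_type *type, int nid);

  /* wrong: passes an end address where a size is expected, so the
   * call covers [start, start + end) instead of [start, end) */
  memblock_set_node(start, end, &memblock.reserved, nid);

  /* right: */
  memblock_set_node(start, region->size, &memblock.reserved, nid);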
Fixes: 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
CC: Mike Rapoport <rppt(a)kernel.org>
CC: Yajun Deng <yajun.deng(a)linux.dev>
CC: <stable(a)vger.kernel.org>
---
mm/memblock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index 64ae678cd1d1..85442f1b7f14 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2192,7 +2192,7 @@ static void __init memmap_init_reserved_pages(void)
if (memblock_is_nomap(region))
reserve_bootmem_region(start, end, nid);
- memblock_set_node(start, end, &memblock.reserved, nid);
+ memblock_set_node(start, region->size, &memblock.reserved, nid);
}
/*
--
2.34.1
Add device-awake calls to the rproc boot and shutdown paths.
Currently, the device-awake call is only present in the recovery path
of remoteproc. If a user stops and starts the rproc via the sysfs
interface, firmware loading fails if PM suspension kicks in meanwhile.
Keep the device awake in such a case, just as is done for the recovery
path.
Fixes: a781e5aa59110 ("remoteproc: core: Prevent system suspend during remoteproc recovery")
Signed-off-by: Souradeep Chowdhury <quic_schowdhu(a)quicinc.com>
Cc: stable(a)vger.kernel.org
---
drivers/remoteproc/remoteproc_core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index c2cf0d277729..5d6c4e694b4c 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1917,6 +1917,7 @@ int rproc_boot(struct rproc *rproc)
return -EINVAL;
}
+ pm_stay_awake(rproc->dev.parent);
dev = &rproc->dev;
ret = mutex_lock_interruptible(&rproc->lock);
@@ -1961,6 +1962,7 @@ int rproc_boot(struct rproc *rproc)
atomic_dec(&rproc->power);
unlock_mutex:
mutex_unlock(&rproc->lock);
+ pm_relax(rproc->dev.parent);
return ret;
}
EXPORT_SYMBOL(rproc_boot);
@@ -1991,6 +1993,7 @@ int rproc_shutdown(struct rproc *rproc)
struct device *dev = &rproc->dev;
int ret = 0;
+ pm_stay_awake(rproc->dev.parent);
ret = mutex_lock_interruptible(&rproc->lock);
if (ret) {
dev_err(dev, "can't lock rproc %s: %d\n", rproc->name, ret);
@@ -2027,6 +2030,7 @@ int rproc_shutdown(struct rproc *rproc)
rproc->table_ptr = NULL;
out:
mutex_unlock(&rproc->lock);
+ pm_relax(rproc->dev.parent);
return ret;
}
EXPORT_SYMBOL(rproc_shutdown);
--
2.34.1
Once cdev_device_add() has failed, we should use put_device() to
decrement the reference count for cleanup; otherwise it could cause a
memory leak. Although the operations in the err_free_ida error path are
similar to those in the release callback fsi_slave_release(),
put_device() is the correct handling when cdev_device_add() fails, as
the comments require.
As the comment for device_add() says, 'if device_add() succeeds, you
should call device_del() when you want to get rid of it. If device_add()
has not succeeded, use only put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 371975b0b075 ("fsi/core: Fix error paths on CFAM init")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
drivers/fsi/fsi-core.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/fsi/fsi-core.c b/drivers/fsi/fsi-core.c
index e2e1e9df6115..1373e05e3659 100644
--- a/drivers/fsi/fsi-core.c
+++ b/drivers/fsi/fsi-core.c
@@ -1084,7 +1084,8 @@ static int fsi_slave_init(struct fsi_master *master, int link, uint8_t id)
rc = cdev_device_add(&slave->cdev, &slave->dev);
if (rc) {
dev_err(&slave->dev, "Error %d creating slave device\n", rc);
- goto err_free_ida;
+ put_device(&slave->dev);
+ return rc;
}
/* Now that we have the cdev registered with the core, any fatal
@@ -1110,8 +1111,6 @@ static int fsi_slave_init(struct fsi_master *master, int link, uint8_t id)
return 0;
-err_free_ida:
- fsi_free_minor(slave->dev.devt);
err_free:
of_node_put(slave->dev.of_node);
kfree(slave);
--
2.25.1
If kvm_arch_vcpu_create() fails to share the vCPU page with the
hypervisor, we propagate the error back to the ioctl but leave the
vGIC vCPU data initialised. Not only does this leak the corresponding
memory when the vCPU is destroyed, but it can also lead to a
use-after-free if the redistributor device handling tries to walk into
the vCPU.
Add the missing cleanup to kvm_arch_vcpu_create(), ensuring that the
vGIC vCPU structures are destroyed on error.
Cc: <stable(a)vger.kernel.org>
Cc: Marc Zyngier <maz(a)kernel.org>
Cc: Oliver Upton <oliver.upton(a)linux.dev>
Cc: Quentin Perret <qperret(a)google.com>
Signed-off-by: Will Deacon <will(a)kernel.org>
---
It's hard to come up with a "Fixes:" tag for this. Prior to 3f868e142c0b
("KVM: arm64: Introduce kvm_share_hyp()"), create_hyp_mappings() could
still have failed, although if you go back before 66c57edd3bc7 ("KVM:
arm64: Restrict EL2 stage-1 changes in protected mode") then it's
vanishingly unlikely.
arch/arm64/kvm/arm.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index b8e55a441282..fa71cee02faa 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -466,7 +466,11 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
if (err)
return err;
- return kvm_share_hyp(vcpu, vcpu + 1);
+ err = kvm_share_hyp(vcpu, vcpu + 1);
+ if (err)
+ kvm_vgic_vcpu_destroy(vcpu);
+
+ return err;
}
void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
--
2.49.0.rc1.451.g8f38331e32-goog
On our Marvell OCTEON CN96XX board, we observed the following panic on
the latest kernel:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6 #20
Hardware name: Marvell OcteonTX CN96XX board (DT)
pc : of_pci_add_properties+0x278/0x4c8
Call trace:
of_pci_add_properties+0x278/0x4c8 (P)
of_pci_make_dev_node+0xe0/0x158
pci_bus_add_device+0x158/0x228
pci_bus_add_devices+0x40/0x98
pci_host_probe+0x94/0x118
pci_host_common_probe+0x130/0x1b0
platform_probe+0x70/0xf0
The dmesg logs indicated that the PCI bridge was scanning with an invalid bus range:
pci-host-generic 878020000000.pci: PCI host bridge to bus 0002:00
pci_bus 0002:00: root bus resource [bus 00-ff]
pci 0002:00:00.0: scanning [bus f9-f9] behind bridge, pass 0
pci 0002:00:01.0: scanning [bus fa-fa] behind bridge, pass 0
pci 0002:00:02.0: scanning [bus fb-fb] behind bridge, pass 0
pci 0002:00:03.0: scanning [bus fc-fc] behind bridge, pass 0
pci 0002:00:04.0: scanning [bus fd-fd] behind bridge, pass 0
pci 0002:00:05.0: scanning [bus fe-fe] behind bridge, pass 0
pci 0002:00:06.0: scanning [bus ff-ff] behind bridge, pass 0
pci 0002:00:07.0: scanning [bus 00-00] behind bridge, pass 0
pci 0002:00:07.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0002:00:08.0: scanning [bus 01-01] behind bridge, pass 0
pci 0002:00:09.0: scanning [bus 02-02] behind bridge, pass 0
pci 0002:00:0a.0: scanning [bus 03-03] behind bridge, pass 0
pci 0002:00:0b.0: scanning [bus 04-04] behind bridge, pass 0
pci 0002:00:0c.0: scanning [bus 05-05] behind bridge, pass 0
pci 0002:00:0d.0: scanning [bus 06-06] behind bridge, pass 0
pci 0002:00:0e.0: scanning [bus 07-07] behind bridge, pass 0
pci 0002:00:0f.0: scanning [bus 08-08] behind bridge, pass 0
This regression was introduced by commit 7246a4520b4b ("PCI: Use
preserve_config in place of pci_flags"). On our board, the 0002:00:07.0
bridge is misconfigured by the bootloader. Both its secondary and
subordinate bus numbers are initialized to 0, while its fixed secondary
bus number is set to 8. However, bus number 8 is also assigned to another
bridge (0002:00:0f.0). Although this is a bootloader issue, before the
change in commit 7246a4520b4b, the PCI_REASSIGN_ALL_BUS flag was set
by default when PCI_PROBE_ONLY was not enabled, ensuing that all the
bus number for these bridges were reassigned, avoiding any conflicts.
After the change introduced in commit 7246a4520b4b, the bus numbers
assigned by the bootloader are reused by all other bridges, except
the misconfigured 0002:00:07.0 bridge. The kernel attempt to reconfigure
0002:00:07.0 by reusing the fixed secondary bus number 8 assigned by
bootloader. However, since a pci_bus has already been allocated for
bus 8 due to the probe of 0002:00:0f.0, no new pci_bus allocated for
0002:00:07.0. This results in a pci bridge device without a pci_bus
attached (pdev->subordinate == NULL). Consequently, accessing
pdev->subordinate in of_pci_prop_bus_range() leads to a NULL pointer
dereference.
To summarize, we need to set the PCI_REASSIGN_ALL_BUS flag when
PCI_PROBE_ONLY is not enabled in order to work around issue like the
one described above.
Cc: stable(a)vger.kernel.org
Fixes: 7246a4520b4b ("PCI: Use preserve_config in place of pci_flags")
Signed-off-by: Bo Sun <Bo.Sun.CN(a)windriver.com>
---
Changes in v2:
- Added explicit comment about the quirk, as requested by Mani.
- Made commit message more clear, as requested by Bjorn.
drivers/pci/quirks.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 82b21e34c545..cec58c7479e1 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6181,6 +6181,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1536, rom_bar_overlap_defect);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1537, rom_bar_overlap_defect);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1538, rom_bar_overlap_defect);
+/*
+ * Quirk for Marvell CN96XX/CN10XXX boards:
+ *
+ * Adds PCI_REASSIGN_ALL_BUS unless PCI_PROBE_ONLY is set, forcing bus number
+ * reassignment to avoid conflicts caused by bootloader misconfigured PCI bridges.
+ *
+ * This resolves a regression introduced by commit 7246a4520b4b ("PCI: Use
+ * preserve_config in place of pci_flags"), which removed this behavior.
+ */
+static void quirk_marvell_cn96xx_cn10xxx_reassign_all_busnr(struct pci_dev *dev)
+{
+ if (!pci_has_flag(PCI_PROBE_ONLY))
+ pci_add_flags(PCI_REASSIGN_ALL_BUS);
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_CAVIUM, 0xa002,
+ quirk_marvell_cn96xx_cn10xxx_reassign_all_busnr);
+
#ifdef CONFIG_PCIEASPM
/*
* Several Intel DG2 graphics devices advertise that they can only tolerate
--
2.48.1
The quilt patch titled
Subject: mm/hwpoison: do not send SIGBUS to processes with recovered clean pages
has been removed from the -mm tree. Its filename was
mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Shuai Xue <xueshuai(a)linux.alibaba.com>
Subject: mm/hwpoison: do not send SIGBUS to processes with recovered clean pages
Date: Wed, 12 Mar 2025 19:28:51 +0800
When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a UCNA
signature, and the core reporting an SRAR signature machine check when
the data is about to be consumed.
- Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1]
Prior to Icelake memory controllers reported patrol scrub events that
detected a previously unseen uncorrected error in memory by signaling a
broadcast machine check with an SRAO (Software Recoverable Action
Optional) signature in the machine check bank. This was overkill because
it's not an urgent problem: no core is on the verge of consuming that
bad data. It was also found that multiple SRAO UCEs may cause nested MCE
interrupts and finally become an IERR.
Hence, Intel downgrades the machine check bank signature of patrol scrub
from SRAO to UCNA (Uncorrected, No Action required), and signal changed to
#CMCI. Just to add to the confusion, Linux does take an action (in
uc_decode_notifier()) to try to offline the page despite the UC*NA*
signature name.
- Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1]
Having decided that CMCI/UCNA is the best action for patrol scrub errors,
the memory controller uses it for reads too. But the memory controller is
executing asynchronously from the core, and can't tell the difference
between a "real" read and a speculative read. So it will do CMCI/UCNA if
an error is found in any read.
Thus:
1) Core is clever and thinks address A is needed soon, issues a speculative read.
2) Core finds it is going to use address A soon after sending the read request
3) The CMCI from the memory controller is in a race with MCE from the core
that will soon try to retire the load from address A.
Quite often (because speculation has got better) the CMCI from the memory
controller is delivered before the core is committed to the instruction
reading address A, so the interrupt is taken, and Linux offlines the page
(marking it as poison).
- Why user process is killed for instr case
Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported
"not recovered"") tries to fix the noisy message "Memory error not
recovered" and skips duplicate SIGBUSs due to the race. But it also
introduced a bug where kill_accessing_process() returns -EHWPOISON for
the instr case; as a result, kill_me_maybe() sends a SIGBUS to the user
process.
If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure(). For dirty pages,
memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
converting the PTE to a hwpoison entry. As a result,
kill_accessing_process():
- call walk_page_range() and return 1 regardless of whether
try_to_unmap() succeeds or fails,
- call kill_proc() to make sure a SIGBUS is sent
- return -EHWPOISON to indicate that SIGBUS is already sent to the
process and kill_me_maybe() doesn't have to send it again.
However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the
PTE unchanged and not converted to a hwpoison entry. Conversely, for
clean pages where PTE entries are not marked as hwpoison,
kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send
a SIGBUS.
Console log looks like this:
Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
Memory failure: 0x827ca68: already hardware poisoned
mce: Memory error not recovered
To fix it, return 0 for "corrupted page was clean", preventing an
unnecessary SIGBUS to user process.
[1] https://lore.kernel.org/lkml/20250217063335.22257-1-xueshuai@linux.alibaba.…
Link: https://lkml.kernel.org/r/20250312112852.82415-3-xueshuai@linux.alibaba.com
Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
Signed-off-by: Shuai Xue <xueshuai(a)linux.alibaba.com>
Tested-by: Tony Luck <tony.luck(a)intel.com>
Acked-by: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Cc: Josh Poimboeuf <jpoimboe(a)kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ruidong Tian <tianruidong(a)linux.alibaba.com>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam(a)amd.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
--- a/mm/memory-failure.c~mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages
+++ a/mm/memory-failure.c
@@ -881,12 +881,17 @@ static int kill_accessing_process(struct
mmap_read_lock(p->mm);
ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops,
(void *)&priv);
+ /*
+ * ret = 1 when CMCI wins, regardless of whether try_to_unmap()
+ * succeeds or fails, then kill the process with SIGBUS.
+ * ret = 0 when poison page is a clean page and it's dropped, no
+ * SIGBUS is needed.
+ */
if (ret == 1 && priv.tk.addr)
kill_proc(&priv.tk, pfn, flags);
- else
- ret = 0;
mmap_read_unlock(p->mm);
- return ret > 0 ? -EHWPOISON : -EFAULT;
+
+ return ret > 0 ? -EHWPOISON : 0;
}
/*
_
Patches currently in -mm which might be from xueshuai(a)linux.alibaba.com are
The quilt patch titled
Subject: x86/mce: use is_copy_from_user() to determine copy-from-user context
has been removed from the -mm tree. Its filename was
x86-mce-use-is_copy_from_user-to-determine-copy-from-user-context.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Shuai Xue <xueshuai(a)linux.alibaba.com>
Subject: x86/mce: use is_copy_from_user() to determine copy-from-user context
Date: Wed, 12 Mar 2025 19:28:50 +0800
Patch series "mm/hwpoison: Fix regressions in memory failure handling",
v4.
## 1. What am I trying to do:
This patchset resolves two critical regressions related to memory failure
handling that have appeared in the upstream kernel since version 5.17, as
compared to 5.10 LTS.
- copyin case: poison found in user page while kernel copying from user space
- instr case: poison found while instruction fetching in user space
## 2. What is the expected outcome and why
- For copyin case:
Kernel can recover from poison found where kernel is doing get_user() or
copy_from_user() if those places get an error return and the kernel return
-EFAULT to the process instead of crashing. More specifily, MCE handler
checks the fixup handler type to decide whether an in kernel #MC can be
recovered. When EX_TYPE_UACCESS is found, the PC jumps to recovery code
specified in _ASM_EXTABLE_FAULT() and return a -EFAULT to user space.
- For instr case:
If a poison found while instruction fetching in user space, full recovery
is possible. User process takes #PF, Linux allocates a new page and fills
by reading from storage.
## 3. What actually happens and why
- For copyin case: kernel panic since v5.17
Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the
extable fixup type for copy-from-user operations, changing it from
EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. This breaks the previous
EX_TYPE_UACCESS handling when poison is found in get_user() or
copy_from_user().
- For instr case: user process is killed by a SIGBUS signal due to #CMCI
and #MCE race
When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a UCNA
signature, and the core reporting an SRAR signature machine check when
the data is about to be consumed.
### Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1]
Prior to Icelake memory controllers reported patrol scrub events that
detected a previously unseen uncorrected error in memory by signaling a
broadcast machine check with an SRAO (Software Recoverable Action
Optional) signature in the machine check bank. This was overkill because
it's not an urgent problem: no core is on the verge of consuming that
bad data. It was also found that multiple SRAO UCEs may cause nested MCE
interrupts and finally become an IERR.
Hence, Intel downgrades the machine check bank signature of patrol scrub
from SRAO to UCNA (Uncorrected, No Action required), and signal changed to
#CMCI. Just to add to the confusion, Linux does take an action (in
uc_decode_notifier()) to try to offline the page despite the UC*NA*
signature name.
### Background: why #CMCI and #MCE race when poison is consuming in
Intel platform [1]
Having decided that CMCI/UCNA is the best action for patrol scrub errors,
the memory controller uses it for reads too. But the memory controller is
executing asynchronously from the core, and can't tell the difference
between a "real" read and a speculative read. So it will do CMCI/UCNA if
an error is found in any read.
Thus:
1) Core is clever and thinks address A is needed soon, issues a
speculative read.
2) Core finds it is going to use address A soon after sending the read
request
3) The CMCI from the memory controller is in a race with MCE from the
core that will soon try to retire the load from address A.
Quite often (because speculation has got better) the CMCI from the memory
controller is delivered before the core is committed to the instruction
reading address A, so the interrupt is taken, and Linux offlines the page
(marking it as poison).
## Why user process is killed for instr case
Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported
"not recovered"") tries to fix the noisy message "Memory error not
recovered" and skips duplicate SIGBUSs due to the race. But it also
introduced a bug where kill_accessing_process() returns -EHWPOISON for
the instr case; as a result, kill_me_maybe() sends a SIGBUS to the user
process.
# 4. The fix, in my opinion, should be:
- For copyin case:
The key point is whether the error context is in a read from user memory.
We do not care about the ex-type if we know its a MOV reading from
userspace.
is_copy_from_user() returns true when both of the following checks are
true:
- the current instruction is a copy
- the source address is user memory
If copy_user is true, we set
m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
Then do_machine_check() will try fixup_exception() first.
- For instr case: let kill_accessing_process() return 0 to prevent a SIGBUS.
- For patch 3:
The return value of memory_failure() is quite important while discussed
instr case regression with Tony and Miaohe for patch 2, so add comment
about the return value.
This patch (of 3):
Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and later commits updated the
extable fixup type for copy-from-user operations, changing it from
EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. Consequently, the error context
for copy-from-user operations no longer functions as an in-kernel
recovery context, resulting in kernel panics with the message:
"Machine check: Data load in unrecoverable area of kernel."
To address this, it is crucial to identify whether an error context
involves a read operation from user memory. The function
is_copy_from_user() can be utilized to determine whether:
- the current operation is a copy
- user memory is being read
When both conditions are met, is_copy_from_user() will return true,
confirming that it is indeed a direct copy from user memory. This check
is essential for correctly handling the context of errors in these
operations without relying on the extable fixup types that previously
allowed for in-kernel recovery.
So, use is_copy_from_user() to determine directly whether a context is a copy from user memory.
Link: https://lkml.kernel.org/r/20250312112852.82415-1-xueshuai@linux.alibaba.com
Link: https://lkml.kernel.org/r/20250312112852.82415-2-xueshuai@linux.alibaba.com
Fixes: 4c132d1d844a ("x86/futex: Remove .fixup usage")
Signed-off-by: Shuai Xue <xueshuai(a)linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz(a)infradead.org>
Acked-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Tested-by: Tony Luck <tony.luck(a)intel.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Josh Poimboeuf <jpoimboe(a)kernel.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: Ruidong Tian <tianruidong(a)linux.alibaba.com>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam(a)amd.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/kernel/cpu/mce/severity.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
--- a/arch/x86/kernel/cpu/mce/severity.c~x86-mce-use-is_copy_from_user-to-determine-copy-from-user-context
+++ a/arch/x86/kernel/cpu/mce/severity.c
@@ -300,13 +300,12 @@ static noinstr int error_context(struct
copy_user = is_copy_from_user(regs);
instrumentation_end();
- switch (fixup_type) {
- case EX_TYPE_UACCESS:
- if (!copy_user)
- return IN_KERNEL;
- m->kflags |= MCE_IN_KERNEL_COPYIN;
- fallthrough;
+ if (copy_user) {
+ m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+ return IN_KERNEL_RECOV;
+ }
+ switch (fixup_type) {
case EX_TYPE_FAULT_MCE_SAFE:
case EX_TYPE_DEFAULT_MCE_SAFE:
m->kflags |= MCE_IN_KERNEL_RECOV;
_
Patches currently in -mm which might be from xueshuai(a)linux.alibaba.com are
The quilt patch titled
Subject: mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlock
has been removed from the -mm tree. Its filename was
mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Subject: mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlock
Date: Wed, 12 Mar 2025 10:10:13 -0400
The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node
reclaim for struct pglist_data using a single bit.
It is "locked" with a test_and_set_bit (similarly to a try lock) which
provides full ordering with respect to loads and stores done within
__node_reclaim().
It is "unlocked" with clear_bit(), which does not provide any ordering
with respect to loads and stores done before clearing the bit.
The lack of clear_bit() memory ordering with respect to stores within
__node_reclaim() can cause a subsequent CPU to fail to observe stores from
a prior node reclaim. This is not an issue in practice on TSO (e.g.
x86), but it is an issue on weakly-ordered architectures (e.g. arm64).
Fix this by using clear_bit_unlock rather than clear_bit to clear
PGDAT_RECLAIM_LOCKED with a release memory ordering semantic.
This provides stronger memory ordering (release rather than relaxed).
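Annotated as the bit-lock pattern it implements (a sketch of
node_reclaim(), matching the hunk below):

  if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
          return NODE_RECLAIM_NOSCAN;  /* "lock": test_and_set_bit() is fully ordered */

  ret = __node_reclaim(pgdat, gfp_mask, order);  /* protected stores */

  /* "unlock" with release semantics: the stores above must be
   * visible before another CPU can observe the bit as clear */
  clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);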
Link: https://lkml.kernel.org/r/20250312141014.129725-1-mathieu.desnoyers@efficio…
Fixes: d773ed6b856a ("mm: test and set zone reclaim lock before starting reclaim")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Alan Stern <stern(a)rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea(a)gmail.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Boqun Feng <boqun.feng(a)gmail.com>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Jade Alglave <j.alglave(a)ucl.ac.uk>
Cc: Luc Maranget <luc.maranget(a)inria.fr>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmscan.c~mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock
+++ a/mm/vmscan.c
@@ -7581,7 +7581,7 @@ int node_reclaim(struct pglist_data *pgd
return NODE_RECLAIM_NOSCAN;
ret = __node_reclaim(pgdat, gfp_mask, order);
- clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
+ clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
if (ret)
count_vm_event(PGSCAN_ZONE_RECLAIM_SUCCESS);
_
Patches currently in -mm which might be from mathieu.desnoyers(a)efficios.com are
The quilt patch titled
Subject: mm/mremap: correctly handle partial mremap() of VMA starting at 0
has been removed from the -mm tree. Its filename was
mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm/mremap: correctly handle partial mremap() of VMA starting at 0
Date: Mon, 10 Mar 2025 20:50:34 +0000
Patch series "refactor mremap and fix bug", v3.
The existing mremap() logic has grown organically over a very long period
of time, resulting in code that is in many parts, very difficult to follow
and full of subtleties and sources of confusion.
In addition, it is difficult to thread state through the operation
correctly, as function arguments have expanded, some parameters are
expected to be temporarily altered during the operation, others are
intended to remain static and some can be overridden.
This series completely refactors the mremap implementation, sensibly
separating functions, adding comments to explain the more subtle aspects
of the implementation and making use of small structs to thread state
through everything.
The reason for doing so is to lay the groundwork for planned future
changes to the mremap logic, changes which require the ability to easily
pass around state.
Additionally, it would be unhelpful to add yet more logic to code that is
already difficult to follow without first refactoring it like this.
The first patch in this series additionally fixes a bug when a VMA with
start address zero is partially remapped.
Tested on real hardware under heavy workload and all self tests are
passing.
This patch (of 3):
Consider the case of a partial mremap() (that results in a VMA split) of
an accountable VMA (i.e. which has the VM_ACCOUNT flag set) whose start
address is zero, with the MREMAP_MAYMOVE flag specified and a scenario
where a move does in fact occur:
addr end
| |
v v
|-------------|
| vma |
|-------------|
0
This move is effected by unmapping the range [addr, end).  In order to
prevent an incorrect decrement of accounted memory which has already been
determined, the mremap() code in move_vma() clears VM_ACCOUNT from the VMA
prior to doing so, before reestablishing it in each of the VMAs
post-split:
addr end
| |
v v
|---| |---|
| A | | B |
|---| |---|
Commit 6b73cff239e5 ("mm: change munmap splitting order and move_vma()")
changed this logic so as to determine whether there is a need to do so
by establishing account_start and account_end and, in the instance where
such an operation is required, assigning them the values of
vma->vm_start and vma->vm_end.
Later the code checks if the operation is required for 'A' referenced
above thusly:
if (account_start) {
...
}
However, if the VMA described above has vma->vm_start == 0, which is now
assigned to account_start, this branch will not be executed.
As a result, the VMA 'A' above will remain stripped of its VM_ACCOUNT
flag, incorrectly.
The fix is to simply convert these variables to booleans and set them as
required.
Link: https://lkml.kernel.org/r/cover.1741639347.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/dc55cb6db25d97c3d9e460de4986a323fa959676.17416393…
Fixes: 6b73cff239e5 ("mm: change munmap splitting order and move_vma()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reviewed-by: Harry Yoo <harry.yoo(a)oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mremap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/mm/mremap.c~mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0
+++ a/mm/mremap.c
@@ -705,8 +705,8 @@ static unsigned long move_vma(struct vm_
unsigned long vm_flags = vma->vm_flags;
unsigned long new_pgoff;
unsigned long moved_len;
- unsigned long account_start = 0;
- unsigned long account_end = 0;
+ bool account_start = false;
+ bool account_end = false;
unsigned long hiwater_vm;
int err = 0;
bool need_rmap_locks;
@@ -790,9 +790,9 @@ static unsigned long move_vma(struct vm_
if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) {
vm_flags_clear(vma, VM_ACCOUNT);
if (vma->vm_start < old_addr)
- account_start = vma->vm_start;
+ account_start = true;
if (vma->vm_end > old_addr + old_len)
- account_end = vma->vm_end;
+ account_end = true;
}
/*
@@ -832,7 +832,7 @@ static unsigned long move_vma(struct vm_
/* OOM: unable to split vma, just get accounts right */
if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
vm_acct_memory(old_len >> PAGE_SHIFT);
- account_start = account_end = 0;
+ account_start = account_end = false;
}
if (vm_flags & VM_LOCKED) {
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
Variable "bridge" is allocated by agp_alloc_bridge() and
have to be released by agp_put_bridge() if something goes
wrong. In this patch, add the missing call of agp_put_bridge()
in agp_amdk7_probe() to prevent potential memory leak bug.
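As a rough illustration of the pattern being restored (a simplified sketch with a stand-in allocator, not the driver code), every exit taken after a successful allocation must release it:
```
#include <errno.h>
#include <stdlib.h>

struct bridge { int id; };

/* Kernel-style cleanup: one release per acquisition on every error path. */
static int probe_like(int have_vga, struct bridge **out)
{
	/* stands in for agp_alloc_bridge() */
	struct bridge *bridge = malloc(sizeof(*bridge));

	if (!bridge)
		return -ENOMEM;

	if (!have_vga) {
		free(bridge);	/* the release the fix adds (agp_put_bridge()) */
		return -ENODEV;
	}

	*out = bridge;	/* success: ownership passes to the caller */
	return 0;
}

int main(void)
{
	struct bridge *b;

	return probe_like(0, &b) == -ENODEV ? 0 : 1; /* exercises the error path */
}
```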
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable(a)vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024(a)163.com>
---
drivers/char/agp/amd-k7-agp.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/char/agp/amd-k7-agp.c b/drivers/char/agp/amd-k7-agp.c
index 795c8c9ff680..40e1fc462dca 100644
--- a/drivers/char/agp/amd-k7-agp.c
+++ b/drivers/char/agp/amd-k7-agp.c
@@ -441,6 +441,7 @@ static int agp_amdk7_probe(struct pci_dev *pdev,
gfxcard = pci_get_class(PCI_CLASS_DISPLAY_VGA<<8, gfxcard);
if (!gfxcard) {
dev_info(&pdev->dev, "no AGP VGA controller\n");
+ agp_put_bridge(bridge);
return -ENODEV;
}
cap_ptr = pci_find_capability(gfxcard, PCI_CAP_ID_AGP);
--
2.25.1
Commit 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") introduced
a way to set the nid on all reserved regions.
But there is a corner case in which some regions are left with an invalid
nid: when memblock_set_node() doubles the array of memblock.reserved, it
may create a new reserved region before the current position, and that new
region is left with an invalid node id.
Repeat the procedure when this is detected.
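The shape of the fix as a standalone sketch (hypothetical types, not the memblock code): when processing an element may reallocate the array being walked, a changed capacity invalidates the walk and forces a restart.
```
#include <stddef.h>

struct region { int nid; };

struct region_array {
	size_t cnt, max;	/* entries in use / allocated capacity */
	struct region *regions;
};

/* May grow arr->regions (doubling arr->max) as a side effect. */
static void process_one(struct region_array *arr, size_t i)
{
	arr->regions[i].nid = 0;	/* stand-in for the real work */
}

static void set_all_nids(struct region_array *arr)
{
repeat:
	for (size_t i = 0; i < arr->cnt; i++) {
		size_t max = arr->max;

		process_one(arr, i);

		/*
		 * The capacity changed: the array was doubled and entries
		 * may have been inserted before 'i', so start the walk over.
		 */
		if (max != arr->max)
			goto repeat;
	}
}
```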
Fixes: 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
CC: Mike Rapoport <rppt(a)kernel.org>
CC: Yajun Deng <yajun.deng(a)linux.dev>
CC: <stable(a)vger.kernel.org>
---
mm/memblock.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/mm/memblock.c b/mm/memblock.c
index 85442f1b7f14..302dd7bc622d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2184,7 +2184,10 @@ static void __init memmap_init_reserved_pages(void)
* set nid on all reserved pages and also treat struct
* pages for the NOMAP regions as PageReserved
*/
+repeat:
for_each_mem_region(region) {
+ unsigned long max = memblock.reserved.max;
+
nid = memblock_get_region_node(region);
start = region->base;
end = start + region->size;
@@ -2193,6 +2196,15 @@ static void __init memmap_init_reserved_pages(void)
reserve_bootmem_region(start, end, nid);
memblock_set_node(start, region->size, &memblock.reserved, nid);
+
+ /*
+	 * A changed 'max' means memblock.reserved has doubled its
+	 * array, which may result in a new reserved region before the
+	 * current 'start'. Now we should repeat the procedure to set
+	 * its node id.
+ */
+ if (max != memblock.reserved.max)
+ goto repeat;
}
/*
--
2.34.1
When long lines are sent to the debug console on the framebuffer, the
right-most column is lost. Fix this by subtracting 1 from the column
count before comparing it with console_struct_cur_column, as the latter
counts from zero.
Linewrap is handled with a recursive call to console_putc, but this
alters the console_struct_cur_row global. Store the old value before
calling console_putc, so the right-most character gets rendered on the
correct line.
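For readers not fluent in m68k assembly, the fixed control flow corresponds roughly to this C sketch (hypothetical struct and helper names mirroring the console_struct fields; the declarations stand in for the rest of the console code):
```
struct console_state {
	unsigned long cur_row, cur_column;
	unsigned long num_rows, num_columns;
};

void console_putc(struct console_state *con, char c);	/* handles '\n' */
void render_glyph(unsigned long row, unsigned long col, char c);

static void put_printable(struct console_state *con, char c)
{
	unsigned long col = con->cur_column++;
	unsigned long row = con->cur_row;	/* latch before any recursion */

	/* Columns count from zero, so the last usable index is num_columns - 1. */
	if (col >= con->num_columns - 1)
		console_putc(con, '\n');	/* wraps: resets column, moves cur_row */

	render_glyph(row, col, c);	/* the latched row keeps the glyph on its line */
}
```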
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable(a)vger.kernel.org
Signed-off-by: Finn Thain <fthain(a)linux-m68k.org>
---
arch/m68k/kernel/head.S | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/m68k/kernel/head.S b/arch/m68k/kernel/head.S
index 852255cf60de..9c60047764d0 100644
--- a/arch/m68k/kernel/head.S
+++ b/arch/m68k/kernel/head.S
@@ -3583,11 +3583,16 @@ L(console_not_home):
movel %a0@(Lconsole_struct_cur_column),%d0
addql #1,%a0@(Lconsole_struct_cur_column)
movel %a0@(Lconsole_struct_num_columns),%d1
+ subil #1,%d1
cmpl %d1,%d0
jcs 1f
- console_putc #'\n' /* recursion is OK! */
+ /* recursion will alter console_struct so load d1 register first */
+ movel %a0@(Lconsole_struct_cur_row),%d1
+ console_putc #'\n'
+ jmp 2f
1:
movel %a0@(Lconsole_struct_cur_row),%d1
+2:
/*
* At this point we make a shift in register usage
--
2.45.3
Prepare vPMC registers for user-initiated changes after first run. This
is important specifically for debugging Windows on QEMU with GDB; QEMU
tries to write back all visible registers when resuming the VM execution
with GDB, corrupting the PMU state. Windows always uses the PMU so this
can cause adverse effects on that particular OS.
This series also contains patch "KVM: arm64: PMU: Set raw values from
user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}", which reverts semantic
changes made for the mentioned registers in the past. It is necessary
to migrate the PMU state properly on Firecracker, QEMU, and crosvm.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v5:
- Rebased.
- Link to v4: https://lore.kernel.org/r/20250313-pmc-v4-0-2c976827118c@daynix.com
Changes in v4:
- Reverted changes for functions implementing ioctls in patch
"KVM: arm64: PMU: Assume PMU presence in pmu-emul.c".
- Removed kvm_pmu_vcpu_reset().
- Reordered function calls in kvm_vcpu_reload_pmu() for better style.
- Link to v3: https://lore.kernel.org/r/20250312-pmc-v3-0-0411cab5dc3d@daynix.com
Changes in v3:
- Added patch "KVM: arm64: PMU: Assume PMU presence in pmu-emul.c".
- Added an explanation of this patch series' motivation to each patch.
- Explained why userspace register writes and register reset should be
covered in patch "KVM: arm64: PMU: Reload when user modifies
registers".
- Marked patch "KVM: arm64: PMU: Set raw values from user to
PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}" for stable.
- Reordered so that patch "KVM: arm64: PMU: Set raw values from user to
PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}" would come first.
- Added patch "KVM: arm64: PMU: Call kvm_pmu_handle_pmcr() after masking
PMCNTENSET_EL0".
- Added patch "KVM: arm64: Reload PMCNTENSET_EL0".
- Link to v2: https://lore.kernel.org/r/20250307-pmc-v2-0-6c3375a5f1e4@daynix.com
Changes in v2:
- Changed to utilize KVM_REQ_RELOAD_PMU as suggested by Oliver Upton.
- Added patch "KVM: arm64: PMU: Reload when user modifies registers"
to cover more registers.
- Added patch "KVM: arm64: PMU: Set raw values from user to
PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}".
- Link to v1: https://lore.kernel.org/r/20250302-pmc-v1-1-caff989093dc@daynix.com
---
Akihiko Odaki (5):
KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
KVM: arm64: PMU: Reload when user modifies registers
KVM: arm64: PMU: Reload when resetting
arch/arm64/kvm/arm.c | 17 ++++++++-----
arch/arm64/kvm/emulate-nested.c | 6 +++--
arch/arm64/kvm/pmu-emul.c | 56 +++++++++++------------------------------
arch/arm64/kvm/reset.c | 3 ---
arch/arm64/kvm/sys_regs.c | 52 ++++++++++++++++++++++----------------
include/kvm/arm_pmu.h | 4 +--
6 files changed, 62 insertions(+), 76 deletions(-)
---
base-commit: 80e54e84911a923c40d7bee33a34c1b4be148d7a
change-id: 20250302-pmc-b90a86af945c
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
pciehp_reset_slot() disables PDCE (Presence Detect Changed Enable) and
DLLSCE (Data Link Layer State Changed Enable) for the duration of reset
and clears the related status bits PDC and DLLSC from the Slot Status
register after the reset to avoid hotplug incorrectly assuming the card
was removed.
However, hotplug shares its interrupt with PME and BW notifications, both
of which can cause pciehp_isr() to run despite the PDCE and DLLSCE bits
being off. pciehp_isr() then picks up the PDC or DLLSC bits from the Slot
Status register, set by the events that occur during reset, and caches
them in ->pending_events. Later, the IRQ thread in pciehp_ist() will
process the ->pending_events and will assume the Link went Down due to a
card change (in pciehp_handle_presence_or_link_change()).
Change pciehp_reset_slot() to also clear HPIE (Hot-Plug Interrupt
Enable), as pciehp_isr() checks HPIE first to decide whether the interrupt
is for it. Then synchronize with the IRQ handling to ensure no events are
pending before invoking the reset.
Similarly, if the poll mode is in use, park the poll thread over the
duration of the reset to stop handling events.
In order not to race synchronize_irq()/kthread_{,un}park() with the irq
/ poll_thread freeing from pciehp_remove(), take reset_lock in
pciehp_free_irq() and check the validity of the irq / poll_thread
variables in pciehp_reset_slot().
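Put together, the quiescing sequence looks roughly like this (a simplified outline; write_slot_ctrl() and clear_slot_status() are hypothetical helpers, not the actual driver code):
```
static int reset_slot_quiesced(struct controller *ctrl)
{
	int rc;

	/* 1. Mask HPIE/PDCE/DLLSCE so new events are no longer signalled. */
	write_slot_ctrl(ctrl, 0);

	/* 2. Drain whatever is already in flight. */
	if (pciehp_poll_mode) {
		if (ctrl->poll_thread)
			kthread_park(ctrl->poll_thread);
	} else {
		if (ctrl->pcie->irq != IRQ_NOTCONNECTED)
			synchronize_irq(ctrl->pcie->irq);
	}

	/* 3. Perform the reset with event handling quiesced. */
	rc = pci_bridge_secondary_bus_reset(ctrl->pcie->port);

	/* 4. Discard stale PDC/DLLSC status, then re-enable events. */
	clear_slot_status(ctrl);
	if (pciehp_poll_mode && ctrl->poll_thread)
		kthread_unpark(ctrl->poll_thread);
	write_slot_ctrl(ctrl, PCI_EXP_SLTCTL_HPIE | PCI_EXP_SLTCTL_PDCE |
			      PCI_EXP_SLTCTL_DLLSCE);

	return rc;
}
```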
Fixes: 06a8d89af551 ("PCI: pciehp: Disable link notification across slot reset")
Fixes: 720d6a671a6e ("PCI: pciehp: Do not handle events if interrupts are masked")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219765
Suggested-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
drivers/pci/hotplug/pciehp_hpc.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index bb5a8d9f03ad..c487e274b282 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -77,10 +77,15 @@ static inline int pciehp_request_irq(struct controller *ctrl)
static inline void pciehp_free_irq(struct controller *ctrl)
{
- if (pciehp_poll_mode)
+ down_read_nested(&ctrl->reset_lock, ctrl->depth);
+ if (pciehp_poll_mode) {
kthread_stop(ctrl->poll_thread);
- else
+ ctrl->poll_thread = NULL;
+ } else {
free_irq(ctrl->pcie->irq, ctrl);
+ ctrl->pcie->irq = IRQ_NOTCONNECTED;
+ }
+ up_read(&ctrl->reset_lock);
}
static int pcie_poll_cmd(struct controller *ctrl, int timeout)
@@ -766,8 +771,9 @@ static int pciehp_poll(void *data)
while (!kthread_should_stop()) {
/* poll for interrupt events or user requests */
- while (pciehp_isr(IRQ_NOTCONNECTED, ctrl) == IRQ_WAKE_THREAD ||
- atomic_read(&ctrl->pending_events))
+ while (!kthread_should_park() &&
+ (pciehp_isr(IRQ_NOTCONNECTED, ctrl) == IRQ_WAKE_THREAD ||
+ atomic_read(&ctrl->pending_events)))
pciehp_ist(IRQ_NOTCONNECTED, ctrl);
if (pciehp_poll_time <= 0 || pciehp_poll_time > 60)
@@ -907,6 +913,8 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe)
down_write_nested(&ctrl->reset_lock, ctrl->depth);
+ if (!pciehp_poll_mode)
+ ctrl_mask |= PCI_EXP_SLTCTL_HPIE;
if (!ATTN_BUTTN(ctrl)) {
ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
stat_mask |= PCI_EXP_SLTSTA_PDC;
@@ -918,9 +926,21 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe)
ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, 0);
+ /* Make sure HPIE is no longer seen by the interrupt handler. */
+ if (pciehp_poll_mode) {
+ if (ctrl->poll_thread)
+ kthread_park(ctrl->poll_thread);
+ } else {
+ if (ctrl->pcie->irq != IRQ_NOTCONNECTED)
+ synchronize_irq(ctrl->pcie->irq);
+ }
+
rc = pci_bridge_secondary_bus_reset(ctrl->pcie->port);
pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, stat_mask);
+ if (pciehp_poll_mode && ctrl->poll_thread)
+ kthread_unpark(ctrl->poll_thread);
+
pcie_write_cmd_nowait(ctrl, ctrl_mask, ctrl_mask);
ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask);
--
2.39.5
This patch series contains some missing openvswitch port output fixes
for the stable 5.15 kernel.
Felix Huettner (1):
net: openvswitch: fix race on port output
Ilya Maximets (1):
openvswitch: fix lockup on tx to unregistering netdev with carrier
net/core/dev.c | 1 +
net/openvswitch/actions.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
--
2.34.1
This patch series contains some missing openvswitch port output fixes
for the stable 5.4 kernel.
Felix Huettner (1):
net: openvswitch: fix race on port output
Ilya Maximets (1):
openvswitch: fix lockup on tx to unregistering netdev with carrier
net/core/dev.c | 1 +
net/openvswitch/actions.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
--
2.34.1
This patch series contains some missing openvswitch port output fixes
for the stable 5.10 kernel.
Felix Huettner (1):
net: openvswitch: fix race on port output
Ilya Maximets (1):
openvswitch: fix lockup on tx to unregistering netdev with carrier
net/core/dev.c | 1 +
net/openvswitch/actions.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
--
2.34.1
[ Upstream commit 5ac9b4e935dfc6af41eee2ddc21deb5c36507a9f ]
>From memfd_secret(2) manpage:
The memory areas backing the file created with memfd_secret(2) are
visible only to the processes that have access to the file descriptor.
The memory region is removed from the kernel page tables and only the
page tables of the processes holding the file descriptor map the
corresponding physical memory. (Thus, the pages in the region can't be
accessed by the kernel itself, so that, for example, pointers to the
region can't be passed to system calls.)
We need to handle this special case gracefully in build ID fetching
code. Return -EFAULT whenever secretmem file is passed to build_id_parse()
family of APIs. Original report and repro can be found in [0].
[0] https://lore.kernel.org/bpf/ZwyG8Uro%2FSyTXAni@ly-workstation/
Fixes: de3ec364c3c3 ("lib/buildid: add single folio-based file reader abstraction")
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Suggested-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Link: https://lore.kernel.org/bpf/20241017175431.6183-A-hca@linux.ibm.com
Link: https://lore.kernel.org/bpf/20241017174713.2157873-1-andrii@kernel.org
[ Chen Linxuan: backport same logic without folio-based changes ]
Cc: stable(a)vger.kernel.org
Fixes: 88a16a130933 ("perf: Add build id data in mmap2 event")
Signed-off-by: Chen Linxuan <chenlinxuan(a)deepin.org>
---
v1 -> v2: use vma_is_secretmem() instead of directly checking
vma->vm_file->f_op == &secretmem_fops
v2 -> v3: keep original comment
---
lib/buildid.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/buildid.c b/lib/buildid.c
index 9fc46366597e..8d839ff5548e 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/secretmem.h>
#define BUILD_ID 3
@@ -157,6 +158,10 @@ int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
if (!vma->vm_file)
return -EINVAL;
+ /* reject secretmem folios created with memfd_secret() */
+ if (vma_is_secretmem(vma))
+ return -EFAULT;
+
page = find_get_page(vma->vm_file->f_mapping, 0);
if (!page)
return -EFAULT; /* page not mapped */
--
2.48.1
From: Eric Dumazet <edumazet(a)google.com>
tcp_abort() has the same issue as the one fixed in the prior patch
in tcp_write_err().
commit 5ce4645c23cf5f048eb8e9ce49e514bababdee85 upstream.
To apply commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4,
this patch must be applied first.
In order to get consistent results from tcp_poll(), we must call
sk_error_report() after tcp_done().
We can use tcp_done_with_error() to centralize this logic.
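A paraphrased sketch of the ordering the helper centralizes (based on the description above, not necessarily the verbatim upstream body): publish the error, finish the socket, and only then wake waiters.
```
static void tcp_done_with_error_sketch(struct sock *sk, int err)
{
	WRITE_ONCE(sk->sk_err, err);
	/* This barrier is coupled with smp_rmb() in tcp_poll(). */
	smp_wmb();

	tcp_done(sk);
	/* Waking last lets tcp_poll() observe a fully-closed errored socket. */
	sk_error_report(sk);
}
```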
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Acked-by: Neal Cardwell <ncardwell(a)google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
[youngmin: Resolved minor conflict in net/ipv4/tcp.c]
Signed-off-by: Youngmin Nam <youngmin.nam(a)samsung.com>
---
net/ipv4/tcp.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3c85ecab1445..c1e624ca6a25 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4514,13 +4514,9 @@ int tcp_abort(struct sock *sk, int err)
bh_lock_sock(sk);
if (!sock_flag(sk, SOCK_DEAD)) {
- WRITE_ONCE(sk->sk_err, err);
- /* This barrier is coupled with smp_rmb() in tcp_poll() */
- smp_wmb();
- sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
tcp_send_active_reset(sk, GFP_ATOMIC);
- tcp_done(sk);
+ tcp_done_with_error(sk, err);
}
bh_unlock_sock(sk);
--
2.39.2
[ Upstream commit 5ac9b4e935dfc6af41eee2ddc21deb5c36507a9f ]
>From memfd_secret(2) manpage:
The memory areas backing the file created with memfd_secret(2) are
visible only to the processes that have access to the file descriptor.
The memory region is removed from the kernel page tables and only the
page tables of the processes holding the file descriptor map the
corresponding physical memory. (Thus, the pages in the region can't be
accessed by the kernel itself, so that, for example, pointers to the
region can't be passed to system calls.)
We need to handle this special case gracefully in build ID fetching
code. Return -EFAULT whenever secretmem file is passed to build_id_parse()
family of APIs. Original report and repro can be found in [0].
[0] https://lore.kernel.org/bpf/ZwyG8Uro%2FSyTXAni@ly-workstation/
Fixes: de3ec364c3c3 ("lib/buildid: add single folio-based file reader abstraction")
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Suggested-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Link: https://lore.kernel.org/bpf/20241017175431.6183-A-hca@linux.ibm.com
Link: https://lore.kernel.org/bpf/20241017174713.2157873-1-andrii@kernel.org
[ Chen Linxuan: backport same logic without folio-based changes ]
Cc: stable(a)vger.kernel.org
Fixes: 88a16a130933 ("perf: Add build id data in mmap2 event")
Signed-off-by: Chen Linxuan <chenlinxuan(a)deepin.org>
---
v1 -> v2: use vma_is_secretmem() instead of directly checking
vma->vm_file->f_op == &secretmem_fops
---
lib/buildid.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/buildid.c b/lib/buildid.c
index 9fc46366597e..34315d09b544 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/secretmem.h>
#define BUILD_ID 3
@@ -157,6 +158,10 @@ int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
if (!vma->vm_file)
return -EINVAL;
+ /* reject secretmem */
+ if (vma_is_secretmem(vma))
+ return -EFAULT;
+
page = find_get_page(vma->vm_file->f_mapping, 0);
if (!page)
return -EFAULT; /* page not mapped */
--
2.48.1
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
[ Upstream commit 58a039e679fe72bd0efa8b2abe669a7914bb4429 ]
Commit ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in
remap_file_pages()") fixed a security issue: it added an LSM check when
trying to remap file pages, so that LSMs have the opportunity to evaluate
such an action, as they do for other memory operations such as mmap() and
mprotect().
However, that commit called security_mmap_file() inside the mmap_lock
lock, while the other calls do it before taking the lock, after commit
8b3ec6814c83 ("take security_mmap_file() outside of ->mmap_sem").
This caused a lock inversion issue with IMA, which was taking the
mmap_lock and i_mutex lock in the opposite order when the
remap_file_pages() system call was called.
Solve the issue by splitting the critical region in remap_file_pages()
into two regions: the first takes a read lock on mmap_lock, retrieves the
VMA and the file descriptor associated with it, and calculates the 'prot'
and 'flags' variables; the second takes a write lock on mmap_lock, checks
that the VMA flags and the VMA file descriptor are the same as the ones
obtained in the first critical region (otherwise the system call fails),
and calls do_mmap().
In between, after releasing the read lock and before taking the write
lock, call security_mmap_file(), which solves the lock inversion issue.
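In outline, the reworked system call looks like this (a condensed sketch of the flow just described; the full diff follows below):
```
	/* Region 1: snapshot under the read lock. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, start);
	/* ... derive prot/flags, save vm_flags, file = get_file(vma->vm_file) ... */
	mmap_read_unlock(mm);

	/* Lock dropped: call the LSM hook, consistent with the other callers. */
	ret = security_mmap_file(file, prot, flags);

	/* Region 2: revalidate the snapshot under the write lock. */
	mmap_write_lock(mm);
	vma = vma_lookup(mm, start);
	if (!vma || vma->vm_flags != vm_flags || vma->vm_file != file)
		goto out;	/* something changed underneath us: fail */
	ret = do_mmap(vma->vm_file, start, size, prot, flags, 0, pgoff,
		      &populate, NULL);
```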
Link: https://lkml.kernel.org/r/20241018161415.3845146-1-roberto.sassu@huaweiclou…
Fixes: ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Reported-by: syzbot+1cd571a672400ef3a930(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-security-module/66f7b10e.050a0220.46d20.0036.…
Tested-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Reviewed-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Reviewed-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reviewed-by: Paul Moore <paul(a)paul-moore.com>
Tested-by: syzbot+1cd571a672400ef3a930(a)syzkaller.appspotmail.com
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Dmitry Kasatkin <dmitry.kasatkin(a)gmail.com>
Cc: Eric Snowberg <eric.snowberg(a)oracle.com>
Cc: James Morris <jmorris(a)namei.org>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: "Serge E. Hallyn" <serge(a)hallyn.com>
Cc: Shu Han <ebpqwerty472123(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Jianqi Ren <jianqi.ren.cn(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Verified with a build test.
---
mm/mmap.c | 69 +++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 52 insertions(+), 17 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e4dfeaef668a..03a24cb3951d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2981,6 +2981,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
unsigned long populate = 0;
unsigned long ret = -EINVAL;
struct file *file;
+ vm_flags_t vm_flags;
pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/mm/remap_file_pages.rst.\n",
current->comm, current->pid);
@@ -2997,12 +2998,60 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
if (pgoff + (size >> PAGE_SHIFT) < pgoff)
return ret;
- if (mmap_write_lock_killable(mm))
+ if (mmap_read_lock_killable(mm))
+ return -EINTR;
+
+ /*
+ * Look up VMA under read lock first so we can perform the security
+	 * check without holding locks (which can be problematic). We reacquire a
+ * write lock later and check nothing changed underneath us.
+ */
+ vma = vma_lookup(mm, start);
+
+ if (!vma || !(vma->vm_flags & VM_SHARED)) {
+ mmap_read_unlock(mm);
+ return -EINVAL;
+ }
+
+ prot |= vma->vm_flags & VM_READ ? PROT_READ : 0;
+ prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0;
+ prot |= vma->vm_flags & VM_EXEC ? PROT_EXEC : 0;
+
+ flags &= MAP_NONBLOCK;
+ flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
+ if (vma->vm_flags & VM_LOCKED)
+ flags |= MAP_LOCKED;
+
+ /* Save vm_flags used to calculate prot and flags, and recheck later. */
+ vm_flags = vma->vm_flags;
+ file = get_file(vma->vm_file);
+
+ mmap_read_unlock(mm);
+
+ /* Call outside mmap_lock to be consistent with other callers. */
+ ret = security_mmap_file(file, prot, flags);
+ if (ret) {
+ fput(file);
+ return ret;
+ }
+
+ ret = -EINVAL;
+
+ /* OK security check passed, take write lock + let it rip. */
+ if (mmap_write_lock_killable(mm)) {
+ fput(file);
return -EINTR;
+ }
vma = vma_lookup(mm, start);
- if (!vma || !(vma->vm_flags & VM_SHARED))
+ if (!vma)
+ goto out;
+
+ /* Make sure things didn't change under us. */
+ if (vma->vm_flags != vm_flags)
+ goto out;
+ if (vma->vm_file != file)
goto out;
if (start + size > vma->vm_end) {
@@ -3030,25 +3079,11 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
goto out;
}
- prot |= vma->vm_flags & VM_READ ? PROT_READ : 0;
- prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0;
- prot |= vma->vm_flags & VM_EXEC ? PROT_EXEC : 0;
-
- flags &= MAP_NONBLOCK;
- flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
- if (vma->vm_flags & VM_LOCKED)
- flags |= MAP_LOCKED;
-
- file = get_file(vma->vm_file);
- ret = security_mmap_file(vma->vm_file, prot, flags);
- if (ret)
- goto out_fput;
ret = do_mmap(vma->vm_file, start, size,
prot, flags, 0, pgoff, &populate, NULL);
-out_fput:
- fput(file);
out:
mmap_write_unlock(mm);
+ fput(file);
if (populate)
mm_populate(ret, populate);
if (!IS_ERR_VALUE(ret))
--
2.25.1
From: Waiman Long <longman(a)redhat.com>
[ Upstream commit 85b2b9c16d053364e2004883140538e73b333cdb ]
A circular lock dependency splat has been seen involving down_trylock():
======================================================
WARNING: possible circular locking dependency detected
6.12.0-41.el10.s390x+debug
------------------------------------------------------
dd/32479 is trying to acquire lock:
0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90
but task is already holding lock:
000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0
the existing dependency chain (in reverse order) is:
-> #4 (&zone->lock){-.-.}-{2:2}:
-> #3 (hrtimer_bases.lock){-.-.}-{2:2}:
-> #2 (&rq->__lock){-.-.}-{2:2}:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
-> #0 ((console_sem).lock){-.-.}-{2:2}:
The console_sem -> pi_lock dependency is due to calling try_to_wake_up()
while holding the console_sem raw_spinlock. This dependency can be broken
by using wake_q to do the wakeup instead of calling try_to_wake_up()
under the console_sem lock. This will also make the semaphore's
raw_spinlock become a terminal lock without taking any further locks
underneath it.
The hrtimer_bases.lock is a raw_spinlock while zone->lock is a
spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via
the debug_objects_fill_pool() helper function in the debugobjects code.
-> #4 (&zone->lock){-.-.}-{2:2}:
__lock_acquire+0xe86/0x1cc0
lock_acquire.part.0+0x258/0x630
lock_acquire+0xb8/0xe0
_raw_spin_lock_irqsave+0xb4/0x120
rmqueue_bulk+0xac/0x8f0
__rmqueue_pcplist+0x580/0x830
rmqueue_pcplist+0xfc/0x470
rmqueue.isra.0+0xdec/0x11b0
get_page_from_freelist+0x2ee/0xeb0
__alloc_pages_noprof+0x2c2/0x520
alloc_pages_mpol_noprof+0x1fc/0x4d0
alloc_pages_noprof+0x8c/0xe0
allocate_slab+0x320/0x460
___slab_alloc+0xa58/0x12b0
__slab_alloc.isra.0+0x42/0x60
kmem_cache_alloc_noprof+0x304/0x350
fill_pool+0xf6/0x450
debug_object_activate+0xfe/0x360
enqueue_hrtimer+0x34/0x190
__run_hrtimer+0x3c8/0x4c0
__hrtimer_run_queues+0x1b2/0x260
hrtimer_interrupt+0x316/0x760
do_IRQ+0x9a/0xe0
do_irq_async+0xf6/0x160
Normally a raw_spinlock to spinlock dependency is not legitimate
and will be warned about if CONFIG_PROVE_RAW_LOCK_NESTING is enabled,
but debug_objects_fill_pool() is an exception as it explicitly
allows this dependency for non-PREEMPT_RT kernels without causing a
PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is
legitimate and not a bug.
Anyway, the semaphore is the only locking primitive left that still
uses try_to_wake_up() to do a wakeup inside the critical section; all the
other locking primitives have been migrated to use wake_q to do the
wakeup outside of the critical section. It is also possible that there are
other circular locking dependencies involving printk/console_sem or
other existing/new semaphores lurking somewhere which may show up in
the future. Let's just do the migration to wake_q now to avoid headaches
like this.
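The wake_q idiom being adopted, as a minimal generic sketch (not the semaphore code itself): queue the wakeup while holding the lock, perform it after the unlock.
```
#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

/* Generic shape of the conversion: defer wakeups past the critical section. */
static void signal_one(raw_spinlock_t *lock, struct task_struct *waiter)
{
	unsigned long flags;
	DEFINE_WAKE_Q(wake_q);

	raw_spin_lock_irqsave(lock, flags);
	/* ... update the protected wait state ... */
	wake_q_add(&wake_q, waiter);	/* only queues: no wakeup yet */
	raw_spin_unlock_irqrestore(lock, flags);

	wake_up_q(&wake_q);	/* try_to_wake_up() runs with the lock dropped */
}
```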
Reported-by: syzbot+ed801a886dfdbfe7136d(a)syzkaller.appspotmail.com
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/locking/semaphore.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index d9dd94defc0a9..19389fdbfdfb1 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -29,6 +29,7 @@
#include <linux/export.h>
#include <linux/sched.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/ftrace.h>
@@ -37,7 +38,7 @@ static noinline void __down(struct semaphore *sem);
static noinline int __down_interruptible(struct semaphore *sem);
static noinline int __down_killable(struct semaphore *sem);
static noinline int __down_timeout(struct semaphore *sem, long timeout);
-static noinline void __up(struct semaphore *sem);
+static noinline void __up(struct semaphore *sem, struct wake_q_head *wake_q);
/**
* down - acquire the semaphore
@@ -178,13 +179,16 @@ EXPORT_SYMBOL(down_timeout);
void up(struct semaphore *sem)
{
unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
- __up(sem);
+ __up(sem, &wake_q);
raw_spin_unlock_irqrestore(&sem->lock, flags);
+ if (!wake_q_empty(&wake_q))
+ wake_up_q(&wake_q);
}
EXPORT_SYMBOL(up);
@@ -252,11 +256,12 @@ static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
-static noinline void __sched __up(struct semaphore *sem)
+static noinline void __sched __up(struct semaphore *sem,
+ struct wake_q_head *wake_q)
{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
list_del(&waiter->list);
waiter->up = true;
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}
--
2.39.5
[ Upstream commit 5ac9b4e935dfc6af41eee2ddc21deb5c36507a9f ]
>From memfd_secret(2) manpage:
The memory areas backing the file created with memfd_secret(2) are
visible only to the processes that have access to the file descriptor.
The memory region is removed from the kernel page tables and only the
page tables of the processes holding the file descriptor map the
corresponding physical memory. (Thus, the pages in the region can't be
accessed by the kernel itself, so that, for example, pointers to the
region can't be passed to system calls.)
We need to handle this special case gracefully in build ID fetching
code. Return -EFAULT whenever secretmem file is passed to build_id_parse()
family of APIs. Original report and repro can be found in [0].
[0] https://lore.kernel.org/bpf/ZwyG8Uro%2FSyTXAni@ly-workstation/
Fixes: de3ec364c3c3 ("lib/buildid: add single folio-based file reader abstraction")
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Suggested-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Link: https://lore.kernel.org/bpf/20241017175431.6183-A-hca@linux.ibm.com
Link: https://lore.kernel.org/bpf/20241017174713.2157873-1-andrii@kernel.org
[ Chen Linxuan: backport same logic without folio-based changes ]
Cc: stable(a)vger.kernel.org
Fixes: 88a16a130933 ("perf: Add build id data in mmap2 event")
Signed-off-by: Chen Linxuan <chenlinxuan(a)deepin.org>
---
v1 -> v2: use vma_is_secretmem() instead of directly checking
vma->vm_file->f_op == &secretmem_fops
---
lib/buildid.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/buildid.c b/lib/buildid.c
index 9fc46366597e..34315d09b544 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/secretmem.h>
#define BUILD_ID 3
@@ -157,6 +158,10 @@ int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
if (!vma->vm_file)
return -EINVAL;
+ /* reject secretmem */
+ if (vma_is_secretmem(vma))
+ return -EFAULT;
+
page = find_get_page(vma->vm_file->f_mapping, 0);
if (!page)
return -EFAULT; /* page not mapped */
--
2.48.1
From: Waiman Long <longman(a)redhat.com>
[ Upstream commit 85b2b9c16d053364e2004883140538e73b333cdb ]
A circular lock dependency splat has been seen involving down_trylock():
======================================================
WARNING: possible circular locking dependency detected
6.12.0-41.el10.s390x+debug
------------------------------------------------------
dd/32479 is trying to acquire lock:
0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90
but task is already holding lock:
000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0
the existing dependency chain (in reverse order) is:
-> #4 (&zone->lock){-.-.}-{2:2}:
-> #3 (hrtimer_bases.lock){-.-.}-{2:2}:
-> #2 (&rq->__lock){-.-.}-{2:2}:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
-> #0 ((console_sem).lock){-.-.}-{2:2}:
The console_sem -> pi_lock dependency is due to calling try_to_wake_up()
while holding the console_sem raw_spinlock. This dependency can be broken
by using wake_q to do the wakeup instead of calling try_to_wake_up()
under the console_sem lock. This will also make the semaphore's
raw_spinlock become a terminal lock without taking any further locks
underneath it.
The hrtimer_bases.lock is a raw_spinlock while zone->lock is a
spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via
the debug_objects_fill_pool() helper function in the debugobjects code.
-> #4 (&zone->lock){-.-.}-{2:2}:
__lock_acquire+0xe86/0x1cc0
lock_acquire.part.0+0x258/0x630
lock_acquire+0xb8/0xe0
_raw_spin_lock_irqsave+0xb4/0x120
rmqueue_bulk+0xac/0x8f0
__rmqueue_pcplist+0x580/0x830
rmqueue_pcplist+0xfc/0x470
rmqueue.isra.0+0xdec/0x11b0
get_page_from_freelist+0x2ee/0xeb0
__alloc_pages_noprof+0x2c2/0x520
alloc_pages_mpol_noprof+0x1fc/0x4d0
alloc_pages_noprof+0x8c/0xe0
allocate_slab+0x320/0x460
___slab_alloc+0xa58/0x12b0
__slab_alloc.isra.0+0x42/0x60
kmem_cache_alloc_noprof+0x304/0x350
fill_pool+0xf6/0x450
debug_object_activate+0xfe/0x360
enqueue_hrtimer+0x34/0x190
__run_hrtimer+0x3c8/0x4c0
__hrtimer_run_queues+0x1b2/0x260
hrtimer_interrupt+0x316/0x760
do_IRQ+0x9a/0xe0
do_irq_async+0xf6/0x160
Normally a raw_spinlock to spinlock dependency is not legitimate
and will be warned about if CONFIG_PROVE_RAW_LOCK_NESTING is enabled,
but debug_objects_fill_pool() is an exception as it explicitly
allows this dependency for non-PREEMPT_RT kernels without causing a
PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is
legitimate and not a bug.
Anyway, the semaphore is the only locking primitive left that still
uses try_to_wake_up() to do a wakeup inside the critical section; all the
other locking primitives have been migrated to use wake_q to do the
wakeup outside of the critical section. It is also possible that there are
other circular locking dependencies involving printk/console_sem or
other existing/new semaphores lurking somewhere which may show up in
the future. Let's just do the migration to wake_q now to avoid headaches
like this.
Reported-by: syzbot+ed801a886dfdbfe7136d(a)syzkaller.appspotmail.com
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/locking/semaphore.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 9aa855a96c4ae..aadde65402913 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -29,6 +29,7 @@
#include <linux/export.h>
#include <linux/sched.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/ftrace.h>
@@ -37,7 +38,7 @@ static noinline void __down(struct semaphore *sem);
static noinline int __down_interruptible(struct semaphore *sem);
static noinline int __down_killable(struct semaphore *sem);
static noinline int __down_timeout(struct semaphore *sem, long timeout);
-static noinline void __up(struct semaphore *sem);
+static noinline void __up(struct semaphore *sem, struct wake_q_head *wake_q);
/**
* down - acquire the semaphore
@@ -178,13 +179,16 @@ EXPORT_SYMBOL(down_timeout);
void up(struct semaphore *sem)
{
unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
- __up(sem);
+ __up(sem, &wake_q);
raw_spin_unlock_irqrestore(&sem->lock, flags);
+ if (!wake_q_empty(&wake_q))
+ wake_up_q(&wake_q);
}
EXPORT_SYMBOL(up);
@@ -252,11 +256,12 @@ static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
-static noinline void __sched __up(struct semaphore *sem)
+static noinline void __sched __up(struct semaphore *sem,
+ struct wake_q_head *wake_q)
{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
list_del(&waiter->list);
waiter->up = true;
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}
--
2.39.5
From: Eric Dumazet <edumazet(a)google.com>
tcp_abort() has the same issue as the one fixed in the prior patch
in tcp_write_err().
commit 5ce4645c23cf5f048eb8e9ce49e514bababdee85 upstream.
To apply commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4,
this patch must be applied first.
In order to get consistent results from tcp_poll(), we must call
sk_error_report() after tcp_done().
We can use tcp_done_with_error() to centralize this logic.
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Acked-by: Neal Cardwell <ncardwell(a)google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
[youngmin: Resolved minor conflict in net/ipv4/tcp.c]
Signed-off-by: Youngmin Nam <youngmin.nam(a)samsung.com>
---
net/ipv4/tcp.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7d591a0cf0c7..1ad3a20eb9b7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4755,13 +4755,9 @@ int tcp_abort(struct sock *sk, int err)
bh_lock_sock(sk);
if (!sock_flag(sk, SOCK_DEAD)) {
- WRITE_ONCE(sk->sk_err, err);
- /* This barrier is coupled with smp_rmb() in tcp_poll() */
- smp_wmb();
- sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
tcp_send_active_reset(sk, GFP_ATOMIC);
- tcp_done(sk);
+ tcp_done_with_error(sk, err);
}
bh_unlock_sock(sk);
--
2.39.2
[ Upstream commit 5ac9b4e935dfc6af41eee2ddc21deb5c36507a9f ]
>From memfd_secret(2) manpage:
The memory areas backing the file created with memfd_secret(2) are
visible only to the processes that have access to the file descriptor.
The memory region is removed from the kernel page tables and only the
page tables of the processes holding the file descriptor map the
corresponding physical memory. (Thus, the pages in the region can't be
accessed by the kernel itself, so that, for example, pointers to the
region can't be passed to system calls.)
We need to handle this special case gracefully in build ID fetching
code. Return -EFAULT whenever secretmem file is passed to build_id_parse()
family of APIs. Original report and repro can be found in [0].
[0] https://lore.kernel.org/bpf/ZwyG8Uro%2FSyTXAni@ly-workstation/
Fixes: de3ec364c3c3 ("lib/buildid: add single folio-based file reader abstraction")
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Suggested-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Link: https://lore.kernel.org/bpf/20241017175431.6183-A-hca@linux.ibm.com
Link: https://lore.kernel.org/bpf/20241017174713.2157873-1-andrii@kernel.org
[ Chen Linxuan: backport same logic without folio-based changes ]
Cc: stable(a)vger.kernel.org
Fixes: 88a16a130933 ("perf: Add build id data in mmap2 event")
Signed-off-by: Chen Linxuan <chenlinxuan(a)deepin.org>
---
v1 -> v2: use vma_is_secretmem() instead of directly checking
vma->vm_file->f_op == &secretmem_fops
v2 -> v3: keep original comment
---
lib/buildid.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/buildid.c b/lib/buildid.c
index 9fc46366597e..8d839ff5548e 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/secretmem.h>
#define BUILD_ID 3
@@ -157,6 +158,10 @@ int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
if (!vma->vm_file)
return -EINVAL;
+ /* reject secretmem folios created with memfd_secret() */
+ if (vma_is_secretmem(vma))
+ return -EFAULT;
+
page = find_get_page(vma->vm_file->f_mapping, 0);
if (!page)
return -EFAULT; /* page not mapped */
--
2.48.1
[ Upstream commit 5ac9b4e935dfc6af41eee2ddc21deb5c36507a9f ]
>From memfd_secret(2) manpage:
The memory areas backing the file created with memfd_secret(2) are
visible only to the processes that have access to the file descriptor.
The memory region is removed from the kernel page tables and only the
page tables of the processes holding the file descriptor map the
corresponding physical memory. (Thus, the pages in the region can't be
accessed by the kernel itself, so that, for example, pointers to the
region can't be passed to system calls.)
We need to handle this special case gracefully in build ID fetching
code. Return -EFAULT whenever secretmem file is passed to build_id_parse()
family of APIs. Original report and repro can be found in [0].
[0] https://lore.kernel.org/bpf/ZwyG8Uro%2FSyTXAni@ly-workstation/
Fixes: de3ec364c3c3 ("lib/buildid: add single folio-based file reader abstraction")
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Suggested-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Link: https://lore.kernel.org/bpf/20241017175431.6183-A-hca@linux.ibm.com
Link: https://lore.kernel.org/bpf/20241017174713.2157873-1-andrii@kernel.org
[ Chen Linxuan: backport same logic without folio-based changes ]
Cc: stable(a)vger.kernel.org
Fixes: 88a16a130933 ("perf: Add build id data in mmap2 event")
Signed-off-by: Chen Linxuan <chenlinxuan(a)deepin.org>
---
v1 -> v2: use vma_is_secretmem() instead of directly checking
vma->vm_file->f_op == &secretmem_fops
---
lib/buildid.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/buildid.c b/lib/buildid.c
index 9fc46366597e..34315d09b544 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/secretmem.h>
#define BUILD_ID 3
@@ -157,6 +158,10 @@ int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
if (!vma->vm_file)
return -EINVAL;
+ /* reject secretmem */
+ if (vma_is_secretmem(vma))
+ return -EFAULT;
+
page = find_get_page(vma->vm_file->f_mapping, 0);
if (!page)
return -EFAULT; /* page not mapped */
--
2.48.1
From: Waiman Long <longman(a)redhat.com>
[ Upstream commit 85b2b9c16d053364e2004883140538e73b333cdb ]
A circular lock dependency splat has been seen involving down_trylock():
======================================================
WARNING: possible circular locking dependency detected
6.12.0-41.el10.s390x+debug
------------------------------------------------------
dd/32479 is trying to acquire lock:
0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90
but task is already holding lock:
000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0
the existing dependency chain (in reverse order) is:
-> #4 (&zone->lock){-.-.}-{2:2}:
-> #3 (hrtimer_bases.lock){-.-.}-{2:2}:
-> #2 (&rq->__lock){-.-.}-{2:2}:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
-> #0 ((console_sem).lock){-.-.}-{2:2}:
The console_sem -> pi_lock dependency is due to calling try_to_wake_up()
while holding the console_sem raw_spinlock. This dependency can be broken
by using wake_q to do the wakeup instead of calling try_to_wake_up()
under the console_sem lock. This will also make the semaphore's
raw_spinlock become a terminal lock without taking any further locks
underneath it.
The hrtimer_bases.lock is a raw_spinlock while zone->lock is a
spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via
the debug_objects_fill_pool() helper function in the debugobjects code.
-> #4 (&zone->lock){-.-.}-{2:2}:
__lock_acquire+0xe86/0x1cc0
lock_acquire.part.0+0x258/0x630
lock_acquire+0xb8/0xe0
_raw_spin_lock_irqsave+0xb4/0x120
rmqueue_bulk+0xac/0x8f0
__rmqueue_pcplist+0x580/0x830
rmqueue_pcplist+0xfc/0x470
rmqueue.isra.0+0xdec/0x11b0
get_page_from_freelist+0x2ee/0xeb0
__alloc_pages_noprof+0x2c2/0x520
alloc_pages_mpol_noprof+0x1fc/0x4d0
alloc_pages_noprof+0x8c/0xe0
allocate_slab+0x320/0x460
___slab_alloc+0xa58/0x12b0
__slab_alloc.isra.0+0x42/0x60
kmem_cache_alloc_noprof+0x304/0x350
fill_pool+0xf6/0x450
debug_object_activate+0xfe/0x360
enqueue_hrtimer+0x34/0x190
__run_hrtimer+0x3c8/0x4c0
__hrtimer_run_queues+0x1b2/0x260
hrtimer_interrupt+0x316/0x760
do_IRQ+0x9a/0xe0
do_irq_async+0xf6/0x160
Normally a raw_spinlock to spinlock dependency is not legitimate
and will be warned about if CONFIG_PROVE_RAW_LOCK_NESTING is enabled,
but debug_objects_fill_pool() is an exception as it explicitly
allows this dependency for non-PREEMPT_RT kernels without causing a
PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is
legitimate and not a bug.
Anyway, the semaphore is the only locking primitive left that still
uses try_to_wake_up() to do a wakeup inside the critical section; all the
other locking primitives have been migrated to use wake_q to do the
wakeup outside of the critical section. It is also possible that there are
other circular locking dependencies involving printk/console_sem or
other existing/new semaphores lurking somewhere which may show up in
the future. Let's just do the migration to wake_q now to avoid headaches
like this.
Reported-by: syzbot+ed801a886dfdbfe7136d(a)syzkaller.appspotmail.com
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/locking/semaphore.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 9ee381e4d2a4d..a26c915430ba0 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -29,6 +29,7 @@
#include <linux/export.h>
#include <linux/sched.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/ftrace.h>
@@ -37,7 +38,7 @@ static noinline void __down(struct semaphore *sem);
static noinline int __down_interruptible(struct semaphore *sem);
static noinline int __down_killable(struct semaphore *sem);
static noinline int __down_timeout(struct semaphore *sem, long timeout);
-static noinline void __up(struct semaphore *sem);
+static noinline void __up(struct semaphore *sem, struct wake_q_head *wake_q);
/**
* down - acquire the semaphore
@@ -182,13 +183,16 @@ EXPORT_SYMBOL(down_timeout);
void up(struct semaphore *sem)
{
unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
- __up(sem);
+ __up(sem, &wake_q);
raw_spin_unlock_irqrestore(&sem->lock, flags);
+ if (!wake_q_empty(&wake_q))
+ wake_up_q(&wake_q);
}
EXPORT_SYMBOL(up);
@@ -256,11 +260,12 @@ static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
-static noinline void __sched __up(struct semaphore *sem)
+static noinline void __sched __up(struct semaphore *sem,
+ struct wake_q_head *wake_q)
{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
list_del(&waiter->list);
waiter->up = true;
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}
--
2.39.5
[ Upstream commit 5ac9b4e935dfc6af41eee2ddc21deb5c36507a9f ]
>From memfd_secret(2) manpage:
The memory areas backing the file created with memfd_secret(2) are
visible only to the processes that have access to the file descriptor.
The memory region is removed from the kernel page tables and only the
page tables of the processes holding the file descriptor map the
corresponding physical memory. (Thus, the pages in the region can't be
accessed by the kernel itself, so that, for example, pointers to the
region can't be passed to system calls.)
We need to handle this special case gracefully in build ID fetching
code. Return -EFAULT whenever secretmem file is passed to build_id_parse()
family of APIs. Original report and repro can be found in [0].
[0] https://lore.kernel.org/bpf/ZwyG8Uro%2FSyTXAni@ly-workstation/
Fixes: de3ec364c3c3 ("lib/buildid: add single folio-based file reader abstraction")
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Suggested-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Link: https://lore.kernel.org/bpf/20241017175431.6183-A-hca@linux.ibm.com
Link: https://lore.kernel.org/bpf/20241017174713.2157873-1-andrii@kernel.org
[ Chen Linxuan: backport same logic without folio-based changes ]
Cc: stable(a)vger.kernel.org
Fixes: 88a16a130933 ("perf: Add build id data in mmap2 event")
Signed-off-by: Chen Linxuan <chenlinxuan(a)deepin.org>
---
v1 -> v2: use vma_is_secretmem() instead of directly checking
vma->vm_file->f_op == &secretmem_fops
v2 -> v3: keep original comment
---
lib/buildid.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/buildid.c b/lib/buildid.c
index 9fc46366597e..8d839ff5548e 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/secretmem.h>
#define BUILD_ID 3
@@ -157,6 +158,10 @@ int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
if (!vma->vm_file)
return -EINVAL;
+ /* reject secretmem folios created with memfd_secret() */
+ if (vma_is_secretmem(vma))
+ return -EFAULT;
+
page = find_get_page(vma->vm_file->f_mapping, 0);
if (!page)
return -EFAULT; /* page not mapped */
--
2.48.1
From: Waiman Long <longman(a)redhat.com>
[ Upstream commit 85b2b9c16d053364e2004883140538e73b333cdb ]
A circular lock dependency splat has been seen involving down_trylock():
======================================================
WARNING: possible circular locking dependency detected
6.12.0-41.el10.s390x+debug
------------------------------------------------------
dd/32479 is trying to acquire lock:
0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90
but task is already holding lock:
000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0
the existing dependency chain (in reverse order) is:
-> #4 (&zone->lock){-.-.}-{2:2}:
-> #3 (hrtimer_bases.lock){-.-.}-{2:2}:
-> #2 (&rq->__lock){-.-.}-{2:2}:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
-> #0 ((console_sem).lock){-.-.}-{2:2}:
The console_sem -> pi_lock dependency is due to calling try_to_wake_up()
while holding the console_sem raw_spinlock. This dependency can be broken
by using wake_q to do the wakeup instead of calling try_to_wake_up()
under the console_sem lock. This will also make the semaphore's
raw_spinlock become a terminal lock without taking any further locks
underneath it.
The hrtimer_bases.lock is a raw_spinlock while zone->lock is a
spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via
the debug_objects_fill_pool() helper function in the debugobjects code.
-> #4 (&zone->lock){-.-.}-{2:2}:
__lock_acquire+0xe86/0x1cc0
lock_acquire.part.0+0x258/0x630
lock_acquire+0xb8/0xe0
_raw_spin_lock_irqsave+0xb4/0x120
rmqueue_bulk+0xac/0x8f0
__rmqueue_pcplist+0x580/0x830
rmqueue_pcplist+0xfc/0x470
rmqueue.isra.0+0xdec/0x11b0
get_page_from_freelist+0x2ee/0xeb0
__alloc_pages_noprof+0x2c2/0x520
alloc_pages_mpol_noprof+0x1fc/0x4d0
alloc_pages_noprof+0x8c/0xe0
allocate_slab+0x320/0x460
___slab_alloc+0xa58/0x12b0
__slab_alloc.isra.0+0x42/0x60
kmem_cache_alloc_noprof+0x304/0x350
fill_pool+0xf6/0x450
debug_object_activate+0xfe/0x360
enqueue_hrtimer+0x34/0x190
__run_hrtimer+0x3c8/0x4c0
__hrtimer_run_queues+0x1b2/0x260
hrtimer_interrupt+0x316/0x760
do_IRQ+0x9a/0xe0
do_irq_async+0xf6/0x160
Normally a raw_spinlock to spinlock dependency is not legitimate
and will be warned about if CONFIG_PROVE_RAW_LOCK_NESTING is enabled,
but debug_objects_fill_pool() is an exception as it explicitly
allows this dependency for non-PREEMPT_RT kernels without causing a
PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is
legitimate and not a bug.
Anyway, the semaphore is the only locking primitive left that still
uses try_to_wake_up() to do a wakeup inside the critical section; all the
other locking primitives have been migrated to use wake_q to do the
wakeup outside of the critical section. It is also possible that there are
other circular locking dependencies involving printk/console_sem or
other existing/new semaphores lurking somewhere which may show up in
the future. Let's just do the migration to wake_q now to avoid headaches
like this.
Reported-by: syzbot+ed801a886dfdbfe7136d(a)syzkaller.appspotmail.com
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/locking/semaphore.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 34bfae72f2952..de9117c0e671e 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -29,6 +29,7 @@
#include <linux/export.h>
#include <linux/sched.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/ftrace.h>
@@ -38,7 +39,7 @@ static noinline void __down(struct semaphore *sem);
static noinline int __down_interruptible(struct semaphore *sem);
static noinline int __down_killable(struct semaphore *sem);
static noinline int __down_timeout(struct semaphore *sem, long timeout);
-static noinline void __up(struct semaphore *sem);
+static noinline void __up(struct semaphore *sem, struct wake_q_head *wake_q);
/**
* down - acquire the semaphore
@@ -183,13 +184,16 @@ EXPORT_SYMBOL(down_timeout);
void __sched up(struct semaphore *sem)
{
unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
- __up(sem);
+ __up(sem, &wake_q);
raw_spin_unlock_irqrestore(&sem->lock, flags);
+ if (!wake_q_empty(&wake_q))
+ wake_up_q(&wake_q);
}
EXPORT_SYMBOL(up);
@@ -269,11 +273,12 @@ static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
-static noinline void __sched __up(struct semaphore *sem)
+static noinline void __sched __up(struct semaphore *sem,
+ struct wake_q_head *wake_q)
{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
list_del(&waiter->list);
waiter->up = true;
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}
--
2.39.5
The ufs-exynos driver enables the shareability option for gs101, which
means the descriptors need to be allocated as cacheable.
Fix the DT node and update the bindings to add the dma-coherent property.
This fixes the UFS stability issues we have seen with the upstream
UFS driver.
Note this DT fix can go in independently of the other UFS fixes series
I sent recently [1], as the bootloader already leaves the shareability
bits enabled.
regards,
Peter
[1] https://lore.kernel.org/linux-scsi/20250226220414.343659-1-peter.griffin@li…
To: André Draszik <andre.draszik(a)linaro.org>
To: Tudor Ambarus <tudor.ambarus(a)linaro.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Alim Akhtar <alim.akhtar(a)samsung.com>
To: Avri Altman <avri.altman(a)wdc.com>
To: Bart Van Assche <bvanassche(a)acm.org>
To: Martin K. Petersen <martin.petersen(a)oracle.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-samsung-soc(a)vger.kernel.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-scsi(a)vger.kernel.org
Cc: kernel-team(a)android.com
Cc: willmcvicker(a)google.com
Signed-off-by: Peter Griffin <peter.griffin(a)linaro.org>
---
Peter Griffin (2):
arm64: dts: exynos: gs101: ufs: add dma-coherent property
scsi: ufs: dt-bindings: exynos: add dma-coherent property for gs101
Documentation/devicetree/bindings/ufs/samsung,exynos-ufs.yaml | 2 ++
arch/arm64/boot/dts/exynos/google/gs101.dtsi | 1 +
2 files changed, 3 insertions(+)
---
base-commit: b323d8e7bc03d27dec646bfdccb7d1a92411f189
change-id: 20250314-ufs-dma-coherent-980f2467690d
Best regards,
--
Peter Griffin <peter.griffin(a)linaro.org>
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 5daa0c35a1f0e7a6c3b8ba9cb721e7d1ace6e619
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031621-july-parkway-796f@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5daa0c35a1f0e7a6c3b8ba9cb721e7d1ace6e619 Mon Sep 17 00:00:00 2001
From: Matthew Maurer <mmaurer(a)google.com>
Date: Wed, 8 Jan 2025 23:35:08 +0000
Subject: [PATCH] rust: Disallow BTF generation with Rust + LTO
The kernel cannot currently self-parse BTF containing Rust debug
information. pahole uses the language of the CU to determine whether to
filter out debug information when generating the BTF. When LTO is
enabled, Rust code can cross CU boundaries, resulting in Rust debug
information in CUs labeled as C. This results in a system which cannot
parse its own BTF.
Signed-off-by: Matthew Maurer <mmaurer(a)google.com>
Cc: stable(a)vger.kernel.org
Fixes: c1177979af9c ("btf, scripts: Exclude Rust CUs with pahole")
Link: https://lore.kernel.org/r/20250108-rust-btf-lto-incompat-v1-1-60243ff6d820@…
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b..324c2886b2ea 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1973,7 +1973,7 @@ config RUST
depends on !MODVERSIONS || GENDWARFKSYMS
depends on !GCC_PLUGIN_RANDSTRUCT
depends on !RANDSTRUCT
- depends on !DEBUG_INFO_BTF || PAHOLE_HAS_LANG_EXCLUDE
+ depends on !DEBUG_INFO_BTF || (PAHOLE_HAS_LANG_EXCLUDE && !LTO)
depends on !CFI_CLANG || HAVE_CFI_ICALL_NORMALIZE_INTEGERS_RUSTC
select CFI_ICALL_NORMALIZE_INTEGERS if CFI_CLANG
depends on !CALL_PADDING || RUSTC_VERSION >= 108100
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 5daa0c35a1f0e7a6c3b8ba9cb721e7d1ace6e619
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031622-rocker-narrow-b1e3@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5daa0c35a1f0e7a6c3b8ba9cb721e7d1ace6e619 Mon Sep 17 00:00:00 2001
From: Matthew Maurer <mmaurer(a)google.com>
Date: Wed, 8 Jan 2025 23:35:08 +0000
Subject: [PATCH] rust: Disallow BTF generation with Rust + LTO
The kernel cannot currently self-parse BTF containing Rust debug
information. pahole uses the language of the CU to determine whether to
filter out debug information when generating the BTF. When LTO is
enabled, Rust code can cross CU boundaries, resulting in Rust debug
information in CUs labeled as C. This results in a system which cannot
parse its own BTF.
Signed-off-by: Matthew Maurer <mmaurer(a)google.com>
Cc: stable(a)vger.kernel.org
Fixes: c1177979af9c ("btf, scripts: Exclude Rust CUs with pahole")
Link: https://lore.kernel.org/r/20250108-rust-btf-lto-incompat-v1-1-60243ff6d820@…
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b..324c2886b2ea 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1973,7 +1973,7 @@ config RUST
depends on !MODVERSIONS || GENDWARFKSYMS
depends on !GCC_PLUGIN_RANDSTRUCT
depends on !RANDSTRUCT
- depends on !DEBUG_INFO_BTF || PAHOLE_HAS_LANG_EXCLUDE
+ depends on !DEBUG_INFO_BTF || (PAHOLE_HAS_LANG_EXCLUDE && !LTO)
depends on !CFI_CLANG || HAVE_CFI_ICALL_NORMALIZE_INTEGERS_RUSTC
select CFI_ICALL_NORMALIZE_INTEGERS if CFI_CLANG
depends on !CALL_PADDING || RUSTC_VERSION >= 108100
The patch below does not apply to the 6.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.13.y
git checkout FETCH_HEAD
git cherry-pick -x 9360dfe4cbd62ff1eb8217b815964931523b75b3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031611-flatly-revolving-8c7b@gregkh' --subject-prefix 'PATCH 6.13.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9360dfe4cbd62ff1eb8217b815964931523b75b3 Mon Sep 17 00:00:00 2001
From: Andrea Righi <arighi(a)nvidia.com>
Date: Mon, 3 Mar 2025 18:51:59 +0100
Subject: [PATCH] sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.
To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable(a)vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi(a)nvidia.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0f1da199cfc7..7b9dfee858e7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6422,6 +6422,9 @@ static bool check_builtin_idle_enabled(void)
__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *is_idle)
{
+ if (!ops_cpu_valid(prev_cpu, NULL))
+ goto prev_cpu;
+
if (!check_builtin_idle_enabled())
goto prev_cpu;
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 9360dfe4cbd62ff1eb8217b815964931523b75b3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031614-broiler-overgrown-4bd4@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9360dfe4cbd62ff1eb8217b815964931523b75b3 Mon Sep 17 00:00:00 2001
From: Andrea Righi <arighi(a)nvidia.com>
Date: Mon, 3 Mar 2025 18:51:59 +0100
Subject: [PATCH] sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.
To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable(a)vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi(a)nvidia.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0f1da199cfc7..7b9dfee858e7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6422,6 +6422,9 @@ static bool check_builtin_idle_enabled(void)
__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *is_idle)
{
+ if (!ops_cpu_valid(prev_cpu, NULL))
+ goto prev_cpu;
+
if (!check_builtin_idle_enabled())
goto prev_cpu;
Set the MMC_CAP_NEED_RSP_BUSY capability for the sdhci-pxav3 host to
prevent conversion of R1B responses to R1. Without this, the eMMC card
in the samsung,coreprimevelte smartphone using the Marvell PXA1908 SoC
with this mmc host fails to probe with an ETIMEDOUT error originating
in __mmc_poll_for_busy.
Note that the other issues reported for this phone and host, namely
floods of "Tuning failed, falling back to fixed sampling clock" dmesg
messages for the eMMC and unstable SDIO, are not mitigated by this
change.
Link: https://lore.kernel.org/r/20200310153340.5593-1-ulf.hansson@linaro.org/
Link: https://lore.kernel.org/r/D7204PWIGQGI.1FRFQPPIEE2P9@matfyz.cz/
Link: https://lore.kernel.org/r/20250115-pxa1908-lkml-v14-0-847d24f3665a@skole.hr/
Cc: Duje Mihanović <duje.mihanovic(a)skole.hr>
Cc: stable(a)vger.kernel.org
Signed-off-by: Karel Balej <balejk(a)matfyz.cz>
---
drivers/mmc/host/sdhci-pxav3.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/mmc/host/sdhci-pxav3.c b/drivers/mmc/host/sdhci-pxav3.c
index 990723a008ae..3fb56face3d8 100644
--- a/drivers/mmc/host/sdhci-pxav3.c
+++ b/drivers/mmc/host/sdhci-pxav3.c
@@ -399,6 +399,7 @@ static int sdhci_pxav3_probe(struct platform_device *pdev)
if (!IS_ERR(pxa->clk_core))
clk_prepare_enable(pxa->clk_core);
+ host->mmc->caps |= MMC_CAP_NEED_RSP_BUSY;
/* enable 1/8V DDR capable */
host->mmc->caps |= MMC_CAP_1_8V_DDR;
--
2.48.1
Add device awake calls to the rproc boot and rproc shutdown paths.
Currently, the device awake call is only present in the recovery path
of remoteproc. If a user stops and starts rproc via the sysfs
interface, then firmware loading fails on pm suspension. Keep the
device awake in such a case, just as is done for the recovery path.
Fixes: a781e5aa59110 ("remoteproc: core: Prevent system suspend during remoteproc recovery")
Signed-off-by: Souradeep Chowdhury <quic_schowdhu(a)quicinc.com>
Cc: stable(a)vger.kernel.org
---
drivers/remoteproc/remoteproc_core.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index c2cf0d277729..908a7b8f6c7e 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1916,7 +1916,8 @@ int rproc_boot(struct rproc *rproc)
pr_err("invalid rproc handle\n");
return -EINVAL;
}
-
+
+ pm_stay_awake(rproc->dev.parent);
dev = &rproc->dev;
ret = mutex_lock_interruptible(&rproc->lock);
@@ -1961,6 +1962,7 @@ int rproc_boot(struct rproc *rproc)
atomic_dec(&rproc->power);
unlock_mutex:
mutex_unlock(&rproc->lock);
+ pm_relax(rproc->dev.parent);
return ret;
}
EXPORT_SYMBOL(rproc_boot);
@@ -1991,6 +1993,7 @@ int rproc_shutdown(struct rproc *rproc)
struct device *dev = &rproc->dev;
int ret = 0;
+ pm_stay_awake(rproc->dev.parent);
ret = mutex_lock_interruptible(&rproc->lock);
if (ret) {
dev_err(dev, "can't lock rproc %s: %d\n", rproc->name, ret);
@@ -2027,6 +2030,7 @@ int rproc_shutdown(struct rproc *rproc)
rproc->table_ptr = NULL;
out:
mutex_unlock(&rproc->lock);
+ pm_relax(rproc->dev.parent);
return ret;
}
EXPORT_SYMBOL(rproc_shutdown);
--
2.34.1
The quilt patch titled
Subject: sparc/mm: avoid calling arch_enter/leave_lazy_mmu() in set_ptes
has been removed from the -mm tree. Its filename was
sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: sparc/mm: avoid calling arch_enter/leave_lazy_mmu() in set_ptes
Date: Mon, 3 Mar 2025 14:15:38 +0000
With commit 1a10a44dfc1d ("sparc64: implement the new page table range
API"), set_ptes was added to the sparc architecture. The implementation
included arch_enter/leave_lazy_mmu() calls.
This patch removes the usage of arch_enter/leave_lazy_mmu() since it
implies nesting of lazy mmu regions, which is not supported. Without this
fix, lazy mmu mode is effectively disabled because we exit the mode after
the first set_ptes:
remap_pte_range()
-> arch_enter_lazy_mmu()
-> set_ptes()
-> arch_enter_lazy_mmu()
-> arch_leave_lazy_mmu()
-> arch_leave_lazy_mmu()
Powerpc suffered the same problem and fixed it in a corresponding way with
commit 47b8def9358c ("powerpc/mm: Avoid calling
arch_enter/leave_lazy_mmu() in set_ptes").
Link: https://lkml.kernel.org/r/20250303141542.3371656-5-ryan.roberts@arm.com
Fixes: 1a10a44dfc1d ("sparc64: implement the new page table range API")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Andreas Larsson <andreas(a)gaisler.com>
Acked-by: Juergen Gross <jgross(a)suse.com>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Matthew Wilcow (Oracle) <willy(a)infradead.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/sparc/include/asm/pgtable_64.h | 2 --
1 file changed, 2 deletions(-)
--- a/arch/sparc/include/asm/pgtable_64.h~sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes
+++ a/arch/sparc/include/asm/pgtable_64.h
@@ -936,7 +936,6 @@ static inline void __set_pte_at(struct m
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
{
- arch_enter_lazy_mmu_mode();
for (;;) {
__set_pte_at(mm, addr, ptep, pte, 0);
if (--nr == 0)
@@ -945,7 +944,6 @@ static inline void set_ptes(struct mm_st
pte_val(pte) += PAGE_SIZE;
addr += PAGE_SIZE;
}
- arch_leave_lazy_mmu_mode();
}
#define set_ptes set_ptes
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
mm-use-ptep_get-instead-of-directly-dereferencing-pte_t.patch
The quilt patch titled
Subject: sparc/mm: disable preemption in lazy mmu mode
has been removed from the -mm tree. Its filename was
sparc-mm-disable-preemption-in-lazy-mmu-mode.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: sparc/mm: disable preemption in lazy mmu mode
Date: Mon, 3 Mar 2025 14:15:37 +0000
Since commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy
updates") it's been possible for arch_[enter|leave]_lazy_mmu_mode() to be
called without holding a page table lock (for the kernel mappings case),
and therefore it is possible that preemption may occur while in the lazy
mmu mode. The Sparc lazy mmu implementation is not robust to preemption
since it stores the lazy mode state in a per-cpu structure and does not
attempt to manage that state on task switch.
Powerpc had the same issue and fixed it by explicitly disabling preemption
in arch_enter_lazy_mmu_mode() and re-enabling in
arch_leave_lazy_mmu_mode(). See commit b9ef323ea168 ("powerpc/64s:
Disable preemption in hash lazy mmu mode").
Given Sparc's lazy mmu mode is based on powerpc's, let's fix it in the
same way here.
Link: https://lkml.kernel.org/r/20250303141542.3371656-4-ryan.roberts@arm.com
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Andreas Larsson <andreas(a)gaisler.com>
Acked-by: Juergen Gross <jgross(a)suse.com>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Matthew Wilcow (Oracle) <willy(a)infradead.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/sparc/mm/tlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/arch/sparc/mm/tlb.c~sparc-mm-disable-preemption-in-lazy-mmu-mode
+++ a/arch/sparc/mm/tlb.c
@@ -52,8 +52,10 @@ out:
void arch_enter_lazy_mmu_mode(void)
{
- struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
+ struct tlb_batch *tb;
+ preempt_disable();
+ tb = this_cpu_ptr(&tlb_batch);
tb->active = 1;
}
@@ -64,6 +66,7 @@ void arch_leave_lazy_mmu_mode(void)
if (tb->tlb_nr)
flush_tlb_pending();
tb->active = 0;
+ preempt_enable();
}
static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
mm-use-ptep_get-instead-of-directly-dereferencing-pte_t.patch
The quilt patch titled
Subject: mm: fix lazy mmu docs and usage
has been removed from the -mm tree. Its filename was
mm-fix-lazy-mmu-docs-and-usage.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: mm: fix lazy mmu docs and usage
Date: Mon, 3 Mar 2025 14:15:35 +0000
Patch series "Fix lazy mmu mode", v2.
I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As
part of that, I will extend lazy mmu mode to cover kernel mappings in
vmalloc table walkers. While lazy mmu mode is already used for kernel
mappings in a few places, this will extend its use significantly.
Having reviewed the existing lazy mmu implementations in powerpc, sparc
and x86, it looks like there are a bunch of bugs, some of which may be
more likely to trigger once I extend the use of lazy mmu. So this series
attempts to clarify the requirements and fix all the bugs in advance of
that series. See patch #1 commit log for all the details.
This patch (of 5):
The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode() are
a bit of a mess (to put it politely). There are a number of issues
related to nesting of lazy mmu regions and confusion over whether the
task, when in a lazy mmu region, is preemptible or not. Fix all the
issues relating to the core-mm. Follow up commits will fix the
arch-specific implementations. 3 arches implement lazy mmu; powerpc,
sparc and x86.
When arch_[enter|leave]_lazy_mmu_mode() was first introduced by commit
6606c3e0da53 ("[PATCH] paravirt: lazy mmu mode hooks.patch"), it was
expected that lazy mmu regions would never nest and that the appropriate
page table lock(s) would be held while in the region, thus ensuring the
region is non-preemptible. Additionally lazy mmu regions were only used
during manipulation of user mappings.
Commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy
updates") started invoking the lazy mmu mode in apply_to_pte_range(),
which is used for both user and kernel mappings. For kernel mappings the
region is no longer protected by any lock so there is no longer any
guarantee about non-preemptibility. Additionally, for RT configs,
holding the PTL only implies no CPU migration; it doesn't prevent
preemption.
Commit bcc6cc832573 ("mm: add default definition of set_ptes()") added
arch_[enter|leave]_lazy_mmu_mode() to the default implementation of
set_ptes(), used by x86. So after this commit, lazy mmu regions can be
nested. Additionally commit 1a10a44dfc1d ("sparc64: implement the new
page table range API") and commit 9fee28baa601 ("powerpc: implement the
new page table range API") did the same for the sparc and powerpc
set_ptes() overrides.
powerpc couldn't deal with preemption so avoids it in commit b9ef323ea168
("powerpc/64s: Disable preemption in hash lazy mmu mode"), which
explicitly disables preemption for the whole region in its implementation.
x86 can support preemption (or at least it could until it tried to add
support for nesting; more on this below). Sparc looks to be totally broken in
the face of preemption, as far as I can tell.
powerpc can't deal with nesting, so avoids it in commit 47b8def9358c
("powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes"),
which removes the lazy mmu calls from its implementation of set_ptes().
x86 attempted to support nesting in commit 49147beb0ccb ("x86/xen: allow
nesting of same lazy mode") but as far as I can tell, this breaks its
support for preemption.
In short, it's all a mess; the semantics for
arch_[enter|leave]_lazy_mmu_mode() are not clearly defined and as a result
the implementations all have different expectations, sticking plasters and
bugs.
arm64 is aiming to start using these hooks, so let's clean everything up
before adding an arm64 implementation. Update the documentation to state
that lazy mmu regions can never be nested, must not be called in interrupt
context and preemption may or may not be enabled for the duration of the
region. And fix the generic implementation of set_ptes() to avoid
nesting.
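To illustrate the contract being documented (a sketch only, not code
from the patch): a conforming caller batches its PTE updates inside a
single, non-nested lazy mmu region, never enters it from interrupt
context, and must not assume preemption is disabled within it:

    arch_enter_lazy_mmu_mode();     /* no nesting, no interrupt context */
    for (;;) {
        set_pte(ptep, pte);         /* may be batched/deferred by the arch */
        if (--nr == 0)
            break;
        ptep++;
        pte = pte_next_pfn(pte);
    }
    arch_leave_lazy_mmu_mode();     /* arch flushes any pending updates */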
Arch-specific fixes to conform to the new spec will follow this one.
These issues were spotted by code review and I have no evidence of issues
being reported in the wild.
Link: https://lkml.kernel.org/r/20250303141542.3371656-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20250303141542.3371656-2-ryan.roberts@arm.com
Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Juergen Gross <jgross(a)suse.com>
Cc: Andreas Larsson <andreas(a)gaisler.com>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Juegren Gross <jgross(a)suse.com>
Cc: Matthew Wilcow (Oracle) <willy(a)infradead.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/pgtable.h | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
--- a/include/linux/pgtable.h~mm-fix-lazy-mmu-docs-and-usage
+++ a/include/linux/pgtable.h
@@ -222,10 +222,14 @@ static inline int pmd_dirty(pmd_t pmd)
* hazard could result in the direct mode hypervisor case, since the actual
* write to the page tables may not yet have taken place, so reads though
* a raw PTE pointer after it has been modified are not guaranteed to be
- * up to date. This mode can only be entered and left under the protection of
- * the page table locks for all page tables which may be modified. In the UP
- * case, this is required so that preemption is disabled, and in the SMP case,
- * it must synchronize the delayed page table writes properly on other CPUs.
+ * up to date.
+ *
+ * In the general case, no lock is guaranteed to be held between entry and exit
+ * of the lazy mode. So the implementation must assume preemption may be enabled
+ * and cpu migration is possible; it must take steps to be robust against this.
+ * (In practice, for user PTE updates, the appropriate page table lock(s) are
+ * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
+ * and the mode cannot be used in interrupt context.
*/
#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
#define arch_enter_lazy_mmu_mode() do {} while (0)
@@ -287,7 +291,6 @@ static inline void set_ptes(struct mm_st
{
page_table_check_ptes_set(mm, ptep, pte, nr);
- arch_enter_lazy_mmu_mode();
for (;;) {
set_pte(ptep, pte);
if (--nr == 0)
@@ -295,7 +298,6 @@ static inline void set_ptes(struct mm_st
ptep++;
pte = pte_next_pfn(pte);
}
- arch_leave_lazy_mmu_mode();
}
#endif
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
mm-use-ptep_get-instead-of-directly-dereferencing-pte_t.patch
The quilt patch titled
Subject: mm: make page_mapped_in_vma() hugetlb walk aware
has been removed from the -mm tree. Its filename was
mm-make-page_mapped_in_vma-hugetlb-walk-aware.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jane Chu <jane.chu(a)oracle.com>
Subject: mm: make page_mapped_in_vma() hugetlb walk aware
Date: Mon, 24 Feb 2025 14:14:45 -0700
When a process consumes a UE in a page, the memory failure handler
attempts to collect information for a potential SIGBUS. If the page is an
anonymous page, page_mapped_in_vma(page, vma) is invoked in order to
1. retrieve the vaddr from the process' address space,
2. verify that the vaddr is indeed mapped to the poisoned page,
where 'page' is the precise small page with UE.
It's been observed that when injecting poison into a non-head subpage of an
anonymous hugetlb page, no SIGBUS shows up, while injecting into the head
page produces a SIGBUS. The cause is that, though hugetlb_walk() returns
a valid pmd entry (on x86), check_pte() detects a mismatch between the
head page per the pmd and the input subpage. Thus the vaddr is considered
not mapped to the subpage and the process is not collected for SIGBUS
purposes. This is the calling stack:
collect_procs_anon
page_mapped_in_vma
page_vma_mapped_walk
hugetlb_walk
huge_pte_lock
check_pte
The check_pte() header says that it will
"check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is mapped at the @pvmw->pte",
but in practice it works only if pvmw->pfn is the head page pfn at pvmw->pte.
In hindsight, some pvmw->pte entries could point to a hugepage of some
sort, so it makes sense to make check_pte() work for hugepages.
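A minimal sketch of the overlap test the fix below introduces (the
names mirror the patch; the example pfn values are made up for
illustration):

    /*
     * Ranges overlap iff neither one ends before the other starts.
     * mapped range: [pfn, pfn + pte_nr), e.g. a PMD-mapped hugetlb
     *               head pfn 0x1000 with pte_nr = 512
     * wanted range: [pvmw_pfn, pvmw_pfn + nr_pages), e.g. the
     *               poisoned subpage pfn 0x1003 with nr_pages = 1
     */
    static bool pfn_ranges_overlap(unsigned long pfn, unsigned long pte_nr,
                                   unsigned long pvmw_pfn,
                                   unsigned long nr_pages)
    {
        if ((pfn + pte_nr - 1) < pvmw_pfn)
            return false;   /* mapped range ends before the target */
        if (pfn > (pvmw_pfn + nr_pages - 1))
            return false;   /* mapped range starts after the target */
        return true;        /* 0x1003 lies within [0x1000, 0x1200) */
    }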
Link: https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu(a)oracle.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Kirill A. Shuemov <kirill.shutemov(a)linux.intel.com>
Cc: linmiaohe <linmiaohe(a)huawei.com>
Cc: Matthew Wilcow (Oracle) <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
--- a/mm/page_vma_mapped.c~mm-make-page_mapped_in_vma-hugetlb-walk-aware
+++ a/mm/page_vma_mapped.c
@@ -84,6 +84,7 @@ again:
* mapped at the @pvmw->pte
* @pvmw: page_vma_mapped_walk struct, includes a pair pte and pfn range
* for checking
+ * @pte_nr: the number of small pages described by @pvmw->pte.
*
* page_vma_mapped_walk() found a place where pfn range is *potentially*
* mapped. check_pte() has to validate this.
@@ -100,7 +101,7 @@ again:
* Otherwise, return false.
*
*/
-static bool check_pte(struct page_vma_mapped_walk *pvmw)
+static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
{
unsigned long pfn;
pte_t ptent = ptep_get(pvmw->pte);
@@ -132,7 +133,11 @@ static bool check_pte(struct page_vma_ma
pfn = pte_pfn(ptent);
}
- return (pfn - pvmw->pfn) < pvmw->nr_pages;
+ if ((pfn + pte_nr - 1) < pvmw->pfn)
+ return false;
+ if (pfn > (pvmw->pfn + pvmw->nr_pages - 1))
+ return false;
+ return true;
}
/* Returns true if the two ranges overlap. Careful to not overflow. */
@@ -207,7 +212,7 @@ bool page_vma_mapped_walk(struct page_vm
return false;
pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
- if (!check_pte(pvmw))
+ if (!check_pte(pvmw, pages_per_huge_page(hstate)))
return not_found(pvmw);
return true;
}
@@ -290,7 +295,7 @@ restart:
goto next_pte;
}
this_pte:
- if (check_pte(pvmw))
+ if (check_pte(pvmw, 1))
return true;
next_pte:
do {
_
Patches currently in -mm which might be from jane.chu(a)oracle.com are
The quilt patch titled
Subject: mm/damon: avoid applying DAMOS action to same entity multiple times
has been removed from the -mm tree. Its filename was
mm-damon-avoid-applying-damos-action-to-same-entity-multiple-times.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon: avoid applying DAMOS action to same entity multiple times
Date: Fri, 7 Feb 2025 13:20:33 -0800
The 'paddr' DAMON operations set can apply a DAMOS scheme's action to a large
folio multiple times in a single DAMOS-regions-walk if the folio lies on
multiple DAMON regions. Add a field to the DAMOS scheme object that can be
used by the underlying ops to know the last entity that the
scheme's action has been applied to. The core layer unsets the field when
each DAMOS-regions-walk is done for the given scheme. Update the 'paddr'
ops to use the infrastructure to avoid the problem.
Link: https://lkml.kernel.org/r/20250207212033.45269-3-sj@kernel.org
Fixes: 57223ac29584 ("mm/damon/paddr: support the pageout scheme")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reported-by: Usama Arif <usamaarif642(a)gmail.com>
Closes: https://lore.kernel.org/20250203225604.44742-3-usamaarif642@gmail.com
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/damon.h | 11 +++++++++++
mm/damon/core.c | 1 +
mm/damon/paddr.c | 39 +++++++++++++++++++++++++++------------
3 files changed, 39 insertions(+), 12 deletions(-)
--- a/include/linux/damon.h~mm-damon-avoid-applying-damos-action-to-same-entity-multiple-times
+++ a/include/linux/damon.h
@@ -432,6 +432,7 @@ struct damos_access_pattern {
* @wmarks: Watermarks for automated (in)activation of this scheme.
* @target_nid: Destination node if @action is "migrate_{hot,cold}".
* @filters: Additional set of &struct damos_filter for &action.
+ * @last_applied: Last @action applied ops-managing entity.
* @stat: Statistics of this scheme.
* @list: List head for siblings.
*
@@ -454,6 +455,15 @@ struct damos_access_pattern {
* implementation could check pages of the region and skip &action to respect
* &filters
*
+ * The minimum entity that @action can be applied depends on the underlying
+ * &struct damon_operations. Since it may not be aligned with the core layer
+ * abstract, namely &struct damon_region, &struct damon_operations could apply
+ * @action to same entity multiple times. Large folios that underlying on
+ * multiple &struct damon region objects could be such examples. The &struct
+ * damon_operations can use @last_applied to avoid that. DAMOS core logic
+ * unsets @last_applied when each regions walking for applying the scheme is
+ * finished.
+ *
* After applying the &action to each region, &stat_count and &stat_sz is
* updated to reflect the number of regions and total size of regions that the
* &action is applied.
@@ -482,6 +492,7 @@ struct damos {
int target_nid;
};
struct list_head filters;
+ void *last_applied;
struct damos_stat stat;
struct list_head list;
};
--- a/mm/damon/core.c~mm-damon-avoid-applying-damos-action-to-same-entity-multiple-times
+++ a/mm/damon/core.c
@@ -1856,6 +1856,7 @@ static void kdamond_apply_schemes(struct
s->next_apply_sis = c->passed_sample_intervals +
(s->apply_interval_us ? s->apply_interval_us :
c->attrs.aggr_interval) / sample_interval;
+ s->last_applied = NULL;
}
}
--- a/mm/damon/paddr.c~mm-damon-avoid-applying-damos-action-to-same-entity-multiple-times
+++ a/mm/damon/paddr.c
@@ -254,6 +254,17 @@ static bool damos_pa_filter_out(struct d
return false;
}
+static bool damon_pa_invalid_damos_folio(struct folio *folio, struct damos *s)
+{
+ if (!folio)
+ return true;
+ if (folio == s->last_applied) {
+ folio_put(folio);
+ return true;
+ }
+ return false;
+}
+
static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s,
unsigned long *sz_filter_passed)
{
@@ -261,6 +272,7 @@ static unsigned long damon_pa_pageout(st
LIST_HEAD(folio_list);
bool install_young_filter = true;
struct damos_filter *filter;
+ struct folio *folio;
/* check access in page level again by default */
damos_for_each_filter(filter, s) {
@@ -279,9 +291,8 @@ static unsigned long damon_pa_pageout(st
addr = r->ar.start;
while (addr < r->ar.end) {
- struct folio *folio = damon_get_folio(PHYS_PFN(addr));
-
- if (!folio) {
+ folio = damon_get_folio(PHYS_PFN(addr));
+ if (damon_pa_invalid_damos_folio(folio, s)) {
addr += PAGE_SIZE;
continue;
}
@@ -307,6 +318,7 @@ put_folio:
damos_destroy_filter(filter);
applied = reclaim_pages(&folio_list);
cond_resched();
+ s->last_applied = folio;
return applied * PAGE_SIZE;
}
@@ -315,12 +327,12 @@ static inline unsigned long damon_pa_mar
unsigned long *sz_filter_passed)
{
unsigned long addr, applied = 0;
+ struct folio *folio;
addr = r->ar.start;
while (addr < r->ar.end) {
- struct folio *folio = damon_get_folio(PHYS_PFN(addr));
-
- if (!folio) {
+ folio = damon_get_folio(PHYS_PFN(addr));
+ if (damon_pa_invalid_damos_folio(folio, s)) {
addr += PAGE_SIZE;
continue;
}
@@ -339,6 +351,7 @@ put_folio:
addr += folio_size(folio);
folio_put(folio);
}
+ s->last_applied = folio;
return applied * PAGE_SIZE;
}
@@ -482,12 +495,12 @@ static unsigned long damon_pa_migrate(st
{
unsigned long addr, applied;
LIST_HEAD(folio_list);
+ struct folio *folio;
addr = r->ar.start;
while (addr < r->ar.end) {
- struct folio *folio = damon_get_folio(PHYS_PFN(addr));
-
- if (!folio) {
+ folio = damon_get_folio(PHYS_PFN(addr));
+ if (damon_pa_invalid_damos_folio(folio, s)) {
addr += PAGE_SIZE;
continue;
}
@@ -506,6 +519,7 @@ put_folio:
}
applied = damon_pa_migrate_pages(&folio_list, s->target_nid);
cond_resched();
+ s->last_applied = folio;
return applied * PAGE_SIZE;
}
@@ -523,15 +537,15 @@ static unsigned long damon_pa_stat(struc
{
unsigned long addr;
LIST_HEAD(folio_list);
+ struct folio *folio;
if (!damon_pa_scheme_has_filter(s))
return 0;
addr = r->ar.start;
while (addr < r->ar.end) {
- struct folio *folio = damon_get_folio(PHYS_PFN(addr));
-
- if (!folio) {
+ folio = damon_get_folio(PHYS_PFN(addr));
+ if (damon_pa_invalid_damos_folio(folio, s)) {
addr += PAGE_SIZE;
continue;
}
@@ -541,6 +555,7 @@ static unsigned long damon_pa_stat(struc
addr += folio_size(folio);
folio_put(folio);
}
+ s->last_applied = folio;
return 0;
}
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-sysfs-schemes-let-damon_sysfs_scheme_set_filters-be-used-for-different-named-directories.patch
mm-damon-sysfs-schemes-implement-core_filters-and-ops_filters-directories.patch
mm-damon-sysfs-schemes-commit-filters-in-coreops_filters-directories.patch
mm-damon-core-expose-damos_filter_for_ops-to-damon-kernel-api-callers.patch
mm-damon-sysfs-schemes-record-filters-of-which-layer-should-be-added-to-the-given-filters-directory.patch
mm-damon-sysfs-schemes-return-error-when-for-attempts-to-install-filters-on-wrong-sysfs-directory.patch
docs-abi-damon-document-coreops_filters-directories.patch
docs-admin-guide-mm-damon-usage-update-for-coreops_filters-directories.patch
mm-damon-sysfs-validate-user-inputs-from-damon_sysfs_commit_input.patch
mm-damon-core-invoke-kdamond_call-after-merging-is-done-if-possible.patch
mm-damon-core-make-damon_set_attrs-be-safe-to-be-called-from-damon_call.patch
mm-damon-sysfs-handle-commit-command-using-damon_call.patch
mm-damon-sysfs-remove-damon_sysfs_cmd_request-code-from-damon_sysfs_handle_cmd.patch
mm-damon-sysfs-remove-damon_sysfs_cmd_request_callback-and-its-callers.patch
mm-damon-sysfs-remove-damon_sysfs_cmd_request-and-its-readers.patch
mm-damon-sysfs-schemes-remove-obsolete-comment-for-damon_sysfs_schemes_clear_regions.patch
mm-damon-remove-damon_callback-private.patch
mm-damon-remove-before_start-of-damon_callback.patch
mm-damon-remove-damon_callback-after_sampling.patch
mm-damon-remove-damon_callback-before_damos_apply.patch
mm-damon-remove-damon_operations-reset_aggregated.patch
mm-damon-sysfs-schemes-avoid-wformat-security-warning-on-damon_sysfs_access_pattern_add_range_dir.patch
mm-madvise-use-is_memory_failure-from-madvise_do_behavior.patch
mm-madvise-split-out-populate-behavior-check-logic.patch
mm-madvise-deduplicate-madvise_do_behavior-skip-case-handlings.patch
mm-madvise-remove-len-parameter-of-madvise_do_behavior.patch
The quilt patch titled
Subject: mm/damon/ops: have damon_get_folio return folio even for tail pages
has been removed from the -mm tree. Its filename was
mm-damon-ops-have-damon_get_folio-return-folio-even-for-tail-pages.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Usama Arif <usamaarif642(a)gmail.com>
Subject: mm/damon/ops: have damon_get_folio return folio even for tail pages
Date: Fri, 7 Feb 2025 13:20:32 -0800
Patch series "mm/damon/paddr: fix large folios access and schemes handling".
DAMON operations set for physical address space, namely 'paddr', treats
tail pages as unaccessed always. It can also apply DAMOS action to a
large folio multiple times within single DAMOS' regions walking. As a
result, the monitoring output has poor quality and DAMOS works in
unexpected ways when large folios are being used. Fix those.
The patches were parts of Usama's hugepage_size DAMOS filter patch
series[1]. The first fix has collected from there with a slight commit
message change for the subject prefix. The second fix is re-written by SJ
and posted as an RFC before this series. The second one also got a slight
commit message change for the subject prefix.
[1] https://lore.kernel.org/20250203225604.44742-1-usamaarif642@gmail.com
[2] https://lore.kernel.org/20250206231103.38298-1-sj@kernel.org
This patch (of 2):
This effectively adds support for large folios in damon for paddr, as
damon_pa_mkold/young won't get a null folio from this function and won't
ignore it, hence access will be checked and reported. This also means
that larger folios will be considered for different DAMOS actions like
pageout, prioritization and migration. As these DAMOS actions will
consider larger folios, iterate through the region at folio_size rather
than PAGE_SIZE intervals. This should not have an effect on vaddr, as
damon_young_pmd_entry considers pmd entries.
Link: https://lkml.kernel.org/r/20250207212033.45269-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250207212033.45269-2-sj@kernel.org
Fixes: a28397beb55b ("mm/damon: implement primitives for physical address space monitoring")
Signed-off-by: Usama Arif <usamaarif642(a)gmail.com>
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reviewed-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/ops-common.c | 2 +-
mm/damon/paddr.c | 24 ++++++++++++++++++------
2 files changed, 19 insertions(+), 7 deletions(-)
--- a/mm/damon/ops-common.c~mm-damon-ops-have-damon_get_folio-return-folio-even-for-tail-pages
+++ a/mm/damon/ops-common.c
@@ -26,7 +26,7 @@ struct folio *damon_get_folio(unsigned l
struct page *page = pfn_to_online_page(pfn);
struct folio *folio;
- if (!page || PageTail(page))
+ if (!page)
return NULL;
folio = page_folio(page);
--- a/mm/damon/paddr.c~mm-damon-ops-have-damon_get_folio-return-folio-even-for-tail-pages
+++ a/mm/damon/paddr.c
@@ -277,11 +277,14 @@ static unsigned long damon_pa_pageout(st
damos_add_filter(s, filter);
}
- for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) {
+ addr = r->ar.start;
+ while (addr < r->ar.end) {
struct folio *folio = damon_get_folio(PHYS_PFN(addr));
- if (!folio)
+ if (!folio) {
+ addr += PAGE_SIZE;
continue;
+ }
if (damos_pa_filter_out(s, folio))
goto put_folio;
@@ -297,6 +300,7 @@ static unsigned long damon_pa_pageout(st
else
list_add(&folio->lru, &folio_list);
put_folio:
+ addr += folio_size(folio);
folio_put(folio);
}
if (install_young_filter)
@@ -312,11 +316,14 @@ static inline unsigned long damon_pa_mar
{
unsigned long addr, applied = 0;
- for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) {
+ addr = r->ar.start;
+ while (addr < r->ar.end) {
struct folio *folio = damon_get_folio(PHYS_PFN(addr));
- if (!folio)
+ if (!folio) {
+ addr += PAGE_SIZE;
continue;
+ }
if (damos_pa_filter_out(s, folio))
goto put_folio;
@@ -329,6 +336,7 @@ static inline unsigned long damon_pa_mar
folio_deactivate(folio);
applied += folio_nr_pages(folio);
put_folio:
+ addr += folio_size(folio);
folio_put(folio);
}
return applied * PAGE_SIZE;
@@ -475,11 +483,14 @@ static unsigned long damon_pa_migrate(st
unsigned long addr, applied;
LIST_HEAD(folio_list);
- for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) {
+ addr = r->ar.start;
+ while (addr < r->ar.end) {
struct folio *folio = damon_get_folio(PHYS_PFN(addr));
- if (!folio)
+ if (!folio) {
+ addr += PAGE_SIZE;
continue;
+ }
if (damos_pa_filter_out(s, folio))
goto put_folio;
@@ -490,6 +501,7 @@ static unsigned long damon_pa_migrate(st
goto put_folio;
list_add(&folio->lru, &folio_list);
put_folio:
+ addr += folio_size(folio);
folio_put(folio);
}
applied = damon_pa_migrate_pages(&folio_list, s->target_nid);
_
Patches currently in -mm which might be from usamaarif642(a)gmail.com are
The quilt patch titled
Subject: mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
has been removed from the -mm tree. Its filename was
mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
Date: Mon, 10 Feb 2025 20:37:44 +0100
Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP,
let's add a safety net in case FOLL_SPLIT_PMD usage would ever be
reworked.
In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the
generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have
returned a page. In particular, hugetlb folios that are not PMD-sized
would never have been prone to FOLL_SPLIT_PMD.
hugetlb folios can be anonymous, and page_make_device_exclusive_one() is
not really prepared for handling them at all. So let's spell that out.
Link: https://lkml.kernel.org/r/20250210193801.781278-3-david@redhat.com
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Alistair Popple <apopple(a)nvidia.com>
Tested-by: Alistair Popple <apopple(a)nvidia.com>
Cc: Alex Shi <alexs(a)kernel.org>
Cc: Danilo Krummrich <dakr(a)kernel.org>
Cc: Dave Airlie <airlied(a)gmail.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Karol Herbst <kherbst(a)redhat.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Lyude <lyude(a)redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat(a)kernel.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: SeongJae Park <sj(a)kernel.org>
Cc: Simona Vetter <simona.vetter(a)ffwll.ch>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Yanteng Si <si.yanteng(a)linux.dev>
Cc: Barry Song <v-songbaohua(a)oppo.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/rmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/rmap.c~mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive
+++ a/mm/rmap.c
@@ -2499,7 +2499,7 @@ static bool folio_make_device_exclusive(
* Restrict to anonymous folios for now to avoid potential writeback
* issues.
*/
- if (!folio_test_anon(folio))
+ if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
return false;
rmap_walk(folio, &rwc);
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-factor-out-large-folio-handling-from-folio_order-into-folio_large_order.patch
mm-factor-out-large-folio-handling-from-folio_nr_pages-into-folio_large_nr_pages.patch
mm-let-_folio_nr_pages-overlay-memcg_data-in-first-tail-page.patch
mm-let-_folio_nr_pages-overlay-memcg_data-in-first-tail-page-fix.patch
mm-move-hugetlb-specific-things-in-folio-to-page.patch
mm-move-_pincount-in-folio-to-page-on-32bit.patch
mm-move-_entire_mapcount-in-folio-to-page-on-32bit.patch
mm-rmap-pass-dst_vma-to-folio_dup_file_rmap_pte-and-friends.patch
mm-rmap-pass-vma-to-__folio_add_rmap.patch
mm-rmap-abstract-large-mapcount-operations-for-large-folios-hugetlb.patch
bit_spinlock-__always_inline-unlock-functions.patch
mm-rmap-use-folio_large_nr_pages-in-add-remove-functions.patch
mm-rmap-basic-mm-owner-tracking-for-large-folios-hugetlb.patch
mm-copy-on-write-cow-reuse-support-for-pte-mapped-thp.patch
mm-convert-folio_likely_mapped_shared-to-folio_maybe_mapped_shared.patch
mm-config_no_page_mapcount-to-prepare-for-not-maintain-per-page-mapcounts-in-large-folios.patch
fs-proc-page-remove-per-page-mapcount-dependency-for-proc-kpagecount-config_no_page_mapcount.patch
fs-proc-task_mmu-remove-per-page-mapcount-dependency-for-pm_mmap_exclusive-config_no_page_mapcount.patch
fs-proc-task_mmu-remove-per-page-mapcount-dependency-for-mapmax-config_no_page_mapcount.patch
fs-proc-task_mmu-remove-per-page-mapcount-dependency-for-smaps-smaps_rollup-config_no_page_mapcount.patch
mm-stop-maintaining-the-per-page-mapcount-of-large-folios-config_no_page_mapcount.patch
The quilt patch titled
Subject: mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
has been removed from the -mm tree. Its filename was
mm-gup-reject-foll_split_pmd-with-hugetlb-vmas.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
Date: Mon, 10 Feb 2025 20:37:43 +0100
Patch series "mm: fixes for device-exclusive entries (hmm)", v2.
Discussing the PageTail() call in make_device_exclusive_range() with
Willy, I recently discovered [1] that device-exclusive handling does not
properly work with THP, making the hmm-tests selftests fail if THPs are
enabled on the system.
Looking into more details, I found that hugetlb is not properly fenced,
and I realized that something that was bugging me for longer -- how
device-exclusive entries interact with mapcounts -- completely breaks
migration/swapout/split/hwpoison handling of these folios while they have
device-exclusive PTEs.
The program below can be used to allocate 1 GiB worth of pages and making
them device-exclusive on a kernel with CONFIG_TEST_HMM.
Once they are device-exclusive, these folios cannot get swapped out
(proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how much
one forces memory reclaim), and when having a memory block onlined to
ZONE_MOVABLE, trying to offline it will loop forever and complain about
failed migration of a page that should be movable.
# echo offline > /sys/devices/system/memory/memory136/state
# echo online_movable > /sys/devices/system/memory/memory136/state
# ./hmm-swap &
... wait until everything is device-exclusive
# echo offline > /sys/devices/system/memory/memory136/state
[ 285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
index:0x7f20671f7 pfn:0x442b6a
[ 285.196618][T14882] memcg:ffff888179298000
[ 285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
[ 285.201734][T14882] raw: ...
[ 285.204464][T14882] raw: ...
[ 285.207196][T14882] page dumped because: migration failure
[ 285.209072][T14882] page_owner tracks the page as allocated
[ 285.210915][T14882] page last allocated via order 0, migratetype
Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
[ 285.216765][T14882] post_alloc_hook+0x197/0x1b0
[ 285.218874][T14882] get_page_from_freelist+0x76e/0x3280
[ 285.220864][T14882] __alloc_frozen_pages_noprof+0x38e/0x2740
[ 285.223302][T14882] alloc_pages_mpol+0x1fc/0x540
[ 285.225130][T14882] folio_alloc_mpol_noprof+0x36/0x340
[ 285.227222][T14882] vma_alloc_folio_noprof+0xee/0x1a0
[ 285.229074][T14882] __handle_mm_fault+0x2b38/0x56a0
[ 285.230822][T14882] handle_mm_fault+0x368/0x9f0
...
This series fixes all the issues I found so far. There is no easy way to
fix them without a bigger rework/cleanup. I have a bunch of cleanups on top
(some previously sent, some the result of the discussion in v1) that I will
send out separately once this has landed and I get to it.
I wish we could just use some special present PROT_NONE PTEs instead of
these (non-present, non-none) fake-swap entries; but that just results in
the same problem we keep having (lack of spare PTE bits), and staring at
other similar fake-swap entries, that ship has sailed.
With this series, make_device_exclusive() doesn't actually belong into
mm/rmap.c anymore, but I'll leave moving that for another day.
I only tested this series with the hmm-tests selftests due to lack of HW,
so I'd appreciate some testing, especially of whether the interaction
between two GPUs wanting a device-exclusive entry works as expected.
<program>
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/ioctl.h>

#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)

struct hmm_dmirror_cmd {
    __u64 addr;
    __u64 ptr;
    __u64 npages;
    __u64 cpages;
    __u64 faults;
};

const size_t size = 1 * 1024 * 1024 * 1024ul;
const size_t chunk_size = 2 * 1024 * 1024ul;

int main(void)
{
    struct hmm_dmirror_cmd cmd;
    size_t cur_size;
    int fd, ret;
    char *addr, *mirror;

    fd = open("/dev/hmm_dmirror1", O_RDWR, 0);
    if (fd < 0) {
        perror("open failed\n");
        exit(1);
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap failed\n");
        exit(1);
    }
    /* keep the area small-page backed and fault it all in */
    madvise(addr, size, MADV_NOHUGEPAGE);
    memset(addr, 1, size);

    mirror = malloc(chunk_size);
    /* make the whole 1 GiB device-exclusive, one 2 MiB chunk at a time */
    for (cur_size = 0; cur_size < size; cur_size += chunk_size) {
        cmd.addr = (uintptr_t)addr + cur_size;
        cmd.ptr = (uintptr_t)mirror;
        cmd.npages = chunk_size / getpagesize();
        ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd);
        if (ret) {
            perror("ioctl failed\n");
            exit(1);
        }
    }
    pause();
    return 0;
}
</program>
[1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com
This patch (of 17):
We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb
early, make_device_exclusive_range() can end up getting called on hugetlb
VMAs.
Right now, this means that with a PMD-sized hugetlb page, we can end up
calling split_huge_pmd(), because pmd_trans_huge() also succeeds with
hugetlb PMDs.
For example, using a modified hmm-test selftest one can trigger:
[ 207.017134][T14945] ------------[ cut here ]------------
[ 207.018614][T14945] kernel BUG at mm/page_table_check.c:87!
[ 207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[ 207.021072][T14945] CPU: 3 UID: 0 PID: ...
[ 207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[ 207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510
[ 207.026128][T14945] Code: ...
[ 207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293
[ 207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd
[ 207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80
[ 207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000
[ 207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001
[ 207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000
[ 207.038711][T14945] FS: 00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000
[ 207.040407][T14945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0
[ 207.043196][T14945] PKRU: 55555554
[ 207.043880][T14945] Call Trace:
[ 207.044506][T14945] <TASK>
[ 207.045086][T14945] ? __die+0x51/0x92
[ 207.045864][T14945] ? die+0x29/0x50
[ 207.046596][T14945] ? do_trap+0x250/0x320
[ 207.047430][T14945] ? do_error_trap+0xe7/0x220
[ 207.048346][T14945] ? page_table_check_clear.part.0+0x488/0x510
[ 207.049535][T14945] ? handle_invalid_op+0x34/0x40
[ 207.050494][T14945] ? page_table_check_clear.part.0+0x488/0x510
[ 207.051681][T14945] ? exc_invalid_op+0x2e/0x50
[ 207.052589][T14945] ? asm_exc_invalid_op+0x1a/0x20
[ 207.053596][T14945] ? page_table_check_clear.part.0+0x1fd/0x510
[ 207.054790][T14945] ? page_table_check_clear.part.0+0x487/0x510
[ 207.055993][T14945] ? page_table_check_clear.part.0+0x488/0x510
[ 207.057195][T14945] ? page_table_check_clear.part.0+0x487/0x510
[ 207.058384][T14945] __page_table_check_pmd_clear+0x34b/0x5a0
[ 207.059524][T14945] ? __pfx___page_table_check_pmd_clear+0x10/0x10
[ 207.060775][T14945] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 207.061940][T14945] ? __pfx___lock_acquire+0x10/0x10
[ 207.062967][T14945] pmdp_huge_clear_flush+0x279/0x360
[ 207.064024][T14945] split_huge_pmd_locked+0x82b/0x3750
...
Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic
follow_page_mask code"), we would have ignored the flag; instead, let's
simply refuse the combination completely in check_vma_flags(): the caller
is likely not prepared to handle any hugetlb folios.
We'll teach make_device_exclusive_range() separately to ignore any hugetlb
folios as a future-proof safety net.
Link: https://lkml.kernel.org/r/20250210193801.781278-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250210193801.781278-2-david@redhat.com
Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: John Hubbard <jhubbard(a)nvidia.com>
Reviewed-by: Alistair Popple <apopple(a)nvidia.com>
Tested-by: Alistair Popple <apopple(a)nvidia.com>
Cc: Alex Shi <alexs(a)kernel.org>
Cc: Danilo Krummrich <dakr(a)kernel.org>
Cc: Dave Airlie <airlied(a)gmail.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Karol Herbst <kherbst(a)redhat.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Lyude <lyude(a)redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat(a)kernel.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: SeongJae Park <sj(a)kernel.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Yanteng Si <si.yanteng(a)linux.dev>
Cc: Simona Vetter <simona.vetter(a)ffwll.ch>
Cc: Barry Song <v-songbaohua(a)oppo.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/gup.c~mm-gup-reject-foll_split_pmd-with-hugetlb-vmas
+++ a/mm/gup.c
@@ -1283,6 +1283,9 @@ static int check_vma_flags(struct vm_are
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;
+ if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
+ return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-factor-out-large-folio-handling-from-folio_order-into-folio_large_order.patch
mm-factor-out-large-folio-handling-from-folio_nr_pages-into-folio_large_nr_pages.patch
mm-let-_folio_nr_pages-overlay-memcg_data-in-first-tail-page.patch
mm-let-_folio_nr_pages-overlay-memcg_data-in-first-tail-page-fix.patch
mm-move-hugetlb-specific-things-in-folio-to-page.patch
mm-move-_pincount-in-folio-to-page-on-32bit.patch
mm-move-_entire_mapcount-in-folio-to-page-on-32bit.patch
mm-rmap-pass-dst_vma-to-folio_dup_file_rmap_pte-and-friends.patch
mm-rmap-pass-vma-to-__folio_add_rmap.patch
mm-rmap-abstract-large-mapcount-operations-for-large-folios-hugetlb.patch
bit_spinlock-__always_inline-unlock-functions.patch
mm-rmap-use-folio_large_nr_pages-in-add-remove-functions.patch
mm-rmap-basic-mm-owner-tracking-for-large-folios-hugetlb.patch
mm-copy-on-write-cow-reuse-support-for-pte-mapped-thp.patch
mm-convert-folio_likely_mapped_shared-to-folio_maybe_mapped_shared.patch
mm-config_no_page_mapcount-to-prepare-for-not-maintain-per-page-mapcounts-in-large-folios.patch
fs-proc-page-remove-per-page-mapcount-dependency-for-proc-kpagecount-config_no_page_mapcount.patch
fs-proc-task_mmu-remove-per-page-mapcount-dependency-for-pm_mmap_exclusive-config_no_page_mapcount.patch
fs-proc-task_mmu-remove-per-page-mapcount-dependency-for-mapmax-config_no_page_mapcount.patch
fs-proc-task_mmu-remove-per-page-mapcount-dependency-for-smaps-smaps_rollup-config_no_page_mapcount.patch
mm-stop-maintaining-the-per-page-mapcount-of-large-folios-config_no_page_mapcount.patch
From: Eric Dumazet <edumazet(a)google.com>
tcp_abort() has the same issue as the one fixed in the prior patch
in tcp_write_err().
commit 5ce4645c23cf5f048eb8e9ce49e514bababdee85 upstream.
To apply commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4,
this patch must be applied first.
In order to get consistent results from tcp_poll(), we must call
sk_error_report() after tcp_done().
We can use tcp_done_with_error() to centralize this logic.
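For reference, the helper looks roughly like this (quoted from memory of the
upstream commit, so treat it as a sketch rather than the exact body):

void tcp_done_with_error(struct sock *sk, int err)
{
	/* This barrier is coupled with smp_rmb() in tcp_poll() */
	WRITE_ONCE(sk->sk_err, err);
	smp_wmb();

	tcp_write_queue_purge(sk);
	tcp_done(sk);

	if (!sock_flag(sk, SOCK_DEAD))
		sk_error_report(sk);
}

The important property is that sk_error_report() runs only after tcp_done()
has transitioned the socket, so tcp_poll() cannot observe the error before
the final socket state.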
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Acked-by: Neal Cardwell <ncardwell(a)google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Cc: <stable(a)vger.kernel.org> # v5.10+
[youngmin: Resolved minor conflict in net/ipv4/tcp.c]
Signed-off-by: Youngmin Nam <youngmin.nam(a)samsung.com>
---
net/ipv4/tcp.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7ad82be40f34..9fe164aa185c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4630,13 +4630,9 @@ int tcp_abort(struct sock *sk, int err)
bh_lock_sock(sk);
if (!sock_flag(sk, SOCK_DEAD)) {
- WRITE_ONCE(sk->sk_err, err);
- /* This barrier is coupled with smp_rmb() in tcp_poll() */
- smp_wmb();
- sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
tcp_send_active_reset(sk, GFP_ATOMIC);
- tcp_done(sk);
+ tcp_done_with_error(sk, err);
}
bh_unlock_sock(sk);
--
2.39.2
This fix is the deadline version of the change made to the rt scheduler
here:
https://lore.kernel.org/lkml/20250225180553.167995-1-harshit@nutanix.com/
Please go through the original change for more details on the issue.
In this fix we bail out or retry in push_dl_task() if the task is no
longer at the head of the pushable tasks list because the list changed
while we were trying to lock the runqueue of the other CPU.
Signed-off-by: Harshit Agarwal <harshit(a)nutanix.com>
Cc: stable(a)vger.kernel.org
---
kernel/sched/deadline.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 38e4537790af..c5048969c640 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2704,6 +2704,7 @@ static int push_dl_task(struct rq *rq)
{
struct task_struct *next_task;
struct rq *later_rq;
+ struct task_struct *task;
int ret = 0;
next_task = pick_next_pushable_dl_task(rq);
@@ -2734,15 +2735,30 @@ static int push_dl_task(struct rq *rq)
/* Will lock the rq it'll find */
later_rq = find_lock_later_rq(next_task, rq);
- if (!later_rq) {
- struct task_struct *task;
+ task = pick_next_pushable_dl_task(rq);
+ if (later_rq && (!task || task != next_task)) {
+ /*
+ * We must check all this again, since
+ * find_lock_later_rq releases rq->lock and it is
+ * then possible that next_task has migrated and
+ * is no longer at the head of the pushable list.
+ */
+ double_unlock_balance(rq, later_rq);
+ if (!task) {
+ /* No more tasks */
+ goto out;
+ }
+ put_task_struct(next_task);
+ next_task = task;
+ goto retry;
+ }
+ if (!later_rq) {
/*
* We must check all this again, since
* find_lock_later_rq releases rq->lock and it is
* then possible that next_task has migrated.
*/
- task = pick_next_pushable_dl_task(rq);
if (task == next_task) {
/*
* The task is still there. We don't try
@@ -2751,9 +2767,10 @@ static int push_dl_task(struct rq *rq)
goto out;
}
- if (!task)
+ if (!task) {
/* No more tasks */
goto out;
+ }
put_task_struct(next_task);
next_task = task;
--
2.22.3
If an error is encountered during the HSIC pinctrl handling section, the
regulator should be disabled.
Use devm_add_action_or_reset() to let the regulator-disabling routine be
handled by the device resource management stack.
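The pattern in use here, as a minimal sketch with illustrative names (not the
driver's own), is:

static void demo_disable_regulator(void *arg)
{
	struct regulator *reg = arg;

	regulator_disable(reg);
}

static int demo_enable_regulator(struct device *dev, struct regulator *reg)
{
	int ret;

	ret = regulator_enable(reg);
	if (ret)
		return ret;

	/*
	 * If registration fails, the action runs immediately and disables
	 * the regulator; otherwise it runs automatically on probe failure
	 * or device unbind, so no manual error path is needed anymore.
	 */
	return devm_add_action_or_reset(dev, demo_disable_regulator, reg);
}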
Found by Linux Verification Center (linuxtesting.org).
Fixes: 4d6141288c33 ("usb: chipidea: imx: pinctrl for HSIC is optional")
Cc: stable(a)vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin(a)ispras.ru>
---
v2: simplify the patch taking advantage of devm-helper (Frank Li)
It looks like devm_regulator_get_enable_optional() exists for this very
use case in the driver, but it's not present in the old supported stable
kernels, and I may say those dependency patches wouldn't apply there
cleanly. So fix the problem first; further code simplification is a
subject for a cleanup patch.
drivers/usb/chipidea/ci_hdrc_imx.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/drivers/usb/chipidea/ci_hdrc_imx.c b/drivers/usb/chipidea/ci_hdrc_imx.c
index 619779eef333..d942b3c72640 100644
--- a/drivers/usb/chipidea/ci_hdrc_imx.c
+++ b/drivers/usb/chipidea/ci_hdrc_imx.c
@@ -336,6 +336,13 @@ static int ci_hdrc_imx_notify_event(struct ci_hdrc *ci, unsigned int event)
return ret;
}
+static void ci_hdrc_imx_disable_regulator(void *arg)
+{
+ struct ci_hdrc_imx_data *data = arg;
+
+ regulator_disable(data->hsic_pad_regulator);
+}
+
static int ci_hdrc_imx_probe(struct platform_device *pdev)
{
struct ci_hdrc_imx_data *data;
@@ -394,6 +401,13 @@ static int ci_hdrc_imx_probe(struct platform_device *pdev)
"Failed to enable HSIC pad regulator\n");
goto err_put;
}
+ ret = devm_add_action_or_reset(dev,
+ ci_hdrc_imx_disable_regulator, data);
+ if (ret) {
+ dev_err(dev,
+ "Failed to add regulator devm action\n");
+ goto err_put;
+ }
}
}
@@ -432,11 +446,11 @@ static int ci_hdrc_imx_probe(struct platform_device *pdev)
ret = imx_get_clks(dev);
if (ret)
- goto disable_hsic_regulator;
+ goto qos_remove_request;
ret = imx_prepare_enable_clks(dev);
if (ret)
- goto disable_hsic_regulator;
+ goto qos_remove_request;
ret = clk_prepare_enable(data->clk_wakeup);
if (ret)
@@ -526,10 +540,7 @@ static int ci_hdrc_imx_probe(struct platform_device *pdev)
clk_disable_unprepare(data->clk_wakeup);
err_wakeup_clk:
imx_disable_unprepare_clks(dev);
-disable_hsic_regulator:
- if (data->hsic_pad_regulator)
- /* don't overwrite original ret (cf. EPROBE_DEFER) */
- regulator_disable(data->hsic_pad_regulator);
+qos_remove_request:
if (pdata.flags & CI_HDRC_PMQOS)
cpu_latency_qos_remove_request(&data->pm_qos_req);
data->ci_pdev = NULL;
@@ -557,8 +568,6 @@ static void ci_hdrc_imx_remove(struct platform_device *pdev)
clk_disable_unprepare(data->clk_wakeup);
if (data->plat_data->flags & CI_HDRC_PMQOS)
cpu_latency_qos_remove_request(&data->pm_qos_req);
- if (data->hsic_pad_regulator)
- regulator_disable(data->hsic_pad_regulator);
}
if (data->usbmisc_data)
put_device(data->usbmisc_data->dev);
--
2.48.1
usbmisc is an optional device property, so it is totally valid for the
corresponding data->usbmisc_data to have a NULL value.
Check that before dereferencing the pointer.
Found by Linux Verification Center (linuxtesting.org) with Svace static
analysis tool.
Fixes: 74adad500346 ("usb: chipidea: ci_hdrc_imx: decrement device's refcount in .remove() and in the error path of .probe()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin(a)ispras.ru>
---
drivers/usb/chipidea/ci_hdrc_imx.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/chipidea/ci_hdrc_imx.c b/drivers/usb/chipidea/ci_hdrc_imx.c
index 1a7fc638213e..619779eef333 100644
--- a/drivers/usb/chipidea/ci_hdrc_imx.c
+++ b/drivers/usb/chipidea/ci_hdrc_imx.c
@@ -534,7 +534,8 @@ static int ci_hdrc_imx_probe(struct platform_device *pdev)
cpu_latency_qos_remove_request(&data->pm_qos_req);
data->ci_pdev = NULL;
err_put:
- put_device(data->usbmisc_data->dev);
+ if (data->usbmisc_data)
+ put_device(data->usbmisc_data->dev);
return ret;
}
@@ -559,7 +560,8 @@ static void ci_hdrc_imx_remove(struct platform_device *pdev)
if (data->hsic_pad_regulator)
regulator_disable(data->hsic_pad_regulator);
}
- put_device(data->usbmisc_data->dev);
+ if (data->usbmisc_data)
+ put_device(data->usbmisc_data->dev);
}
static void ci_hdrc_imx_shutdown(struct platform_device *pdev)
--
2.48.1
The quilt patch titled
Subject: mm/page_alloc: fix memory accept before watermarks gets initialized
has been removed from the -mm tree. Its filename was
mm-page_alloc-fix-memory-accept-before-watermarks-gets-initialized.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm/page_alloc: fix memory accept before watermarks gets initialized
Date: Mon, 10 Mar 2025 10:28:55 +0200
Watermarks are initialized during the postcore initcall. Until then, all
watermarks are set to zero. This causes cond_accept_memory() to
incorrectly skip memory acceptance because a watermark of 0 is always met.
This can lead to a premature OOM on boot.
To ensure progress, accept one MAX_ORDER page if the watermark is zero.
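To spell out the arithmetic in the hunk below: with wmark == 0, to_accept =
0 - (NR_FREE_PAGES - unusable - NR_UNACCEPTED), i.e. NR_UNACCEPTED minus the
usable free pages; since unaccepted pages are themselves accounted as free,
this is non-positive in practice, so the accept loop never ran.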
Link: https://lkml.kernel.org/r/20250310082855.2587122-1-kirill.shutemov@linux.in…
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Tested-by: Farrah Chen <farrah.chen(a)intel.com>
Reported-by: Farrah Chen <farrah.chen(a)intel.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Pankaj Gupta <pankaj.gupta(a)amd.com>
Cc: Ashish Kalra <ashish.kalra(a)amd.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe(a)intel.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: "Mike Rapoport (IBM)" <rppt(a)kernel.org>
Cc: Thomas Lendacky <thomas.lendacky(a)amd.com>
Cc: <stable(a)vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-fix-memory-accept-before-watermarks-gets-initialized
+++ a/mm/page_alloc.c
@@ -7004,7 +7004,7 @@ static inline bool has_unaccepted_memory
static bool cond_accept_memory(struct zone *zone, unsigned int order)
{
- long to_accept;
+ long to_accept, wmark;
bool ret = false;
if (!has_unaccepted_memory())
@@ -7013,8 +7013,18 @@ static bool cond_accept_memory(struct zo
if (list_empty(&zone->unaccepted_pages))
return false;
+ wmark = promo_wmark_pages(zone);
+
+ /*
+ * Watermarks have not been initialized yet.
+ *
+ * Accepting one MAX_ORDER page to ensure progress.
+ */
+ if (!wmark)
+ return try_to_accept_memory_one(zone);
+
/* How much to accept to get to promo watermark? */
- to_accept = promo_wmark_pages(zone) -
+ to_accept = wmark -
(zone_page_state(zone, NR_FREE_PAGES) -
__zone_watermark_unusable_free(zone, order, 0) -
zone_page_state(zone, NR_UNACCEPTED));
_
Patches currently in -mm which might be from kirill.shutemov(a)linux.intel.com are
The quilt patch titled
Subject: memcg: drain obj stock on cpu hotplug teardown
has been removed from the -mm tree. Its filename was
memcg-drain-obj-stock-on-cpu-hotplug-teardown.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Shakeel Butt <shakeel.butt(a)linux.dev>
Subject: memcg: drain obj stock on cpu hotplug teardown
Date: Mon, 10 Mar 2025 16:09:34 -0700
Currently on CPU hotplug teardown, only the memcg stock is drained, but we
need to drain the obj stock as well; otherwise we will miss the stats
accumulated on the target CPU as well as the cached nr_bytes. The stats
include MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. In
addition, we are leaking a reference to a struct obj_cgroup object.
Link: https://lkml.kernel.org/r/20250310230934.2913113-1-shakeel.butt@linux.dev
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 9 +++++++++
1 file changed, 9 insertions(+)
--- a/mm/memcontrol.c~memcg-drain-obj-stock-on-cpu-hotplug-teardown
+++ a/mm/memcontrol.c
@@ -1921,9 +1921,18 @@ void drain_all_stock(struct mem_cgroup *
static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
struct memcg_stock_pcp *stock;
+ struct obj_cgroup *old;
+ unsigned long flags;
stock = &per_cpu(memcg_stock, cpu);
+
+ /* drain_obj_stock requires stock_lock */
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ old = drain_obj_stock(stock);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+
drain_stock(stock);
+ obj_cgroup_put(old);
return 0;
}
_
Patches currently in -mm which might be from shakeel.butt(a)linux.dev are
memcg-add-hierarchical-effective-limits-for-v2.patch
memcg-dont-call-propagate_protected_usage-for-v1.patch
page_counter-track-failcnt-only-for-legacy-cgroups.patch
page_counter-reduce-struct-page_counter-size.patch
memcg-bypass-root-memcg-check-for-skmem-charging.patch
memcg-avoid-refill_stock-for-root-memcg.patch
memcg-move-do_memsw_account-to-config_memcg_v1.patch
The quilt patch titled
Subject: mm/huge_memory: drop beyond-EOF folios with the right number of refs
has been removed from the -mm tree. Its filename was
mm-huge_memory-drop-beyond-eof-folios-with-the-right-number-of-refs.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mm/huge_memory: drop beyond-EOF folios with the right number of refs
Date: Mon, 10 Mar 2025 11:57:27 -0400
When an after-split folio is large and needs to be dropped due to EOF,
folio_put_refs(folio, folio_nr_pages(folio)) should be used to drop all
page cache refs. Otherwise, the folio will not be freed, causing a memory
leak.
This leak can happen on a filesystem with blocksize > page_size when a
truncate is performed: the blocksize makes the folios split to order > 0,
and the truncated folios are then never freed.
Link: https://lkml.kernel.org/r/20250310155727.472846-1-ziy@nvidia.com
Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Reported-by: Hugh Dickins <hughd(a)google.com>
Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov(a)linux.intel.com>
Cc: Luis Chamberalin <mcgrof(a)kernel.org>
Cc: Matthew Wilcow (Oracle) <willy(a)infradead.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Pankaj Raghav <p.raghav(a)samsung.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/huge_memory.c~mm-huge_memory-drop-beyond-eof-folios-with-the-right-number-of-refs
+++ a/mm/huge_memory.c
@@ -3304,7 +3304,7 @@ static void __split_huge_page(struct pag
folio_account_cleaned(tail,
inode_to_wb(folio->mapping->host));
__filemap_remove_folio(tail, NULL);
- folio_put(tail);
+ folio_put_refs(tail, folio_nr_pages(tail));
} else if (!folio_test_anon(folio)) {
__xa_store(&folio->mapping->i_pages, tail->index,
tail, 0);
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
selftests-mm-make-file-backed-thp-split-work-by-writing-pmd-size-data.patch
mm-huge_memory-allow-split-shmem-large-folio-to-any-lower-order.patch
selftests-mm-test-splitting-file-backed-thp-to-any-lower-order.patch
xarray-add-xas_try_split-to-split-a-multi-index-entry.patch
mm-huge_memory-add-two-new-not-yet-used-functions-for-folio_split.patch
mm-huge_memory-add-two-new-not-yet-used-functions-for-folio_split-fix.patch
mm-huge_memory-add-two-new-not-yet-used-functions-for-folio_split-fix-2.patch
mm-huge_memory-move-folio-split-common-code-to-__folio_split.patch
mm-huge_memory-add-buddy-allocator-like-non-uniform-folio_split.patch
mm-huge_memory-remove-the-old-unused-__split_huge_page.patch
mm-huge_memory-add-folio_split-to-debugfs-testing-interface.patch
mm-truncate-use-folio_split-in-truncate-operation.patch
selftests-mm-add-tests-for-folio_split-buddy-allocator-like-split.patch
mm-filemap-use-xas_try_split-in-__filemap_add_folio.patch
mm-shmem-use-xas_try_split-in-shmem_split_large_entry.patch
The quilt patch titled
Subject: selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation
has been removed from the -mm tree. Its filename was
selftests-mm-run_vmtestssh-fix-half_ufd_size_mb-calculation.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Rafael Aquini <raquini(a)redhat.com>
Subject: selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation
Date: Tue, 18 Feb 2025 14:22:51 -0500
We noticed that the uffd-stress test was always failing to run when invoked
for the hugetlb profiles on x86_64 systems with a processor count of 64 or
more:
...
# ------------------------------------
# running ./uffd-stress hugetlb 128 32
# ------------------------------------
# ERROR: invalid MiB (errno=9, @uffd-stress.c:459)
...
# [FAIL]
not ok 3 uffd-stress hugetlb 128 32 # exit=1
...
The problem boils down to how run_vmtests.sh (mis)calculates the size of
the region it feeds to uffd-stress. The latter expects an amount in MiB,
while the former just hands over the number of free hugepages halved.
This measurement discrepancy ends up violating uffd-stress' assertion on
the number of hugetlb pages allocated per CPU, causing it to bail out with
the error above.
This commit fixes that issue by adjusting run_vmtests.sh's
half_ufd_size_MB calculation so it properly renders the region size in
MiB, as expected, while keeping all of its original constraints in place.
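A worked example with made-up numbers (illustrative only):

	# say 64 free hugepages of 2048 KiB each
	freepgs=64
	hpgsize_KB=2048
	half_ufd_size_MB=$(((freepgs * hpgsize_KB) / 1024 / 2))
	echo "$half_ufd_size_MB"	# 64 (MiB); the old formula yielded 32 (pages)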
Link: https://lkml.kernel.org/r/20250218192251.53243-1-aquini@redhat.com
Fixes: 2e47a445d7b3 ("selftests/mm: run_vmtests.sh: fix hugetlb mem size calculation")
Signed-off-by: Rafael Aquini <raquini(a)redhat.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/tools/testing/selftests/mm/run_vmtests.sh~selftests-mm-run_vmtestssh-fix-half_ufd_size_mb-calculation
+++ a/tools/testing/selftests/mm/run_vmtests.sh
@@ -304,7 +304,9 @@ uffd_stress_bin=./uffd-stress
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} anon 20 16
# Hugetlb tests require source and destination huge pages. Pass in half
# the size of the free pages we have, which is used for *each*.
-half_ufd_size_MB=$((freepgs / 2))
+# uffd-stress expects a region expressed in MiB, so we adjust
+# half_ufd_size_MB accordingly.
+half_ufd_size_MB=$(((freepgs * hpgsize_KB) / 1024 / 2))
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} hugetlb "$half_ufd_size_MB" 32
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} hugetlb-private "$half_ufd_size_MB" 32
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} shmem 20 16
_
Patches currently in -mm which might be from raquini(a)redhat.com are
The quilt patch titled
Subject: mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
has been removed from the -mm tree. Its filename was
mm-fix-error-handling-in-__filemap_get_folio-with-fgp_nowait.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Raphael S. Carvalho" <raphaelsc(a)scylladb.com>
Subject: mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
Date: Mon, 24 Feb 2025 11:37:00 -0300
original report:
https://lore.kernel.org/all/CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=…
When doing buffered writes with FGP_NOWAIT under memory pressure, the
system returned ENOMEM despite there being plenty of memory available to
be reclaimed from the page cache. The user space program used the io_uring
interface, which in turn submits I/O with FGP_NOWAIT (the fast path).
retsnoop pointed to iomap_get_folio:
00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721
(reactor-1/combined_tests):
entry_SYSCALL_64_after_hwframe+0x76
do_syscall_64+0x82
__do_sys_io_uring_enter+0x265
io_submit_sqes+0x209
io_issue_sqe+0x5b
io_write+0xdd
xfs_file_buffered_write+0x84
iomap_file_buffered_write+0x1a6
32us [-ENOMEM] iomap_write_begin+0x408
iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… [truncated]
pos=0 len=4096 foliop=0xffffb32c296b7b80
! 4us [-ENOMEM] iomap_get_folio
iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… [truncated]
pos=0 len=4096
This is likely a regression caused by 66dabbb65d67 ("mm: return an ERR_PTR
from __filemap_get_folio"), which moved error handling from
iomap_get_folio() to __filemap_get_folio() but broke FGP_NOWAIT
handling, so -ENOMEM escapes to user space. Had it correctly
returned -EAGAIN with NOWAIT, either io_uring or user space itself would
be able to retry the request.
It's not enough to patch io_uring, since the iomap interface is the one
responsible for this, and the pwritev2(RWF_NOWAIT) and AIO interfaces must
return the proper error too.
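As a sketch of the expected user-space contract (illustrative code, not taken
from the report; the helper name is made up):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

static ssize_t write_try_nowait(int fd, const struct iovec *iov, int cnt,
				off_t off)
{
	ssize_t ret = pwritev2(fd, iov, cnt, off, RWF_NOWAIT);

	/*
	 * With the fix, a nonblocking allocation failure surfaces as
	 * EAGAIN instead of ENOMEM, so a blocking retry is the right
	 * fallback.
	 */
	if (ret < 0 && errno == EAGAIN)
		ret = pwritev2(fd, iov, cnt, off, 0);
	return ret;
}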
The patch was tested with the scylladb test suite (its original reproducer),
and the tests all pass now under memory pressure.
Link: https://lkml.kernel.org/r/20250224143700.23035-1-raphaelsc@scylladb.com
Fixes: 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio")
Signed-off-by: Raphael S. Carvalho <raphaelsc(a)scylladb.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Dave Chinner <dchinner(a)redhat.com>
Cc: "Darrick J. Wong" <djwong(a)kernel.org>
Cc: Matthew Wilcow (Oracle) <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/filemap.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
--- a/mm/filemap.c~mm-fix-error-handling-in-__filemap_get_folio-with-fgp_nowait
+++ a/mm/filemap.c
@@ -1986,8 +1986,19 @@ no_page:
if (err == -EEXIST)
goto repeat;
- if (err)
+ if (err) {
+ /*
+ * When NOWAIT I/O fails to allocate folios this could
+ * be due to a nonblocking memory allocation and not
+ * because the system actually is out of memory.
+ * Return -EAGAIN so that the caller retries in a
+ * blocking fashion instead of propagating -ENOMEM
+ * to the application.
+ */
+ if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM)
+ err = -EAGAIN;
return ERR_PTR(err);
+ }
/*
* filemap_add_folio locks the page, and for mmap
* we expect an unlocked page.
_
Patches currently in -mm which might be from raphaelsc(a)scylladb.com are
The quilt patch titled
Subject: mm/migrate: fix shmem xarray update during migration
has been removed from the -mm tree. Its filename was
mm-migrate-fix-shmem-xarray-update-during-migration.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mm/migrate: fix shmem xarray update during migration
Date: Wed, 5 Mar 2025 15:04:03 -0500
A shmem folio can be either in the page cache or in the swap cache, but not
in both at the same time. Namely, once it is in the swap cache,
folio->mapping should be NULL, and the folio is no longer in a shmem mapping.
In __folio_migrate_mapping(), to determine the number of xarray entries to
update, folio_test_swapbacked() is used, but that conflates the
shmem-in-page-cache case with the shmem-in-swap-cache case. It leads to
xarray multi-index entry corruption, since it turns a sibling entry into a
normal entry during xas_store() (see [1] for a userspace reproduction).
Fix it by using only folio_test_swapcache() to determine whether the xarray
is storing swap cache entries or not, and thus choose the right number of
xarray entries to update.
[1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/
Note:
In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is
used to get the swap_cache address space, but that ignores the
shmem-in-swap-cache case. It could lead to a NULL pointer dereference when
an in-swap-cache shmem folio is split at __xa_store(), since
!folio_test_anon() is true and folio->mapping is NULL. But fortunately,
its caller split_huge_page_to_list_to_order() bails out early with EBUSY
when folio->mapping is NULL. So there is no need to take care of it here.
Link: https://lkml.kernel.org/r/20250305200403.2822855-1-ziy@nvidia.com
Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Reported-by: Liu Shixin <liushixin2(a)huawei.com>
Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/
Suggested-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reviewed-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Lance Yang <ioworker0(a)gmail.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
--- a/mm/migrate.c~mm-migrate-fix-shmem-xarray-update-during-migration
+++ a/mm/migrate.c
@@ -518,15 +518,13 @@ static int __folio_migrate_mapping(struc
if (folio_test_anon(folio) && folio_test_large(folio))
mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
folio_ref_add(newfolio, nr); /* add cache reference */
- if (folio_test_swapbacked(folio)) {
+ if (folio_test_swapbacked(folio))
__folio_set_swapbacked(newfolio);
- if (folio_test_swapcache(folio)) {
- folio_set_swapcache(newfolio);
- newfolio->private = folio_get_private(folio);
- }
+ if (folio_test_swapcache(folio)) {
+ folio_set_swapcache(newfolio);
+ newfolio->private = folio_get_private(folio);
entries = nr;
} else {
- VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
entries = 1;
}
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
selftests-mm-make-file-backed-thp-split-work-by-writing-pmd-size-data.patch
mm-huge_memory-allow-split-shmem-large-folio-to-any-lower-order.patch
selftests-mm-test-splitting-file-backed-thp-to-any-lower-order.patch
xarray-add-xas_try_split-to-split-a-multi-index-entry.patch
mm-huge_memory-add-two-new-not-yet-used-functions-for-folio_split.patch
mm-huge_memory-add-two-new-not-yet-used-functions-for-folio_split-fix.patch
mm-huge_memory-add-two-new-not-yet-used-functions-for-folio_split-fix-2.patch
mm-huge_memory-move-folio-split-common-code-to-__folio_split.patch
mm-huge_memory-add-buddy-allocator-like-non-uniform-folio_split.patch
mm-huge_memory-remove-the-old-unused-__split_huge_page.patch
mm-huge_memory-add-folio_split-to-debugfs-testing-interface.patch
mm-truncate-use-folio_split-in-truncate-operation.patch
selftests-mm-add-tests-for-folio_split-buddy-allocator-like-split.patch
mm-filemap-use-xas_try_split-in-__filemap_add_folio.patch
mm-shmem-use-xas_try_split-in-shmem_split_large_entry.patch
fpga_mgr_test_img_load_sgt() allocates memory for sgt using
kunit_kzalloc(); however, it does not check whether the allocation failed.
It then passes sgt to sg_alloc_table(), which passes it to
__sg_alloc_table(). That function calls memset() on sgt in an attempt to
zero it out. If the allocation fails, sgt will be NULL and the memset()
will trigger a NULL pointer dereference.
Fix this by checking the allocation with KUNIT_ASSERT_NOT_ERR_OR_NULL().
Fixes: ccbc1c302115 ("fpga: add an initial KUnit suite for the FPGA Manager")
Cc: stable(a)vger.kernel.org
Signed-off-by: Qasim Ijaz <qasdev00(a)gmail.com>
---
drivers/fpga/tests/fpga-mgr-test.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/fpga/tests/fpga-mgr-test.c b/drivers/fpga/tests/fpga-mgr-test.c
index 9cb37aefbac4..1902ebf5a298 100644
--- a/drivers/fpga/tests/fpga-mgr-test.c
+++ b/drivers/fpga/tests/fpga-mgr-test.c
@@ -263,6 +263,7 @@ static void fpga_mgr_test_img_load_sgt(struct kunit *test)
img_buf = init_test_buffer(test, IMAGE_SIZE);
sgt = kunit_kzalloc(test, sizeof(*sgt), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_ERR_OR_NULL(test, sgt);
ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
KUNIT_ASSERT_EQ(test, ret, 0);
sg_init_one(sgt->sgl, img_buf, IMAGE_SIZE);
--
2.39.5
Currently, RCU boost testing in rcutorture is broken because it relies on
RT throttling being disabled. This means the test will always pass (or
rarely fail). This occurs because RT throttling was recently replaced by
the DL server, which boosts CFS tasks even when rcutorture tries to
disable throttling (see rcu_torture_disable_rt_throttle()). However, the
sysctl_sched_rt_runtime variable is not considered, thus still allowing
RT tasks to be preempted by CFS tasks.
Therefore, this patch prevents the DL server from starting when rcutorture
sets sysctl_sched_rt_runtime to -1.
With this patch, boosting in TREE09 fails reliably if RCU_BOOST=n.
Steven also mentioned that this could fix RT use cases where users do not
want the DL server interfering.
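For context, the gating helper used in the hunks below is, if I recall the
definition in kernel/sched/sched.h correctly:

static inline int rt_bandwidth_enabled(void)
{
	return sysctl_sched_rt_runtime >= 0;
}

so once rcutorture writes -1 to sysctl_sched_rt_runtime, the
dl_server_start()/dl_server_update() calls below become no-ops, keeping the
DL server out of the way.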
Cc: stable(a)vger.kernel.org
Cc: Paul E. McKenney <paulmck(a)kernel.org>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Fixes: cea5a3472ac4 ("sched/fair: Cleanup fair_server")
Signed-off-by: Joel Fernandes <joelagnelf(a)nvidia.com>
---
v1->v2:
Updated Fixes tag (Steven)
Moved the stoppage of DL server to fair (Juri)
kernel/sched/fair.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..d7ba333393f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1242,7 +1242,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
* against fair_server such that it can account for this time
* and possibly avoid running this period.
*/
- if (dl_server_active(&rq->fair_server))
+ if (dl_server_active(&rq->fair_server) && rt_bandwidth_enabled())
dl_server_update(&rq->fair_server, delta_exec);
}
@@ -5957,7 +5957,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
sub_nr_running(rq, queued_delta);
/* Stop the fair server if throttling resulted in no runnable tasks */
- if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
+ if (rq_h_nr_queued && !rq->cfs.h_nr_queued && dl_server_active(&rq->fair_server))
dl_server_stop(&rq->fair_server);
done:
/*
@@ -6056,7 +6056,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
}
/* Start the fair server if un-throttling resulted in new runnable tasks */
- if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
+ if (!rq_h_nr_queued && rq->cfs.h_nr_queued && rt_bandwidth_enabled())
dl_server_start(&rq->fair_server);
/* At this point se is NULL and we are at root level*/
@@ -7005,9 +7005,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
/* Account for idle runtime */
- if (!rq->nr_running)
+ if (!rq->nr_running && rt_bandwidth_enabled())
dl_server_update_idle_time(rq, rq->curr);
- dl_server_start(&rq->fair_server);
+
+ if (rt_bandwidth_enabled())
+ dl_server_start(&rq->fair_server);
}
/* At this point se is NULL and we are at root level*/
@@ -7134,7 +7136,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
sub_nr_running(rq, h_nr_queued);
- if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
+ if (rq_h_nr_queued && !rq->cfs.h_nr_queued && dl_server_active(&rq->fair_server))
dl_server_stop(&rq->fair_server);
/* balance early to pull high priority tasks */
--
2.43.0
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x d4234d131b0a3f9e65973f1cdc71bb3560f5d14b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031635-spider-hurricane-4475@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d4234d131b0a3f9e65973f1cdc71bb3560f5d14b Mon Sep 17 00:00:00 2001
From: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Date: Tue, 4 Mar 2025 15:27:00 +0800
Subject: [PATCH] arm64: mm: Populate vmemmap at the page level if not section
aligned
On the arm64 platform with the 4K base page config, SECTION_SIZE_BITS is set
to 27, making one section 128M. The struct page area which vmemmap points to
for one such section is then 2M.
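As a quick sanity check of those numbers (assuming the common 64-byte
struct page): a 128M section holds 128M / 4K = 32768 pages, and
32768 * 64 bytes = 2M of vmemmap, i.e. exactly one PMD-sized mapping.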
Commit c1cc1552616d ("arm64: MMU initialisation") optimizes the
vmemmap to populate at the PMD section level, which was suitable
initially since the hotplug granule was always one section (128M). However,
commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
introduced a 2M (SUBSECTION_SIZE) hotplug granule, which disrupted the
existing arm64 assumptions.
The first problem is that if start or end is not aligned to a section
boundary, such as when a subsection is hot added, populating the entire
section is wasteful.
The next problem is that if we hotplug something that spans part of a 128 MiB
section (subsections; let's call it memblock1), then hotplug something that
spans another part of the same 128 MiB section (subsections; let's call it
memblock2), and subsequently unplug memblock1, vmemmap_free() will clear
the entire PMD entry which also backs memblock2, even though memblock2
is still active.
Assuming hotplug/unplug sizes are guaranteed to be symmetric, apply a fix
similar to x86-64: populate at the page level if start/end is not aligned
to a section boundary.
Cc: stable(a)vger.kernel.org # v5.4+
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Acked-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Link: https://lore.kernel.org/r/20250304072700.3405036-1-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b4df5bc5b1b8..1dfe1a8efdbe 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1177,8 +1177,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap)
{
WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+ /* [start, end] should be within one section */
+ WARN_ON_ONCE(end - start > PAGES_PER_SECTION * sizeof(struct page));
- if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+ if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
+ (end - start < PAGES_PER_SECTION * sizeof(struct page)))
return vmemmap_populate_basepages(start, end, node, altmap);
else
return vmemmap_populate_hugepages(start, end, node, altmap);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x d4234d131b0a3f9e65973f1cdc71bb3560f5d14b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031628-percolate-sulphate-3806@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d4234d131b0a3f9e65973f1cdc71bb3560f5d14b Mon Sep 17 00:00:00 2001
From: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Date: Tue, 4 Mar 2025 15:27:00 +0800
Subject: [PATCH] arm64: mm: Populate vmemmap at the page level if not section
aligned
On the arm64 platform with the 4K base page config, SECTION_SIZE_BITS is set
to 27, making one section 128M. The struct page area which vmemmap points to
for one such section is then 2M.
Commit c1cc1552616d ("arm64: MMU initialisation") optimizes the
vmemmap to populate at the PMD section level, which was suitable
initially since the hotplug granule was always one section (128M). However,
commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
introduced a 2M (SUBSECTION_SIZE) hotplug granule, which disrupted the
existing arm64 assumptions.
The first problem is that if start or end is not aligned to a section
boundary, such as when a subsection is hot added, populating the entire
section is wasteful.
The next problem is that if we hotplug something that spans part of a 128 MiB
section (subsections; let's call it memblock1), then hotplug something that
spans another part of the same 128 MiB section (subsections; let's call it
memblock2), and subsequently unplug memblock1, vmemmap_free() will clear
the entire PMD entry which also backs memblock2, even though memblock2
is still active.
Assuming hotplug/unplug sizes are guaranteed to be symmetric, apply a fix
similar to x86-64: populate at the page level if start/end is not aligned
to a section boundary.
Cc: stable(a)vger.kernel.org # v5.4+
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Acked-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Link: https://lore.kernel.org/r/20250304072700.3405036-1-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b4df5bc5b1b8..1dfe1a8efdbe 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1177,8 +1177,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap)
{
WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+ /* [start, end] should be within one section */
+ WARN_ON_ONCE(end - start > PAGES_PER_SECTION * sizeof(struct page));
- if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+ if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
+ (end - start < PAGES_PER_SECTION * sizeof(struct page)))
return vmemmap_populate_basepages(start, end, node, altmap);
else
return vmemmap_populate_hugepages(start, end, node, altmap);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x d4234d131b0a3f9e65973f1cdc71bb3560f5d14b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031627-capture-pouring-5da4@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d4234d131b0a3f9e65973f1cdc71bb3560f5d14b Mon Sep 17 00:00:00 2001
From: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Date: Tue, 4 Mar 2025 15:27:00 +0800
Subject: [PATCH] arm64: mm: Populate vmemmap at the page level if not section
aligned
On the arm64 platform with the 4K base page config, SECTION_SIZE_BITS is set
to 27, making one section 128M. The struct page area which vmemmap points to
for one such section is then 2M.
Commit c1cc1552616d ("arm64: MMU initialisation") optimizes the
vmemmap to populate at the PMD section level, which was suitable
initially since the hotplug granule was always one section (128M). However,
commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
introduced a 2M (SUBSECTION_SIZE) hotplug granule, which disrupted the
existing arm64 assumptions.
The first problem is that if start or end is not aligned to a section
boundary, such as when a subsection is hot added, populating the entire
section is wasteful.
The next problem is that if we hotplug something that spans part of a 128 MiB
section (subsections; let's call it memblock1), then hotplug something that
spans another part of the same 128 MiB section (subsections; let's call it
memblock2), and subsequently unplug memblock1, vmemmap_free() will clear
the entire PMD entry which also backs memblock2, even though memblock2
is still active.
Assuming hotplug/unplug sizes are guaranteed to be symmetric, apply a fix
similar to x86-64: populate at the page level if start/end is not aligned
to a section boundary.
Cc: stable(a)vger.kernel.org # v5.4+
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Acked-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Link: https://lore.kernel.org/r/20250304072700.3405036-1-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b4df5bc5b1b8..1dfe1a8efdbe 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1177,8 +1177,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap)
{
WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+ /* [start, end] should be within one section */
+ WARN_ON_ONCE(end - start > PAGES_PER_SECTION * sizeof(struct page));
- if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+ if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
+ (end - start < PAGES_PER_SECTION * sizeof(struct page)))
return vmemmap_populate_basepages(start, end, node, altmap);
else
return vmemmap_populate_hugepages(start, end, node, altmap);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x d4234d131b0a3f9e65973f1cdc71bb3560f5d14b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031627-theorize-manned-7263@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d4234d131b0a3f9e65973f1cdc71bb3560f5d14b Mon Sep 17 00:00:00 2001
From: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Date: Tue, 4 Mar 2025 15:27:00 +0800
Subject: [PATCH] arm64: mm: Populate vmemmap at the page level if not section
aligned
On the arm64 platform with the 4K base page config, SECTION_SIZE_BITS is set
to 27, making one section 128M. The struct page area which vmemmap points to
for one such section is then 2M.
Commit c1cc1552616d ("arm64: MMU initialisation") optimizes the
vmemmap to populate at the PMD section level, which was suitable
initially since the hotplug granule was always one section (128M). However,
commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
introduced a 2M (SUBSECTION_SIZE) hotplug granule, which disrupted the
existing arm64 assumptions.
The first problem is that if start or end is not aligned to a section
boundary, such as when a subsection is hot added, populating the entire
section is wasteful.
The next problem is that if we hotplug something that spans part of a 128 MiB
section (subsections; let's call it memblock1), then hotplug something that
spans another part of the same 128 MiB section (subsections; let's call it
memblock2), and subsequently unplug memblock1, vmemmap_free() will clear
the entire PMD entry which also backs memblock2, even though memblock2
is still active.
Assuming hotplug/unplug sizes are guaranteed to be symmetric, apply a fix
similar to x86-64: populate at the page level if start/end is not aligned
to a section boundary.
Cc: stable(a)vger.kernel.org # v5.4+
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Acked-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Link: https://lore.kernel.org/r/20250304072700.3405036-1-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b4df5bc5b1b8..1dfe1a8efdbe 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1177,8 +1177,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap)
{
WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+ /* [start, end] should be within one section */
+ WARN_ON_ONCE(end - start > PAGES_PER_SECTION * sizeof(struct page));
- if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+ if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
+ (end - start < PAGES_PER_SECTION * sizeof(struct page)))
return vmemmap_populate_basepages(start, end, node, altmap);
else
return vmemmap_populate_hugepages(start, end, node, altmap);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 91cf42c63f2d8a9c1bcdfe923218e079b32e1a69
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031631-sensuous-cataract-f3f3@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 91cf42c63f2d8a9c1bcdfe923218e079b32e1a69 Mon Sep 17 00:00:00 2001
From: Conor Dooley <conor.dooley(a)microchip.com>
Date: Mon, 3 Mar 2025 10:47:40 +0000
Subject: [PATCH] spi: microchip-core: prevent RX overflows when transmit size
> FIFO size
When the size of a transfer exceeds the size of the FIFO (32 bytes), RX
overflows will be generated and receive data will be corrupted and
warnings will be produced. For example, here's an error generated by a
transfer of 36 bytes:
spi_master spi0: mchp_corespi_interrupt: RX OVERFLOW: rxlen: 4, txlen: 0
The driver is currently split between handling receiving in the
interrupt handler and sending outside of it. Move all handling out of
the interrupt handler, and explicitly link the number of bytes read out
of the RX FIFO to the number written into the TX one. This both resolves
the overflow problems and simplifies the flow of the driver.
CC: stable(a)vger.kernel.org
Fixes: 9ac8d17694b6 ("spi: add support for microchip fpga spi controllers")
Signed-off-by: Conor Dooley <conor.dooley(a)microchip.com>
Link: https://patch.msgid.link/20250303-veal-snooper-712c1dfad336@wendy
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-microchip-core.c b/drivers/spi/spi-microchip-core.c
index 5b6af55855ef..62ba0bd9cbb7 100644
--- a/drivers/spi/spi-microchip-core.c
+++ b/drivers/spi/spi-microchip-core.c
@@ -70,8 +70,7 @@
#define INT_RX_CHANNEL_OVERFLOW BIT(2)
#define INT_TX_CHANNEL_UNDERRUN BIT(3)
-#define INT_ENABLE_MASK (CONTROL_RX_DATA_INT | CONTROL_TX_DATA_INT | \
- CONTROL_RX_OVER_INT | CONTROL_TX_UNDER_INT)
+#define INT_ENABLE_MASK (CONTROL_RX_OVER_INT | CONTROL_TX_UNDER_INT)
#define REG_CONTROL (0x00)
#define REG_FRAME_SIZE (0x04)
@@ -133,10 +132,15 @@ static inline void mchp_corespi_disable(struct mchp_corespi *spi)
mchp_corespi_write(spi, REG_CONTROL, control);
}
-static inline void mchp_corespi_read_fifo(struct mchp_corespi *spi)
+static inline void mchp_corespi_read_fifo(struct mchp_corespi *spi, int fifo_max)
{
- while (spi->rx_len >= spi->n_bytes && !(mchp_corespi_read(spi, REG_STATUS) & STATUS_RXFIFO_EMPTY)) {
- u32 data = mchp_corespi_read(spi, REG_RX_DATA);
+ for (int i = 0; i < fifo_max; i++) {
+ u32 data;
+
+ while (mchp_corespi_read(spi, REG_STATUS) & STATUS_RXFIFO_EMPTY)
+ ;
+
+ data = mchp_corespi_read(spi, REG_RX_DATA);
spi->rx_len -= spi->n_bytes;
@@ -211,11 +215,10 @@ static inline void mchp_corespi_set_xfer_size(struct mchp_corespi *spi, int len)
mchp_corespi_write(spi, REG_FRAMESUP, len);
}
-static inline void mchp_corespi_write_fifo(struct mchp_corespi *spi)
+static inline void mchp_corespi_write_fifo(struct mchp_corespi *spi, int fifo_max)
{
- int fifo_max, i = 0;
+ int i = 0;
- fifo_max = DIV_ROUND_UP(min(spi->tx_len, FIFO_DEPTH), spi->n_bytes);
mchp_corespi_set_xfer_size(spi, fifo_max);
while ((i < fifo_max) && !(mchp_corespi_read(spi, REG_STATUS) & STATUS_TXFIFO_FULL)) {
@@ -413,19 +416,6 @@ static irqreturn_t mchp_corespi_interrupt(int irq, void *dev_id)
if (intfield == 0)
return IRQ_NONE;
- if (intfield & INT_TXDONE)
- mchp_corespi_write(spi, REG_INT_CLEAR, INT_TXDONE);
-
- if (intfield & INT_RXRDY) {
- mchp_corespi_write(spi, REG_INT_CLEAR, INT_RXRDY);
-
- if (spi->rx_len)
- mchp_corespi_read_fifo(spi);
- }
-
- if (!spi->rx_len && !spi->tx_len)
- finalise = true;
-
if (intfield & INT_RX_CHANNEL_OVERFLOW) {
mchp_corespi_write(spi, REG_INT_CLEAR, INT_RX_CHANNEL_OVERFLOW);
finalise = true;
@@ -512,9 +502,14 @@ static int mchp_corespi_transfer_one(struct spi_controller *host,
mchp_corespi_write(spi, REG_SLAVE_SELECT, spi->pending_slave_select);
- while (spi->tx_len)
- mchp_corespi_write_fifo(spi);
+ while (spi->tx_len) {
+ int fifo_max = DIV_ROUND_UP(min(spi->tx_len, FIFO_DEPTH), spi->n_bytes);
+ mchp_corespi_write_fifo(spi, fifo_max);
+ mchp_corespi_read_fifo(spi, fifo_max);
+ }
+
+ spi_finalize_current_transfer(host);
return 1;
}
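As an aside for readers following the diff above, here is a minimal sketch of the chunked write-then-read pattern the patch introduces. The context struct and the FIFO helpers are illustrative stand-ins for the driver's own definitions (struct mchp_corespi and the real mchp_corespi_{write,read}_fifo()); only DIV_ROUND_UP() and min() are the actual kernel macros:
```c
#include <linux/kernel.h>
#include <linux/minmax.h>

#define FIFO_DEPTH	32	/* FIFO size in bytes, per the commit message */

/* Illustrative transfer context, standing in for struct mchp_corespi. */
struct xfer_ctx {
	int tx_len;	/* bytes left to write */
	int rx_len;	/* bytes left to read  */
	int n_bytes;	/* bytes per FIFO word */
};

static void write_fifo(struct xfer_ctx *c, int fifo_max);	/* stand-in: fills TX FIFO  */
static void read_fifo(struct xfer_ctx *c, int fifo_max);	/* stand-in: polls RX FIFO
								 * until fifo_max words arrive */

/*
 * For a 36-byte transfer with n_bytes == 1 the loop runs twice: first
 * with fifo_max = 32 (one full FIFO), then with fifo_max = 4. At most
 * one FIFO's worth of data is ever in flight, so RX cannot overflow.
 */
static void transfer_chunked(struct xfer_ctx *c)
{
	while (c->tx_len) {
		int fifo_max = DIV_ROUND_UP(min(c->tx_len, FIFO_DEPTH),
					    c->n_bytes);

		write_fifo(c, fifo_max);
		read_fifo(c, fifo_max);
	}
}
```
Tying the read count to the write count like this is what lets the patch drop the RX-ready interrupt entirely: the reader always knows exactly how many words to drain.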
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id, to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 91cf42c63f2d8a9c1bcdfe923218e079b32e1a69
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025031631-outskirts-catfight-a453@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 91cf42c63f2d8a9c1bcdfe923218e079b32e1a69 Mon Sep 17 00:00:00 2001
From: Conor Dooley <conor.dooley(a)microchip.com>
Date: Mon, 3 Mar 2025 10:47:40 +0000
Subject: [PATCH] spi: microchip-core: prevent RX overflows when transmit size
> FIFO size
When the size of a transfer exceeds the size of the FIFO (32 bytes), RX
overflows will be generated, receive data will be corrupted, and
warnings will be produced. For example, here's an error generated by a
transfer of 36 bytes:
spi_master spi0: mchp_corespi_interrupt: RX OVERFLOW: rxlen: 4, txlen: 0
The driver is currently split between handling receiving in the
interrupt handler and sending outside of it. Move all handling out of
the interrupt handler, and explicitly link the number of bytes read out
of the RX FIFO to the number written into the TX one. This both resolves
the overflow problems and simplifies the flow of the driver.
CC: stable(a)vger.kernel.org
Fixes: 9ac8d17694b6 ("spi: add support for microchip fpga spi controllers")
Signed-off-by: Conor Dooley <conor.dooley(a)microchip.com>
Link: https://patch.msgid.link/20250303-veal-snooper-712c1dfad336@wendy
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-microchip-core.c b/drivers/spi/spi-microchip-core.c
index 5b6af55855ef..62ba0bd9cbb7 100644
--- a/drivers/spi/spi-microchip-core.c
+++ b/drivers/spi/spi-microchip-core.c
@@ -70,8 +70,7 @@
#define INT_RX_CHANNEL_OVERFLOW BIT(2)
#define INT_TX_CHANNEL_UNDERRUN BIT(3)
-#define INT_ENABLE_MASK (CONTROL_RX_DATA_INT | CONTROL_TX_DATA_INT | \
- CONTROL_RX_OVER_INT | CONTROL_TX_UNDER_INT)
+#define INT_ENABLE_MASK (CONTROL_RX_OVER_INT | CONTROL_TX_UNDER_INT)
#define REG_CONTROL (0x00)
#define REG_FRAME_SIZE (0x04)
@@ -133,10 +132,15 @@ static inline void mchp_corespi_disable(struct mchp_corespi *spi)
mchp_corespi_write(spi, REG_CONTROL, control);
}
-static inline void mchp_corespi_read_fifo(struct mchp_corespi *spi)
+static inline void mchp_corespi_read_fifo(struct mchp_corespi *spi, int fifo_max)
{
- while (spi->rx_len >= spi->n_bytes && !(mchp_corespi_read(spi, REG_STATUS) & STATUS_RXFIFO_EMPTY)) {
- u32 data = mchp_corespi_read(spi, REG_RX_DATA);
+ for (int i = 0; i < fifo_max; i++) {
+ u32 data;
+
+ while (mchp_corespi_read(spi, REG_STATUS) & STATUS_RXFIFO_EMPTY)
+ ;
+
+ data = mchp_corespi_read(spi, REG_RX_DATA);
spi->rx_len -= spi->n_bytes;
@@ -211,11 +215,10 @@ static inline void mchp_corespi_set_xfer_size(struct mchp_corespi *spi, int len)
mchp_corespi_write(spi, REG_FRAMESUP, len);
}
-static inline void mchp_corespi_write_fifo(struct mchp_corespi *spi)
+static inline void mchp_corespi_write_fifo(struct mchp_corespi *spi, int fifo_max)
{
- int fifo_max, i = 0;
+ int i = 0;
- fifo_max = DIV_ROUND_UP(min(spi->tx_len, FIFO_DEPTH), spi->n_bytes);
mchp_corespi_set_xfer_size(spi, fifo_max);
while ((i < fifo_max) && !(mchp_corespi_read(spi, REG_STATUS) & STATUS_TXFIFO_FULL)) {
@@ -413,19 +416,6 @@ static irqreturn_t mchp_corespi_interrupt(int irq, void *dev_id)
if (intfield == 0)
return IRQ_NONE;
- if (intfield & INT_TXDONE)
- mchp_corespi_write(spi, REG_INT_CLEAR, INT_TXDONE);
-
- if (intfield & INT_RXRDY) {
- mchp_corespi_write(spi, REG_INT_CLEAR, INT_RXRDY);
-
- if (spi->rx_len)
- mchp_corespi_read_fifo(spi);
- }
-
- if (!spi->rx_len && !spi->tx_len)
- finalise = true;
-
if (intfield & INT_RX_CHANNEL_OVERFLOW) {
mchp_corespi_write(spi, REG_INT_CLEAR, INT_RX_CHANNEL_OVERFLOW);
finalise = true;
@@ -512,9 +502,14 @@ static int mchp_corespi_transfer_one(struct spi_controller *host,
mchp_corespi_write(spi, REG_SLAVE_SELECT, spi->pending_slave_select);
- while (spi->tx_len)
- mchp_corespi_write_fifo(spi);
+ while (spi->tx_len) {
+ int fifo_max = DIV_ROUND_UP(min(spi->tx_len, FIFO_DEPTH), spi->n_bytes);
+ mchp_corespi_write_fifo(spi, fifo_max);
+ mchp_corespi_read_fifo(spi, fifo_max);
+ }
+
+ spi_finalize_current_transfer(host);
return 1;
}
On Sat, Mar 15, 2025 at 07:57:55PM +0100, proxy0(a)tutamail.com wrote:
> Mar 15, 2025, 9:28 PM by lukas(a)wunner.de:
> > After dwelling on this for a while, I'm thinking that it may re-introduce
> > the issue fixed by commit f5eff5591b8f ("PCI: pciehp: Fix AB-BA deadlock
> > between reset_lock and device_lock"):
> >
> > Looking at the second and third stack trace in its commit message,
> > down_write(reset_lock) in pciehp_reset_slot() is basically equivalent
> > to synchronize_irq() and we're holding device_lock() at that point,
> > hindering progress of pciehp_ist().
> >
> > So I think I have guided you in the wrong direction and I apologize
> > for that.
> >
> > However it seems to me that this should be solvable with the small
> > patch below. Am I missing something?
> >
> > @Joel Mathew Thomas, could you give the below patch a spin and see
> > if it helps?
>
> I've tested the patch series along with the additional patch provided.
>
> Kernel: 6.14.0-rc6-00043-g3571e8b091f4-dirty-pci-hotplug-reset-fixes-eventmask-fix
>
> Patches applied:
> - [PATCH 1/4] PCI/hotplug: Disable HPIE over reset
> - [PATCH 2/4] PCI/hotplug: Clearing HPIE for the duration of reset is enough
> - [PATCH 3/4] PCI/hotplug: reset_lock is not required synchronizing with irq thread
> - [PATCH 4/4] PCI/hotplug: Don't enable HPIE in poll mode
> - The latest patch from you:
> +	/* Ignore events masked by pciehp_reset_slot(). */
> +	events &= ctrl->slot_ctrl;
> +	if (!events)
> +		return IRQ_HANDLED;
Could you test *only* the quoted diff, i.e. without patches [1/4] - [4/4],
on top of a recent kernel?
Sorry for not having been clear about this.
I believe that patch [1/4] will re-introduce a deadlock we've
already fixed two years ago, so the small diff above seeks to
replace it with a simpler approach that will hopefully avoid
the issue as well.
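For clarity, a rough sketch of how the two halves are meant to interact. The helper names below are illustrative stand-ins, not the actual pciehp functions (in the real driver, pcie_write_cmd() both writes Slot Control and updates the ctrl->slot_ctrl cache that the quoted ISR hunk consults):
```c
#include <linux/pci.h>
#include <linux/types.h>

/* Illustrative stand-in for pciehp's internal struct controller. */
struct controller {
	u16 slot_ctrl;	/* cached Slot Control value */
};

static void write_slot_ctrl(struct controller *ctrl, u16 val);	/* stand-in for
								 * pcie_write_cmd() */
static void secondary_bus_reset(struct controller *ctrl);	/* stand-in */

/*
 * The reset path masks the link-change and presence-detect enables it
 * caches in ctrl->slot_ctrl; the quoted ISR hunk then drops any event
 * whose enable is cleared, so events raised by the reset itself never
 * wake the IRQ thread and the handler needs no lock ordering against
 * device_lock().
 */
static void reset_slot_sketch(struct controller *ctrl)
{
	u16 mask = PCI_EXP_SLTCTL_DLLSCE | PCI_EXP_SLTCTL_PDCE;

	write_slot_ctrl(ctrl, ctrl->slot_ctrl & ~mask);	/* mask events     */
	secondary_bus_reset(ctrl);
	write_slot_ctrl(ctrl, ctrl->slot_ctrl | mask);	/* restore enables */
}
```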
Thanks,
Lukas
Hi,
I prepared this series about 6 months ago but never got around to
sending it in. In mainline, we got rid of remap_pfn_range() on the
io_uring side, and this series backports that change to 6.6-stable.
It eliminates issues with fragmented memory, so it'd be nice to
have in 6.6-stable too. 6.6-stable also has mmap support for
provided ring buffers, and since we've had an issue related to
remap_pfn_range() there, let's bring that change in as a
precaution as well.
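For context, a rough sketch (illustrative, not the actual io_uring code) of the kind of change being backported: mapping a kernel buffer page by page with vm_insert_page() instead of one remap_pfn_range() call over the whole range. remap_pfn_range() requires a single physically contiguous PFN range, while per-page insertion does not, which is why the switch avoids allocation failures on fragmented memory:
```c
#include <linux/mm.h>

/*
 * Sketch only: map npages individually allocated (possibly physically
 * non-contiguous) pages into a userspace VMA. The caller is assumed to
 * have verified that the VMA spans at least npages pages.
 */
static int mmap_pages_sketch(struct vm_area_struct *vma,
			     struct page **pages, unsigned long npages)
{
	unsigned long i;
	int ret;

	for (i = 0; i < npages; i++) {
		ret = vm_insert_page(vma, vma->vm_start + (i << PAGE_SHIFT),
				     pages[i]);
		if (ret)
			return ret;
	}
	return 0;
}
```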
--
Jens Axboe