From: Andrei Vagin <avagin(a)google.com>
The inc_rlimit_get_ucounts() increments the specified rlimit counter and
then checks its limit. If the value exceeds the limit, the function
returns an error without decrementing the counter.
v2: changed the goto label name [Roman]
Fixes: 15bc01effefe ("ucounts: Fix signal ucount refcounting")
Signed-off-by: Andrei Vagin <avagin(a)google.com>
Co-developed-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Signed-off-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Tested-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Acked-by: Alexey Gladkov <legion(a)kernel.org>
Cc: Kees Cook <kees(a)kernel.org>
Cc: Andrei Vagin <avagin(a)google.com>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Alexey Gladkov <legion(a)kernel.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
---
kernel/ucount.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8c07714ff27d..9469102c5ac0 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -317,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
long new = atomic_long_add_return(1, &iter->rlimit[type]);
if (new < 0 || new > max)
- goto unwind;
+ goto dec_unwind;
if (iter == ucounts)
ret = new;
max = get_userns_rlimit_max(iter->ns, type);
@@ -334,7 +334,6 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
dec_unwind:
dec = atomic_long_sub_return(1, &iter->rlimit[type]);
WARN_ON_ONCE(dec < 0);
-unwind:
do_dec_rlimit_put_ucounts(ucounts, iter, type);
return 0;
}
--
2.47.0.163.g1226f6d8fa-goog
Syzkaller reported a hung task with uevent_show() on stack trace. That
specific issue was addressed by another commit [0], but even with that
fix applied (for example, running v6.12-rc5) we face another type of hung
task that comes from the same reproducer [1]. By investigating that, we
could narrow it to the following path:
(a) Syzkaller emulates a Realtek USB WiFi adapter using raw-gadget and
dummy_hcd infrastructure.
(b) During the probe of rtl8192cu, the driver ends-up performing an efuse
read procedure (which is related to EEPROM load IIUC), and here lies the
issue: the function read_efuse() calls read_efuse_byte() many times, as
loop iterations depending on the efuse size (in our example, 512 in total).
This procedure for reading efuse bytes relies in a loop that performs an
I/O read up to *10k* times in case of failures. We measured the time of
the loop inside read_efuse_byte() alone, and in this reproducer (which
involves the dummy_hcd emulation layer), it takes 15 seconds each. As a
consequence, we have the driver stuck in its probe routine for big time,
exposing a stack trace like below if we attempt to reboot the system, for
example:
task:kworker/0:3 state:D stack:0 pid:662 tgid:662 ppid:2 flags:0x00004000
Workqueue: usb_hub_wq hub_event
Call Trace:
__schedule+0xe22/0xeb6
schedule_timeout+0xe7/0x132
__wait_for_common+0xb5/0x12e
usb_start_wait_urb+0xc5/0x1ef
? usb_alloc_urb+0x95/0xa4
usb_control_msg+0xff/0x184
_usbctrl_vendorreq_sync+0xa0/0x161
_usb_read_sync+0xb3/0xc5
read_efuse_byte+0x13c/0x146
read_efuse+0x351/0x5f0
efuse_read_all_map+0x42/0x52
rtl_efuse_shadow_map_update+0x60/0xef
rtl_get_hwinfo+0x5d/0x1c2
rtl92cu_read_eeprom_info+0x10a/0x8d5
? rtl92c_read_chip_version+0x14f/0x17e
rtl_usb_probe+0x323/0x851
usb_probe_interface+0x278/0x34b
really_probe+0x202/0x4a4
__driver_probe_device+0x166/0x1b2
driver_probe_device+0x2f/0xd8
[...]
We propose hereby to drastically reduce the attempts of doing the I/O reads
in case of failures, restricted to USB devices (given that they're inherently
slower than PCIe ones). By retrying up to 10 times (instead of 10000), we
got reponsiveness in the reproducer, while seems reasonable to believe
that there's no sane USB device implementation in the field requiring this
amount of retries at every I/O read in order to properly work. Based on
that assumption, it'd be good to have it backported to stable but maybe not
since driver implementation (the 10k number comes from day 0), perhaps up
to 6.x series makes sense.
[0] Commit 15fffc6a5624 ("driver core: Fix uevent_show() vs driver detach race").
[1] A note about that: this syzkaller report presents multiple reproducers
that differs by the type of emulated USB device. For this specific case,
check the entry from 2024/08/08 06:23 in the list of crashes; the C repro
is available at https://syzkaller.appspot.com/text?tag=ReproC&x=1521fc83980000.
Cc: stable(a)vger.kernel.org # v6.1+
Reported-by: syzbot+edd9fe0d3a65b14588d5(a)syzkaller.appspotmail.com
Signed-off-by: Guilherme G. Piccoli <gpiccoli(a)igalia.com>
---
V3:
- Switch variable declaration to reverse xmas tree order
(thanks Ping-Ke Shih for the suggestion).
V2:
- Restrict the change to USB device only (thanks Ping-Ke Shih).
- Tested in 2 USB devices by Bitterblue Smith - thanks a lot!
V1: https://lore.kernel.org/lkml/20241025150226.896613-1-gpiccoli@igalia.com/
drivers/net/wireless/realtek/rtlwifi/efuse.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/realtek/rtlwifi/efuse.c b/drivers/net/wireless/realtek/rtlwifi/efuse.c
index 82cf5fb5175f..0ff553f650f9 100644
--- a/drivers/net/wireless/realtek/rtlwifi/efuse.c
+++ b/drivers/net/wireless/realtek/rtlwifi/efuse.c
@@ -162,9 +162,19 @@ void efuse_write_1byte(struct ieee80211_hw *hw, u16 address, u8 value)
void read_efuse_byte(struct ieee80211_hw *hw, u16 _offset, u8 *pbuf)
{
struct rtl_priv *rtlpriv = rtl_priv(hw);
+ u16 retry, max_attempts;
u32 value32;
u8 readbyte;
- u16 retry;
+
+ /*
+ * In case of USB devices, transfer speeds are limited, hence
+ * efuse I/O reads could be (way) slower. So, decrease (a lot)
+ * the read attempts in case of failures.
+ */
+ if (rtlpriv->rtlhal.interface == INTF_PCI)
+ max_attempts = 10000;
+ else
+ max_attempts = 10;
rtl_write_byte(rtlpriv, rtlpriv->cfg->maps[EFUSE_CTRL] + 1,
(_offset & 0xff));
@@ -178,7 +188,7 @@ void read_efuse_byte(struct ieee80211_hw *hw, u16 _offset, u8 *pbuf)
retry = 0;
value32 = rtl_read_dword(rtlpriv, rtlpriv->cfg->maps[EFUSE_CTRL]);
- while (!(((value32 >> 24) & 0xff) & 0x80) && (retry < 10000)) {
+ while (!(((value32 >> 24) & 0xff) & 0x80) && (retry < max_attempts)) {
value32 = rtl_read_dword(rtlpriv,
rtlpriv->cfg->maps[EFUSE_CTRL]);
retry++;
--
2.46.2
The patch titled
Subject: mm: fix __wp_page_copy_user fallback path for remote mm
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-fix-__wp_page_copy_user-fallback-path-for-remote-mm.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Asahi Lina <lina(a)asahilina.net>
Subject: mm: fix __wp_page_copy_user fallback path for remote mm
Date: Fri, 01 Nov 2024 21:08:02 +0900
If the source page is a PFN mapping, we copy back from userspace.
However, if this fault is a remote access, we cannot use
__copy_from_user_inatomic. Instead, use access_remote_vm() in this case.
Fixes WARN and incorrect zero-filling when writing to CoW mappings in
a remote process, such as when using gdb on a binary present on a DAX
filesystem.
[ 143.683782] ------------[ cut here ]------------
[ 143.683784] WARNING: CPU: 1 PID: 350 at mm/memory.c:2904 __wp_page_copy_user+0x120/0x2bc
[ 143.683793] CPU: 1 PID: 350 Comm: gdb Not tainted 6.6.52 #1
[ 143.683794] Hardware name: linux,dummy-virt (DT)
[ 143.683795] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 143.683796] pc : __wp_page_copy_user+0x120/0x2bc
[ 143.683798] lr : __wp_page_copy_user+0x254/0x2bc
[ 143.683799] sp : ffff80008272b8b0
[ 143.683799] x29: ffff80008272b8b0 x28: 0000000000000000 x27: ffff000083bad580
[ 143.683801] x26: 0000000000000000 x25: 0000fffff7fd5000 x24: ffff000081db04c0
[ 143.683802] x23: ffff00014f24b000 x22: fffffc00053c92c0 x21: ffff000083502150
[ 143.683803] x20: 0000fffff7fd5000 x19: ffff80008272b9d0 x18: 0000000000000000
[ 143.683804] x17: ffff000081db0500 x16: ffff800080fe52a0 x15: 0000fffff7fd5000
[ 143.683804] x14: 0000000000bb1845 x13: 0000000000000080 x12: ffff80008272b880
[ 143.683805] x11: ffff000081d13600 x10: ffff000081d13608 x9 : ffff000081d1360c
[ 143.683806] x8 : ffff000083a16f00 x7 : 0000000000000010 x6 : ffff00014f24b000
[ 143.683807] x5 : ffff00014f24c000 x4 : 0000000000000000 x3 : ffff000083582000
[ 143.683807] x2 : 0000000000000f80 x1 : 0000fffff7fd5000 x0 : 0000000000001000
[ 143.683808] Call trace:
[ 143.683809] __wp_page_copy_user+0x120/0x2bc
[ 143.683810] wp_page_copy+0x98/0x5c0
[ 143.683813] do_wp_page+0x250/0x530
[ 143.683814] __handle_mm_fault+0x278/0x284
[ 143.683817] handle_mm_fault+0x64/0x1e8
[ 143.683819] faultin_page+0x5c/0x110
[ 143.683820] __get_user_pages+0xc8/0x2f4
[ 143.683821] get_user_pages_remote+0xac/0x30c
[ 143.683823] __access_remote_vm+0xb4/0x368
[ 143.683824] access_remote_vm+0x10/0x1c
[ 143.683826] mem_rw.isra.0+0xc4/0x218
[ 143.683831] mem_write+0x18/0x24
[ 143.683831] vfs_write+0xa0/0x37c
[ 143.683834] ksys_pwrite64+0x7c/0xc0
[ 143.683834] __arm64_sys_pwrite64+0x20/0x2c
[ 143.683835] invoke_syscall+0x48/0x10c
[ 143.683837] el0_svc_common.constprop.0+0x40/0xe0
[ 143.683839] do_el0_svc+0x1c/0x28
[ 143.683841] el0_svc+0x3c/0xdc
[ 143.683846] el0t_64_sync_handler+0x120/0x12c
[ 143.683848] el0t_64_sync+0x194/0x198
[ 143.683849] ---[ end trace 0000000000000000 ]---
Link: https://lkml.kernel.org/r/20241101-mm-remote-pfn-v1-1-080b609270b7@asahilin…
Fixes: 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF is cleared")
Signed-off-by: Asahi Lina <lina(a)asahilina.net>
Cc: Jia He <justin.he(a)arm.com>
Cc: Yibo Cai <Yibo.Cai(a)arm.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Asahi Lina <lina(a)asahilina.net>
Cc: Sergio Lopez Pascual <slp(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- a/mm/memory.c~mm-fix-__wp_page_copy_user-fallback-path-for-remote-mm
+++ a/mm/memory.c
@@ -3081,13 +3081,18 @@ static inline int __wp_page_copy_user(st
update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1);
}
+ /* If the mm is a remote mm, copy in the page using access_remote_vm() */
+ if (current->mm != mm) {
+ if (access_remote_vm(mm, (unsigned long)uaddr, kaddr, PAGE_SIZE, 0) != PAGE_SIZE)
+ goto warn;
+ }
/*
* This really shouldn't fail, because the page is there
* in the page tables. But it might just be unreadable,
* in which case we just give up and fill the result with
* zeroes.
*/
- if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
+ else if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
if (vmf->pte)
goto warn;
_
Patches currently in -mm which might be from lina(a)asahilina.net are
mm-fix-__wp_page_copy_user-fallback-path-for-remote-mm.patch
The patch titled
Subject: selftests: hugetlb_dio: check for initial conditions to skip in the start
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
selftests-hugetlb_dio-check-for-initial-conditions-to-skip-in-the-start.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
Subject: selftests: hugetlb_dio: check for initial conditions to skip in the start
Date: Fri, 1 Nov 2024 19:15:57 +0500
The test should be skipped if initial conditions aren't fulfilled in the
start instead of failing and outputting non-compliant TAP logs. This kind
of failure pollutes the results. The initial conditions are:
- The test should only execute if /tmp file can be allocated.
- The test should only execute if huge pages are free.
Before:
TAP version 13
1..4
Bail out! Error opening file
: Read-only file system (30)
# Planned tests != run tests (4 != 0)
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
After:
TAP version 13
1..0 # SKIP Unable to allocate file: Read-only file system
Link: https://lkml.kernel.org/r/20241101141557.3159432-1-usama.anjum@collabora.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
Fixes: 3a103b5315b7 ("selftest: mm: Test if hugepage does not get leaked during __bio_release_pages()")
Cc: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Donet Tom <donettom(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/hugetlb_dio.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
--- a/tools/testing/selftests/mm/hugetlb_dio.c~selftests-hugetlb_dio-check-for-initial-conditions-to-skip-in-the-start
+++ a/tools/testing/selftests/mm/hugetlb_dio.c
@@ -44,13 +44,6 @@ void run_dio_using_hugetlb(unsigned int
if (fd < 0)
ksft_exit_fail_perror("Error opening file\n");
- /* Get the free huge pages before allocation */
- free_hpage_b = get_free_hugepages();
- if (free_hpage_b == 0) {
- close(fd);
- ksft_exit_skip("No free hugepage, exiting!\n");
- }
-
/* Allocate a hugetlb page */
orig_buffer = mmap(NULL, h_pagesize, mmap_prot, mmap_flags, -1, 0);
if (orig_buffer == MAP_FAILED) {
@@ -94,8 +87,20 @@ void run_dio_using_hugetlb(unsigned int
int main(void)
{
size_t pagesize = 0;
+ int fd;
ksft_print_header();
+
+ /* Open the file to DIO */
+ fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
+ if (fd < 0)
+ ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno));
+ close(fd);
+
+ /* Check if huge pages are free */
+ if (!get_free_hugepages())
+ ksft_exit_skip("No free hugepage, exiting\n");
+
ksft_set_plan(4);
/* Get base page size */
_
Patches currently in -mm which might be from usama.anjum(a)collabora.com are
selftests-hugetlb_dio-check-for-initial-conditions-to-skip-in-the-start.patch
From: Kalesh Singh <kaleshsingh(a)google.com>
Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API")
converted tracefs to use the new mount APIs caused mount options
(e.g. gid=<gid>) to not take effect.
The tracefs superblock can be updated from multiple paths:
- on fs_initcall() to init_trace_printk_function_export()
- from a work queue to initialize eventfs
tracer_init_tracefs_work_func()
- fsconfig() syscall to mount or remount of tracefs
The tracefs superblock root inode gets created early on in
init_trace_printk_function_export().
With the new mount API, tracefs effectively uses get_tree_single() instead
of the old API mount_single().
Previously, mount_single() ensured that the options are always applied to
the superblock root inode:
(1) If the root inode didn't exist, call fill_super() to create it
and apply the options.
(2) If the root inode exists, call reconfigure_single() which
effectively calls tracefs_apply_options() to parse and apply
options to the subperblock's fs_info and inode and remount
eventfs (if necessary)
On the other hand, get_tree_single() effectively calls vfs_get_super()
which:
(3) If the root inode doesn't exists, calls fill_super() to create it
and apply the options.
(4) If the root inode already exists, updates the fs_context root
with the superblock's root inode.
(4) above is always the case for tracefs mounts, since the super block's
root inode will already be created by init_trace_printk_function_export().
This means that the mount options get ignored:
- Since it isn't applied to the superblock's root inode, it doesn't
get inherited by the children.
- Since eventfs is initialized from a separate work queue and
before call to mount with the options, and it doesn't get remounted
for mount.
Ensure that the mount options are applied to the super block and eventfs
is remounted to respect the mount options.
To understand this better, if fstab has the following:
tracefs /sys/kernel/tracing tracefs nosuid,nodev,noexec,gid=tracing 0 0
On boot up, permissions look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root root 0 Nov 1 08:37 /sys/kernel/tracing/trace
When it should look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root tracing 0 Nov 1 08:37 /sys/kernel/tracing/trace
Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/
Cc: Eric Sandeen <sandeen(a)redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Ali Zahraee <ahzahraee(a)gmail.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: 78ff64081949 ("vfs: Convert tracefs to use the new mount API")
Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh(a)google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
fs/tracefs/inode.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 1748dff58c3b..cfc614c638da 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -392,6 +392,9 @@ static int tracefs_reconfigure(struct fs_context *fc)
struct tracefs_fs_info *sb_opts = sb->s_fs_info;
struct tracefs_fs_info *new_opts = fc->s_fs_info;
+ if (!new_opts)
+ return 0;
+
sync_filesystem(sb);
/* structure copy of new mount options to sb */
*sb_opts = *new_opts;
@@ -478,14 +481,17 @@ static int tracefs_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_op = &tracefs_super_operations;
sb->s_d_op = &tracefs_dentry_operations;
- tracefs_apply_options(sb, false);
-
return 0;
}
static int tracefs_get_tree(struct fs_context *fc)
{
- return get_tree_single(fc, tracefs_fill_super);
+ int err = get_tree_single(fc, tracefs_fill_super);
+
+ if (err)
+ return err;
+
+ return tracefs_reconfigure(fc);
}
static void tracefs_free_fc(struct fs_context *fc)
--
2.45.2
The patch titled
Subject: signal: restore the override_rlimit logic
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
signal-restore-the-override_rlimit-logic.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Roman Gushchin <roman.gushchin(a)linux.dev>
Subject: signal: restore the override_rlimit logic
Date: Thu, 31 Oct 2024 20:04:38 +0000
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of
signals. However now it's enforced unconditionally, even if
override_rlimit is set. This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV
signal, sigqueue_alloc fails to allocate the necessary resources for the
signal delivery, preventing the signal from being delivered with siginfo.
This prevents the process from correctly identifying the fault address and
handling the error. From the user-space perspective, applications are
unaware that the limit has been reached and that the siginfo is
effectively 'corrupted'. This can lead to unpredictable behavior and
crashes, as we observed with java applications.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip
the comparison to max there if override_rlimit is set. This effectively
restores the old behavior.
Link: https://lkml.kernel.org/r/20241031200438.2951287-1-roman.gushchin@linux.dev
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Signed-off-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Co-developed-by: Andrei Vagin <avagin(a)google.com>
Signed-off-by: Andrei Vagin <avagin(a)google.com>
Cc: Kees Cook <kees(a)kernel.org>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Alexey Gladkov <legion(a)kernel.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/user_namespace.h | 3 ++-
kernel/signal.c | 3 ++-
kernel/ucount.c | 5 +++--
3 files changed, 7 insertions(+), 4 deletions(-)
--- a/include/linux/user_namespace.h~signal-restore-the-override_rlimit-logic
+++ a/include/linux/user_namespace.h
@@ -141,7 +141,8 @@ static inline long get_rlimit_value(stru
long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
-long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type);
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
+ bool override_rlimit);
void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
--- a/kernel/signal.c~signal-restore-the-override_rlimit-logic
+++ a/kernel/signal.c
@@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_st
*/
rcu_read_lock();
ucounts = task_ucounts(t);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
+ sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
+ override_rlimit);
rcu_read_unlock();
if (!sigpending)
return NULL;
--- a/kernel/ucount.c~signal-restore-the-override_rlimit-logic
+++ a/kernel/ucount.c
@@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucoun
do_dec_rlimit_put_ucounts(ucounts, NULL, type);
}
-long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
+ bool override_rlimit)
{
/* Caller must hold a reference to ucounts */
struct ucounts *iter;
@@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucoun
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
long new = atomic_long_add_return(1, &iter->rlimit[type]);
- if (new < 0 || new > max)
+ if (new < 0 || (!override_rlimit && (new > max)))
goto unwind;
if (iter == ucounts)
ret = new;
_
Patches currently in -mm which might be from roman.gushchin(a)linux.dev are
signal-restore-the-override_rlimit-logic.patch
The GGTT looks to be stored inside stolen memory on igpu which is not
treated as normal RAM. The core kernel skips this memory range when
creating the hibernation image, therefore when coming back from
hibernation the GGTT programming is lost. This seems to cause issues
with broken resume where GuC FW fails to load:
[drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 10ms, freq = 1250MHz (req 1300MHz), done = -1
[drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01
[drm] *ERROR* GT0: firmware signature verification failed
[drm] *ERROR* CRITICAL: Xe has declared device 0000:00:02.0 as wedged.
Current GGTT users are kernel internal and tracked as pinned, so it
should be possible to hook into the existing save/restore logic that we
use for dgpu, where the actual evict is skipped but on restore we
importantly restore the GGTT programming. This has been confirmed to
fix hibernation on at least ADL and MTL, though likely all igpu
platforms are affected.
This also means we have a hole in our testing, where the existing s4
tests only really test the driver hooks, and don't go as far as actually
rebooting and restoring from the hibernation image and in turn powering
down RAM (and therefore losing the contents of stolen).
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3275
Signed-off-by: Matthew Auld <matthew.auld(a)intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.8+
---
drivers/gpu/drm/xe/xe_bo.c | 36 ++++++++++++++------------------
drivers/gpu/drm/xe/xe_bo_evict.c | 6 ------
2 files changed, 16 insertions(+), 26 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index d79d8ef5c7d5..0ae5c8f7bab8 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -950,7 +950,10 @@ int xe_bo_restore_pinned(struct xe_bo *bo)
if (WARN_ON(!xe_bo_is_pinned(bo)))
return -EINVAL;
- if (WARN_ON(xe_bo_is_vram(bo) || !bo->ttm.ttm))
+ if (WARN_ON(xe_bo_is_vram(bo)))
+ return -EINVAL;
+
+ if (WARN_ON(!bo->ttm.ttm && !xe_bo_is_stolen(bo)))
return -EINVAL;
if (!mem_type_is_vram(place->mem_type))
@@ -1770,6 +1773,7 @@ int xe_bo_pin_external(struct xe_bo *bo)
int xe_bo_pin(struct xe_bo *bo)
{
+ struct ttm_place *place = &(bo->placements[0]);
struct xe_device *xe = xe_bo_device(bo);
int err;
@@ -1800,7 +1804,6 @@ int xe_bo_pin(struct xe_bo *bo)
*/
if (IS_DGFX(xe) && !(IS_ENABLED(CONFIG_DRM_XE_DEBUG) &&
bo->flags & XE_BO_FLAG_INTERNAL_TEST)) {
- struct ttm_place *place = &(bo->placements[0]);
if (mem_type_is_vram(place->mem_type)) {
xe_assert(xe, place->flags & TTM_PL_FLAG_CONTIGUOUS);
@@ -1809,13 +1812,12 @@ int xe_bo_pin(struct xe_bo *bo)
vram_region_gpu_offset(bo->ttm.resource)) >> PAGE_SHIFT;
place->lpfn = place->fpfn + (bo->size >> PAGE_SHIFT);
}
+ }
- if (mem_type_is_vram(place->mem_type) ||
- bo->flags & XE_BO_FLAG_GGTT) {
- spin_lock(&xe->pinned.lock);
- list_add_tail(&bo->pinned_link, &xe->pinned.kernel_bo_present);
- spin_unlock(&xe->pinned.lock);
- }
+ if (mem_type_is_vram(place->mem_type) || bo->flags & XE_BO_FLAG_GGTT) {
+ spin_lock(&xe->pinned.lock);
+ list_add_tail(&bo->pinned_link, &xe->pinned.kernel_bo_present);
+ spin_unlock(&xe->pinned.lock);
}
ttm_bo_pin(&bo->ttm);
@@ -1863,24 +1865,18 @@ void xe_bo_unpin_external(struct xe_bo *bo)
void xe_bo_unpin(struct xe_bo *bo)
{
+ struct ttm_place *place = &(bo->placements[0]);
struct xe_device *xe = xe_bo_device(bo);
xe_assert(xe, !bo->ttm.base.import_attach);
xe_assert(xe, xe_bo_is_pinned(bo));
- if (IS_DGFX(xe) && !(IS_ENABLED(CONFIG_DRM_XE_DEBUG) &&
- bo->flags & XE_BO_FLAG_INTERNAL_TEST)) {
- struct ttm_place *place = &(bo->placements[0]);
-
- if (mem_type_is_vram(place->mem_type) ||
- bo->flags & XE_BO_FLAG_GGTT) {
- spin_lock(&xe->pinned.lock);
- xe_assert(xe, !list_empty(&bo->pinned_link));
- list_del_init(&bo->pinned_link);
- spin_unlock(&xe->pinned.lock);
- }
+ if (mem_type_is_vram(place->mem_type) || bo->flags & XE_BO_FLAG_GGTT) {
+ spin_lock(&xe->pinned.lock);
+ xe_assert(xe, !list_empty(&bo->pinned_link));
+ list_del_init(&bo->pinned_link);
+ spin_unlock(&xe->pinned.lock);
}
-
ttm_bo_unpin(&bo->ttm);
}
diff --git a/drivers/gpu/drm/xe/xe_bo_evict.c b/drivers/gpu/drm/xe/xe_bo_evict.c
index 32043e1e5a86..b01bc20eb90b 100644
--- a/drivers/gpu/drm/xe/xe_bo_evict.c
+++ b/drivers/gpu/drm/xe/xe_bo_evict.c
@@ -34,9 +34,6 @@ int xe_bo_evict_all(struct xe_device *xe)
u8 id;
int ret;
- if (!IS_DGFX(xe))
- return 0;
-
/* User memory */
for (mem_type = XE_PL_VRAM0; mem_type <= XE_PL_VRAM1; ++mem_type) {
struct ttm_resource_manager *man =
@@ -125,9 +122,6 @@ int xe_bo_restore_kernel(struct xe_device *xe)
struct xe_bo *bo;
int ret;
- if (!IS_DGFX(xe))
- return 0;
-
spin_lock(&xe->pinned.lock);
for (;;) {
bo = list_first_entry_or_null(&xe->pinned.evicted,
--
2.47.0
On Fri, Nov 1, 2024 at 6:58 AM Sasha Levin <sashal(a)kernel.org> wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> lib: alloc_tag_module_unload must wait for pending kfree_rcu calls
>
> to the 6.11-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> lib-alloc_tag_module_unload-must-wait-for-pending-kf.patch
> and it can be found in the queue-6.11 subdirectory.
Thanks Sasha! Could you please double-check that the prerequisite
patch https://lore.kernel.org/all/20241021171003.2907935-1-surenb@google.com/
was also picked up? I don't see it in the queue-6.11 directory.
Without that patch this one will cause build errors, that's why I sent
them as a patchset.
Thanks,
Suren.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit 536dfe685ebd28b27ebfbc3d4b9168207b7e28a3
> Author: Florian Westphal <fw(a)strlen.de>
> Date: Mon Oct 7 22:52:24 2024 +0200
>
> lib: alloc_tag_module_unload must wait for pending kfree_rcu calls
>
> [ Upstream commit dc783ba4b9df3fb3e76e968b2cbeb9960069263c ]
>
> Ben Greear reports following splat:
> ------------[ cut here ]------------
> net/netfilter/nf_nat_core.c:1114 module nf_nat func:nf_nat_register_fn has 256 allocated at module unload
> WARNING: CPU: 1 PID: 10421 at lib/alloc_tag.c:168 alloc_tag_module_unload+0x22b/0x3f0
> Modules linked in: nf_nat(-) btrfs ufs qnx4 hfsplus hfs minix vfat msdos fat
> ...
> Hardware name: Default string Default string/SKYBAY, BIOS 5.12 08/04/2020
> RIP: 0010:alloc_tag_module_unload+0x22b/0x3f0
> codetag_unload_module+0x19b/0x2a0
> ? codetag_load_module+0x80/0x80
>
> nf_nat module exit calls kfree_rcu on those addresses, but the free
> operation is likely still pending by the time alloc_tag checks for leaks.
>
> Wait for outstanding kfree_rcu operations to complete before checking
> resolves this warning.
>
> Reproducer:
> unshare -n iptables-nft -t nat -A PREROUTING -p tcp
> grep nf_nat /proc/allocinfo # will list 4 allocations
> rmmod nft_chain_nat
> rmmod nf_nat # will WARN.
>
> [akpm(a)linux-foundation.org: add comment]
> Link: https://lkml.kernel.org/r/20241007205236.11847-1-fw@strlen.de
> Fixes: a473573964e5 ("lib: code tagging module support")
> Signed-off-by: Florian Westphal <fw(a)strlen.de>
> Reported-by: Ben Greear <greearb(a)candelatech.com>
> Closes: https://lore.kernel.org/netdev/bdaaef9d-4364-4171-b82b-bcfc12e207eb@candela…
> Cc: Uladzislau Rezki <urezki(a)gmail.com>
> Cc: Vlastimil Babka <vbabka(a)suse.cz>
> Cc: Suren Baghdasaryan <surenb(a)google.com>
> Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
> Cc: <stable(a)vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/lib/codetag.c b/lib/codetag.c
> index afa8a2d4f3173..d1fbbb7c2ec3d 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -228,6 +228,9 @@ bool codetag_unload_module(struct module *mod)
> if (!mod)
> return true;
>
> + /* await any module's kfree_rcu() operations to complete */
> + kvfree_rcu_barrier();
> +
> mutex_lock(&codetag_lock);
> list_for_each_entry(cttype, &codetag_types, link) {
> struct codetag_module *found = NULL;