September 2023 - Linux-stable-mirror

[PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec

by Aleksa Sarai

The most critical issue with vm.memfd_noexec=2 (the fact that passing
MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's tree[2],
but there are still some outstanding issues that need to be addressed:

 * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
   because it will make it far too difficult to ever migrate. Instead it
   should imply MFD_EXEC.

 * The dmesg warnings are pr_warn_once(), which on most systems means
   that they will be used up by systemd or some other boot process and
   userspace developers will never see it.

   - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
     rate-limited message to the kernel log is necessary to tell
     userspace that they should add the new flags.

     Arguably the most ideal way to deal with the spam concern[3,4]
     while still prompting userspace to switch to the new flags would be
     to only log the warning once per task or something similar.
     However, adding something to task_struct for tracking this would be
     needless bloat for a single pr_warn_ratelimited().

     So just switch to pr_info_ratelimited() to avoid spamming the log
     with something that isn't a real warning. There's lots of
     info-level stuff in dmesg, it seems really unlikely that this
     should be an actual problem. Most programs are already switching to
     the new flags anyway.

   - For the vm.memfd_noexec=2 case, we need to log a warning for every
     failure because otherwise userspace will have no idea why their
     previously working program started returning -EACCES (previously
     -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.

 * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
   unappealing for most users to enable the sysctl because enabling it
   on &init_pid_ns means you need a system reboot to unset it. Given the
   actual security threat being protected against, CAP_SYS_ADMIN users
   being restricted in this way makes little sense.

   The argument for this ratcheting by the original author was that it
   allows you to have a hierarchical setting that cannot be unset by
   child pidnses, but this is not accurate -- changing the parent
   pidns's vm.memfd_noexec setting to be more restrictive didn't affect
   children.

   Instead, switch the vm.memfd_noexec sysctl to be properly
   hierarchical and allow CAP_SYS_ADMIN users (in the pidns's owning
   userns) to lower the setting as long as it is not lower than the
   parent's effective setting. This change also makes it so that
   changing a parent pidns's vm.memfd_noexec will affect all
   descendants, providing a properly hierarchical setting. The
   performance impact of this is incredibly minimal since the maximum
   depth of pidns is 32 and it is only checked during memfd_create(2)
   and unshare(CLONE_NEWPID).

 * The memfd selftests would not exit with a non-zero error code when
   certain tests that ran in a forked process (specifically the ones
   related to MFD_EXEC and MFD_NOEXEC_SEAL) failed.

[1]: https://lore.kernel.org/all/ZJwcsU0vI-nzgOB_@codewreck.org/
[2]: https://lore.kernel.org/all/20230705063315.3680666-1-jeffxu@google.com/
[3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundatio…

Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v2:
- Make vm.memfd_noexec restrictions properly hierarchical.
- Allow vm.memfd_noexec setting to be lowered by CAP_SYS_ADMIN as long
  as it is not lower than the parent's effective setting.
- Fix the logging behaviour related to the new flags and
  vm.memfd_noexec=2.
- Add more thorough tests for vm.memfd_noexec in selftests.
- v1: <https://lore.kernel.org/r/20230713143406.14342-1-cyphar@cyphar.com>

---
Aleksa Sarai (5):
      selftests: memfd: error out test process when child test fails
      memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
      memfd: improve userspace warnings for missing exec-related flags
      memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
      selftests: improve vm.memfd_noexec sysctl tests

 include/linux/pid_namespace.h              |  39 ++--
 kernel/pid.c                               |   3 +
 kernel/pid_namespace.c                     |   6 +-
 kernel/pid_sysctl.h                        |  28 ++-
 mm/memfd.c                                 |  33 ++-
 tools/testing/selftests/memfd/memfd_test.c | 332 +++++++++++++++++++++++------
 6 files changed, 322 insertions(+), 119 deletions(-)
---
base-commit: 3ff995246e801ea4de0a30860a1d8da4aeb538e7
change-id: 20230803-memfd-vm-noexec-uapi-fixes-ace725c67b0f

Best regards,
-- 
Aleksa Sarai <cyphar(a)cyphar.com>
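
To make the hierarchical rule described in the cover letter concrete, here is a small userspace model. It is only a sketch: `struct toy_pidns`, `toy_effective_noexec()` and `toy_set_noexec()` are hypothetical names invented for illustration, not the kernel's data structures, and the choice of -EPERM for a rejected write is an assumption of the model. The effective setting is taken to be the most restrictive value on the path to the root, and a write is accepted only if it is not lower than the parent's effective setting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a pid namespace; NOT the kernel's struct pid_namespace. */
struct toy_pidns {
	struct toy_pidns *parent;	/* NULL for the initial namespace */
	int memfd_noexec;		/* this namespace's own setting: 0, 1 or 2 */
};

/* Effective setting: the most restrictive value on the path to the root. */
static int toy_effective_noexec(const struct toy_pidns *ns)
{
	int scope = 0;

	for (; ns; ns = ns->parent)
		if (ns->memfd_noexec > scope)
			scope = ns->memfd_noexec;
	return scope;
}

/*
 * Model of a privileged write to the sysctl: the value may be raised or
 * lowered, but never below the parent's effective setting.
 * The -EPERM return is an assumption made for this sketch.
 */
static int toy_set_noexec(struct toy_pidns *ns, int value)
{
	assert(ns != NULL);
	if (value < 0 || value > 2)
		return -EINVAL;
	if (ns->parent && value < toy_effective_noexec(ns->parent))
		return -EPERM;
	ns->memfd_noexec = value;
	return 0;
}
```

In this model, raising the root's value immediately changes what every descendant sees, and lowering it again needs no reboot, which is the behaviour the cover letter argues for. The walk is bounded by the namespace nesting depth, which the cover letter gives as at most 32.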

[PATH 6.4.y] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages

by Luiz Capitulino

From: Sean Christopherson <seanjc(a)google.com>

Commit 0b210faf337314e4bc88e796218bc70c72a51209 upstream.

Add a "never" option to the nx_huge_pages module param to allow userspace
to do a one-way hard disabling of the mitigation, and don't create the
per-VM recovery threads when the mitigation is hard disabled. Letting
userspace pinky swear that userspace doesn't want to enable NX mitigation
(without reloading KVM) allows certain use cases to avoid the latency
problems associated with spawning a kthread for each VM.

E.g. in FaaS use cases, the guest kernel is trusted and the host may
create 100+ VMs per logical CPU, which can result in 100ms+ latencies when
a burst of VMs is created.

Reported-by: Li RongQing <lirongqing(a)baidu.com>
Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@ba…
Cc: Yong He <zhuangel570(a)gmail.com>
Cc: Robert Hoo <robert.hoo.linux(a)gmail.com>
Cc: Kai Huang <kai.huang(a)intel.com>
Reviewed-by: Robert Hoo <robert.hoo.linux(a)gmail.com>
Acked-by: Kai Huang <kai.huang(a)intel.com>
Tested-by: Luiz Capitulino <luizcap(a)amazon.com>
Reviewed-by: Li RongQing <lirongqing(a)baidu.com>
Link: https://lore.kernel.org/r/20230602005859.784190-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Luiz Capitulino <luizcap(a)amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

I submitted this backport for 6.1.y[1] but we agreed that having it for
6.4.y is desirable to allow upgrade path.

Tests performed:
 * Confirmed KVM_CREATE_VM latency goes down to less than 1ms
 * Quickly booted a simple guest with kvmtool

[1] https://lore.kernel.org/stable/cover.1693593288.git.luizcap@amazon.com/

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eaa3d6994ae..11c050f40d82 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -58,6 +58,8 @@

 extern bool itlb_multihit_kvm_mitigation;

+static bool nx_hugepage_mitigation_hard_disabled;
+
 int __read_mostly nx_huge_pages = -1;
 static uint __read_mostly nx_huge_pages_recovery_period_ms;
 #ifdef CONFIG_PREEMPT_RT
@@ -67,12 +69,13 @@ static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
 static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
 #endif

+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
 static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
 static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp);

 static const struct kernel_param_ops nx_huge_pages_ops = {
 	.set = set_nx_huge_pages,
-	.get = param_get_bool,
+	.get = get_nx_huge_pages,
 };

 static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
@@ -6844,6 +6847,14 @@ static void mmu_destroy_caches(void)
 	kmem_cache_destroy(mmu_page_header_cache);
 }

+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
+{
+	if (nx_hugepage_mitigation_hard_disabled)
+		return sprintf(buffer, "never\n");
+
+	return param_get_bool(buffer, kp);
+}
+
 static bool get_nx_auto_mode(void)
 {
 	/* Return true when CPU has the bug, and mitigations are ON */
@@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	bool old_val = nx_huge_pages;
 	bool new_val;

+	if (nx_hugepage_mitigation_hard_disabled)
+		return -EPERM;
+
 	/* In "auto" mode deploy workaround only if CPU has the bug. */
-	if (sysfs_streq(val, "off"))
+	if (sysfs_streq(val, "off")) {
 		new_val = 0;
-	else if (sysfs_streq(val, "force"))
+	} else if (sysfs_streq(val, "force")) {
 		new_val = 1;
-	else if (sysfs_streq(val, "auto"))
+	} else if (sysfs_streq(val, "auto")) {
 		new_val = get_nx_auto_mode();
-	else if (kstrtobool(val, &new_val) < 0)
+	} else if (sysfs_streq(val, "never")) {
+		new_val = 0;
+
+		mutex_lock(&kvm_lock);
+		if (!list_empty(&vm_list)) {
+			mutex_unlock(&kvm_lock);
+			return -EBUSY;
+		}
+		nx_hugepage_mitigation_hard_disabled = true;
+		mutex_unlock(&kvm_lock);
+	} else if (kstrtobool(val, &new_val) < 0) {
 		return -EINVAL;
+	}

 	__set_nx_huge_pages(new_val);

@@ -7006,6 +7031,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 	uint old_period, new_period;
 	int err;

+	if (nx_hugepage_mitigation_hard_disabled)
+		return -EPERM;
+
 	was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);
 	err = param_set_uint(val, kp);

@@ -7164,6 +7192,9 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 {
 	int err;

+	if (nx_hugepage_mitigation_hard_disabled)
+		return 0;
+
 	err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
 					  "kvm-nx-lpage-recovery",
 					  &kvm->arch.nx_huge_page_recovery_thread);
-- 
2.40.1
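
The one-way nature of "never" is easier to see in isolation. The following userspace model is a sketch only: every `toy_*` name is hypothetical, `toy_vm_count` stands in for the patch's `!list_empty(&vm_list)` check, and the "auto" and boolean-string cases are left out for brevity. It mirrors the ordering in the patch: the sticky flag is tested first, "never" is refused while any VM exists, and once set it cannot be undone.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

static bool toy_hard_disabled;	/* sticky: never cleared once set */
static bool toy_mitigation_on;
static int toy_vm_count;	/* stand-in for "vm_list is non-empty" */

/* Model of writing a string to the nx_huge_pages parameter. */
static int toy_set_nx_huge_pages(const char *val)
{
	/* Once hard disabled, every later write is rejected. */
	if (toy_hard_disabled)
		return -EPERM;

	if (strcmp(val, "off") == 0) {
		toy_mitigation_on = false;
	} else if (strcmp(val, "force") == 0) {
		toy_mitigation_on = true;
	} else if (strcmp(val, "never") == 0) {
		/* Refuse while VMs exist; they may already own recovery threads. */
		if (toy_vm_count > 0)
			return -EBUSY;
		toy_mitigation_on = false;
		toy_hard_disabled = true;
	} else {
		return -EINVAL;
	}
	return 0;
}

/* Model of the VM-creation path: no recovery thread when hard disabled. */
static bool toy_needs_recovery_thread(void)
{
	return !toy_hard_disabled;
}
```

Because the flag can only be set while no VM exists and is never cleared, the VM-creation path can skip spawning the per-VM thread without having to handle the mitigation being switched back on later.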

[PATCH] i2c: aspeed: Reset the i2c controller when timeout occurs

by Tommy Huang

Reset the i2c controller when an i2c transfer timeout occurs.
The remaining interrupts and device should be reset to avoid
unpredictable controller behavior.

Fixes: 2e57b7cebb98 ("i2c: aspeed: Add multi-master use case support")
Cc: Jae Hyun Yoo <jae.hyun.yoo(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> # v5.1+
Signed-off-by: Tommy Huang <tommy_huang(a)aspeedtech.com>
---
 drivers/i2c/busses/i2c-aspeed.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/i2c/busses/i2c-aspeed.c b/drivers/i2c/busses/i2c-aspeed.c
index 2e5acfeb76c8..5a416b39b818 100644
--- a/drivers/i2c/busses/i2c-aspeed.c
+++ b/drivers/i2c/busses/i2c-aspeed.c
@@ -698,13 +698,16 @@ static int aspeed_i2c_master_xfer(struct i2c_adapter *adap,

 	if (time_left == 0) {
 		/*
-		 * If timed out and bus is still busy in a multi master
-		 * environment, attempt recovery at here.
+		 * In a multi-master setup, if a timeout occurs, attempt
+		 * recovery. But if the bus is idle, we still need to reset the
+		 * i2c controller to clear the remaining interrupts.
 		 */
 		if (bus->multi_master &&
 		    (readl(bus->base + ASPEED_I2C_CMD_REG) &
 		     ASPEED_I2CD_BUS_BUSY_STS))
 			aspeed_i2c_recover_bus(bus);
+		else
+			aspeed_i2c_reset(bus);

 		/*
 		 * If timed out and the state is still pending, drop the pending
-- 
2.25.1
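
The decision the patch adds on the timeout path can be written as a pure function. This is a sketch with hypothetical names (`toy_timeout_action`, `toy_on_xfer_timeout`); in the driver the two inputs come from `bus->multi_master` and the `ASPEED_I2CD_BUS_BUSY_STS` bit of the command register.

```c
#include <stdbool.h>

/* What the timeout path does in this model; names are illustrative only. */
enum toy_timeout_action {
	TOY_RECOVER_BUS,	/* corresponds to aspeed_i2c_recover_bus() */
	TOY_RESET_CONTROLLER,	/* corresponds to aspeed_i2c_reset() */
};

/* Bus recovery only for a busy multi-master bus; otherwise reset. */
static enum toy_timeout_action toy_on_xfer_timeout(bool multi_master,
						   bool bus_busy)
{
	if (multi_master && bus_busy)
		return TOY_RECOVER_BUS;
	return TOY_RESET_CONTROLLER;
}
```

Before the patch, every case other than a busy multi-master bus fell through with no action at all; with the patch those cases reset the controller.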

Hello

by Reggie Hill

Kindly get back to me for a mutual benefit transaction. I will appreciate hearing from you

[PATCH] drm/amd/display: prevent potential division by zero errors

by Hamza Mahfooz

There are two places in apply_below_the_range() where it's possible for
a divide by zero error to occur. So, to fix this make sure the divisor
is non-zero before attempting the computation in both cases.

Cc: stable(a)vger.kernel.org
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2637
Fixes: a463b263032f ("drm/amd/display: Fix frames_to_insert math")
Fixes: ded6119e825a ("drm/amd/display: Reinstate LFC optimization")
Signed-off-by: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
---
 drivers/gpu/drm/amd/display/modules/freesync/freesync.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/modules/freesync/freesync.c b/drivers/gpu/drm/amd/display/modules/freesync/freesync.c
index dbd60811f95d..ef3a67409021 100644
--- a/drivers/gpu/drm/amd/display/modules/freesync/freesync.c
+++ b/drivers/gpu/drm/amd/display/modules/freesync/freesync.c
@@ -338,7 +338,9 @@ static void apply_below_the_range(struct core_freesync *core_freesync,
 		 *  - Delta for CEIL: delta_from_mid_point_in_us_1
 		 *  - Delta for FLOOR: delta_from_mid_point_in_us_2
 		 */
-		if ((last_render_time_in_us / mid_point_frames_ceil) < in_out_vrr->min_duration_in_us) {
+		if (mid_point_frames_ceil &&
+		    (last_render_time_in_us / mid_point_frames_ceil) <
+		    in_out_vrr->min_duration_in_us) {
 			/* Check for out of range.
 			 * If using CEIL produces a value that is out of range,
 			 * then we are forced to use FLOOR.
@@ -385,8 +387,9 @@ static void apply_below_the_range(struct core_freesync *core_freesync,
 		/* Either we've calculated the number of frames to insert,
 		 * or we need to insert min duration frames
 		 */
-		if (last_render_time_in_us / frames_to_insert <
-				in_out_vrr->min_duration_in_us){
+		if (frames_to_insert &&
+		    (last_render_time_in_us / frames_to_insert) <
+		    in_out_vrr->min_duration_in_us){
 			frames_to_insert -= (frames_to_insert > 1) ?
 					1 : 0;
 		}
-- 
2.41.0
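
Both hunks rely on the same idiom: test the divisor on the left of `&&` so that short-circuit evaluation skips the division when it is zero. A minimal standalone sketch, with a hypothetical helper name and plain `uint32_t` parameters standing in for the driver's variables:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the per-frame duration falls below the minimum.
 * If frames is 0 the right-hand side is never evaluated, so no
 * division by zero can occur and the result is simply false.
 */
static bool toy_frame_shorter_than_min(uint32_t last_render_time_in_us,
				       uint32_t frames,
				       uint32_t min_duration_in_us)
{
	return frames != 0 &&
	       (last_render_time_in_us / frames) < min_duration_in_us;
}
```

Note that a zero divisor makes the whole condition false, so the guarded branch is skipped in that case. That matches what the patch does at both call sites.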

[PATCH 1/2] tracefs: Add missing lockdown check to tracefs_create_dir()

by Steven Rostedt

From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>

The function tracefs_create_dir() was missing a lockdown check and was
called by the RV code. This gave an inconsistent behavior of this function
returning success while other tracefs functions failed. This caused the
inode being freed by the wrong kmem_cache.

Link: https://lore.kernel.org/all/202309050916.58201dc6-oliver.sang@intel.com/

Cc: stable(a)vger.kernel.org
Fixes: bf8e602186ec4 ("tracing: Do not create tracefs files if tracefs lockdown is in effect")
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
 fs/tracefs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index de5b72216b1a..3b8dd938b1c8 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -673,6 +673,9 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent,
  */
 struct dentry *tracefs_create_dir(const char *name, struct dentry *parent)
 {
+	if (security_locked_down(LOCKDOWN_TRACEFS))
+		return NULL;
+
 	return __create_dir(name, parent, &simple_dir_inode_operations);
 }

-- 
2.40.1

[merged mm-hotfixes-stable] revert-memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: revert "memfd: improve userspace warnings for missing exec-related flags".
has been removed from the -mm tree. Its filename was
     revert-memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Andrew Morton <akpm(a)linux-foundation.org>
Subject: revert "memfd: improve userspace warnings for missing exec-related flags".
Date: Sat Sep 2 03:59:31 PM PDT 2023

This warning is telling userspace developers to pass MFD_EXEC and
MFD_NOEXEC_SEAL to memfd_create(). Commit 434ed3350f57 ("memfd: improve
userspace warnings for missing exec-related flags") made the warning more
frequent and visible in the hope that this would accelerate the fixing of
errant userspace.

But the overall effect is to generate far too much dmesg noise.

Fixes: 434ed3350f57 ("memfd: improve userspace warnings for missing exec-related flags")
Reported-by: Damian Tometzki <dtometzki(a)fedoraproject.org>
Closes: https://lkml.kernel.org/r/ZPFzCSIgZ4QuHsSC@fedora.fritz.box
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Daniel Verkamp <dverkamp(a)chromium.org>
Cc: Jeff Xu <jeffxu(a)google.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/memfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memfd.c~revert-memfd-improve-userspace-warnings-for-missing-exec-related-flags
+++ a/mm/memfd.c
@@ -316,7 +316,7 @@ SYSCALL_DEFINE2(memfd_create,
 		return -EINVAL;

 	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
-		pr_info_ratelimited(
+		pr_warn_once(
 			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
 			current->comm, task_pid_nr(current));
 	}
_

Patches currently in -mm which might be from akpm(a)linux-foundation.org are

mm-shmem-fix-race-in-shmem_undo_range-w-thp-fix.patch
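
On the userspace side, what the warning asks for is an explicit MFD_EXEC or MFD_NOEXEC_SEAL in the flags. The sketch below shows one way a program might do that for a data-only memfd. It is illustrative rather than a recommended pattern: `create_data_memfd()` is a hypothetical helper, the fallback for kernels that reject the newer flag with EINVAL is an assumption about how an application might stay compatible, and the `#ifndef` definitions only cover older libc headers that lack the constants.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may not define these; values match the Linux UAPI. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

/*
 * Create a memfd that will never need to be executable.
 * Returns a file descriptor, or -1 with errno set.
 */
static int create_data_memfd(const char *name)
{
	int fd = memfd_create(name, MFD_CLOEXEC | MFD_NOEXEC_SEAL);

	/* Assumed fallback: kernels without the flag report EINVAL. */
	if (fd < 0 && errno == EINVAL)
		fd = memfd_create(name, MFD_CLOEXEC);
	return fd;
}
```

A program that genuinely needs an executable memfd would pass MFD_EXEC instead. Either flag satisfies the `flags & (MFD_EXEC | MFD_NOEXEC_SEAL)` check shown in the diff.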

Re: [syzbot] [mm?] WARNING in try_grab_page

by syzbot

syzbot has found a reproducer for the following issue on:

HEAD commit:    3f86ed6ec0b3 Merge tag 'arc-6.6-rc1' of git://git.kernel.o..
git tree:       upstream
console+strace: https://syzkaller.appspot.com/x/log.txt?x=139ce690680000
kernel config:  https://syzkaller.appspot.com/x/.config?x=ff0db7a15ba54ead
dashboard link: https://syzkaller.appspot.com/bug?extid=9b82859567f2e50c123e
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10b0c620680000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=152da4e7a80000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/6f4f710c5033/disk-3f86ed6e.raw…
vmlinux: https://storage.googleapis.com/syzbot-assets/555548fedbdc/vmlinux-3f86ed6e.…
kernel image: https://storage.googleapis.com/syzbot-assets/c06d7c39bbc0/bzImage-3f86ed6e.…
mounted in repro: https://storage.googleapis.com/syzbot-assets/120cc7b707b8/mount_0.gz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+9b82859567f2e50c123e(a)syzkaller.appspotmail.com

XFS (loop0): Quotacheck needed: Please wait.
XFS (loop0): Quotacheck: Done.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 5030 at mm/gup.c:229 try_grab_page+0x287/0x460
Modules linked in:
CPU: 1 PID: 5030 Comm: syz-executor118 Not tainted 6.5.0-syzkaller-11704-g3f86ed6ec0b3 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
RIP: 0010:try_grab_page+0x287/0x460 mm/gup.c:229
Code: 01 49 8d 7e 60 be 04 00 00 00 e8 54 41 18 00 f0 41 83 46 60 01 42 80 3c 2b 00 0f 85 6a ff ff ff e9 6d ff ff ff e8 b9 55 be ff <0f> 0b bb f4 ff ff ff eb b6 e8 ab 55 be ff 49 ff ce e9 ca fd ff ff
RSP: 0018:ffffc90003a6ee88 EFLAGS: 00010293
RAX: ffffffff81cf4377 RBX: 0000000000000000 RCX: ffff888025da0000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 000000000000000e R08: ffffffff81cf418c R09: 1ffffd400039097e
R10: dffffc0000000000 R11: fffff9400039097f R12: ffffea0001c84bf4
R13: dffffc0000000000 R14: ffffea0001c84bc0 R15: ffffea0001c84bc0
FS:  0000555555acb380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 00000000736e9000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 follow_page_pte+0x560/0x18f0 mm/gup.c:651
 follow_pud_mask mm/gup.c:765 [inline]
 follow_p4d_mask mm/gup.c:782 [inline]
 follow_page_mask+0x7dc/0xe20 mm/gup.c:832
 __get_user_pages+0x643/0x15e0 mm/gup.c:1237
 __get_user_pages_locked mm/gup.c:1504 [inline]
 get_dump_page+0x146/0x2b0 mm/gup.c:2018
 dump_user_range+0x126/0x910 fs/coredump.c:913
 elf_core_dump+0x3b75/0x4490 fs/binfmt_elf.c:2142
 do_coredump+0x1b73/0x2ab0 fs/coredump.c:764
 get_signal+0x145e/0x1840 kernel/signal.c:2878
 arch_do_signal_or_restart+0x96/0x860 arch/x86/kernel/signal.c:309
 exit_to_user_mode_loop+0x6a/0x100 kernel/entry/common.c:168
 exit_to_user_mode_prepare+0xb1/0x140 kernel/entry/common.c:204
 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
 syscall_exit_to_user_mode+0x64/0x280 kernel/entry/common.c:296
 do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fb68edcf0f9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc8b18d558 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffe5 RBX: 0000000000000003 RCX: 00007fb68edcf0f9
RDX: 0000000000000000 RSI: 0000000100000001 RDI: 0000000000000006
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000555500000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000f4240
R13: 00007ffc8b18d7c8 R14: 0000000000000001 R15: 00007ffc8b18d590
 </TASK>

---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

[merged mm-hotfixes-stable] rcu-dump-vmalloc-memory-info-safely.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: rcu: dump vmalloc memory info safely
has been removed from the -mm tree. Its filename was
     rcu-dump-vmalloc-memory-info-safely.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Zqiang <qiang.zhang1211(a)gmail.com>
Subject: rcu: dump vmalloc memory info safely
Date: Mon, 4 Sep 2023 18:08:05 +0000

Currently, when call_rcu() is invoked twice on the same object, the
rcu_head object's memory info is dumped. If the object was not allocated
from the slab allocator, vmalloc_dump_obj() is invoked and the
vmap_area_lock spinlock needs to be held. Since call_rcu() can be invoked
in interrupt context, there is a possibility of a spinlock deadlock.

On a Preempt-RT kernel, the rcutorture test also triggers the following
lockdep warning:

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 1
3 locks held by swapper/0/1:
 #0: ffffffffb534ee80 (fullstop_mutex){+.+.}-{4:4}, at: torture_init_begin+0x24/0xa0
 #1: ffffffffb5307940 (rcu_read_lock){....}-{1:3}, at: rcu_torture_init+0x1ec7/0x2370
 #2: ffffffffb536af40 (vmap_area_lock){+.+.}-{3:3}, at: find_vmap_area+0x1f/0x70
irq event stamp: 565512
hardirqs last enabled at (565511): [<ffffffffb379b138>] __call_rcu_common+0x218/0x940
hardirqs last disabled at (565512): [<ffffffffb5804262>] rcu_torture_init+0x20b2/0x2370
softirqs last enabled at (399112): [<ffffffffb36b2586>] __local_bh_enable_ip+0x126/0x170
softirqs last disabled at (399106): [<ffffffffb43fef59>] inet_register_protosw+0x9/0x1d0
Preemption disabled at:
[<ffffffffb58040c3>] rcu_torture_init+0x1f13/0x2370
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.5.0-rc4-rt2-yocto-preempt-rt+ #15
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xb0
 dump_stack+0x14/0x20
 __might_resched+0x1aa/0x280
 ? __pfx_rcu_torture_err_cb+0x10/0x10
 rt_spin_lock+0x53/0x130
 ? find_vmap_area+0x1f/0x70
 find_vmap_area+0x1f/0x70
 vmalloc_dump_obj+0x20/0x60
 mem_dump_obj+0x22/0x90
 __call_rcu_common+0x5bf/0x940
 ? debug_smp_processor_id+0x1b/0x30
 call_rcu_hurry+0x14/0x20
 rcu_torture_init+0x1f82/0x2370
 ? __pfx_rcu_torture_leak_cb+0x10/0x10
 ? __pfx_rcu_torture_leak_cb+0x10/0x10
 ? __pfx_rcu_torture_init+0x10/0x10
 do_one_initcall+0x6c/0x300
 ? debug_smp_processor_id+0x1b/0x30
 kernel_init_freeable+0x2b9/0x540
 ? __pfx_kernel_init+0x10/0x10
 kernel_init+0x1f/0x150
 ret_from_fork+0x40/0x50
 ? __pfx_kernel_init+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

The previous patch fixes this by using the deadlock-safe best-effort
version of find_vm_area. However, in case of failure print the fact that
the pointer was a vmalloc pointer so that we print at least something.

Link: https://lkml.kernel.org/r/20230904180806.1002832-2-joel@joelfernandes.org
Fixes: 98f180837a89 ("mm: Make mem_dump_obj() handle vmalloc() memory")
Signed-off-by: Zqiang <qiang.zhang1211(a)gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Reported-by: Zhen Lei <thunder.leizhen(a)huaweicloud.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Paul E. McKenney <paulmck(a)kernel.org>
Cc: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/util.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/util.c~rcu-dump-vmalloc-memory-info-safely
+++ a/mm/util.c
@@ -1068,7 +1068,9 @@ void mem_dump_obj(void *object)
 	if (vmalloc_dump_obj(object))
 		return;

-	if (virt_addr_valid(object))
+	if (is_vmalloc_addr(object))
+		type = "vmalloc memory";
+	else if (virt_addr_valid(object))
 		type = "non-slab/vmalloc memory";
 	else if (object == NULL)
 		type = "NULL pointer";
_

Patches currently in -mm which might be from qiang.zhang1211(a)gmail.com are

[merged mm-hotfixes-stable] memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: memcontrol: ensure memcg acquired by id is properly set up
has been removed from the -mm tree. Its filename was
     memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: memcontrol: ensure memcg acquired by id is properly set up
Date: Wed, 23 Aug 2023 15:54:30 -0700

In the eviction recency check, we attempt to retrieve the memcg to which
the folio belonged when it was evicted, by the memcg id stored in the
shadow entry. However, there is a chance that the retrieved memcg is not
the original memcg that has been killed, but a new one which happens to
have the same id.

This is a somewhat unfortunate, but acceptable and rare inaccuracy in the
heuristics. However, if we retrieve this new memcg between its allocation
and when it is properly attached to the memcg hierarchy, we could run into
the following NULL pointer exception during the memcg hierarchy traversal
done in mem_cgroup_get_nr_swap_pages():

[ 155757.793456] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[ 155757.807568] #PF: supervisor read access in kernel mode
[ 155757.818024] #PF: error_code(0x0000) - not-present page
[ 155757.828482] PGD 401f77067 P4D 401f77067 PUD 401f76067 PMD 0
[ 155757.839985] Oops: 0000 [#1] SMP
[ 155757.887870] RIP: 0010:mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155757.899377] Code: 29 19 4a 02 48 39 f9 74 63 48 8b 97 c0 00 00 00 48 8b b7 58 02 00 00 48 2b b7 c0 01 00 00 48 39 f0 48 0f 4d c6 48 39 d1 74 42 <48> 8b b2 c0 00 00 00 48 8b ba 58 02 00 00 48 2b ba c0 01 00 00 48
[ 155757.937125] RSP: 0018:ffffc9002ecdfbc8 EFLAGS: 00010286
[ 155757.947755] RAX: 00000000003a3b1c RBX: 000007ffffffffff RCX: ffff888280183000
[ 155757.962202] RDX: 0000000000000000 RSI: 0007ffffffffffff RDI: ffff888bbc2d1000
[ 155757.976648] RBP: 0000000000000001 R08: 000000000000000b R09: ffff888ad9cedba0
[ 155757.991094] R10: ffffea0039c07900 R11: 0000000000000010 R12: ffff888b23a7b000
[ 155758.005540] R13: 0000000000000000 R14: ffff888bbc2d1000 R15: 000007ffffc71354
[ 155758.019991] FS: 00007f6234c68640(0000) GS:ffff88903f9c0000(0000) knlGS:0000000000000000
[ 155758.036356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 155758.048023] CR2: 00000000000000c0 CR3: 0000000a83eb8004 CR4: 00000000007706e0
[ 155758.062473] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 155758.076924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 155758.091376] PKRU: 55555554
[ 155758.096957] Call Trace:
[ 155758.102016] <TASK>
[ 155758.106502] ? __die+0x78/0xc0
[ 155758.112793] ? page_fault_oops+0x286/0x380
[ 155758.121175] ? exc_page_fault+0x5d/0x110
[ 155758.129209] ? asm_exc_page_fault+0x22/0x30
[ 155758.137763] ? mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155758.148060] workingset_test_recent+0xda/0x1b0
[ 155758.157133] workingset_refault+0xca/0x1e0
[ 155758.165508] filemap_add_folio+0x4d/0x70
[ 155758.173538] page_cache_ra_unbounded+0xed/0x190
[ 155758.182919] page_cache_sync_ra+0xd6/0x1e0
[ 155758.191738] filemap_read+0x68d/0xdf0
[ 155758.199495] ? mlx5e_napi_poll+0x123/0x940
[ 155758.207981] ? __napi_schedule+0x55/0x90
[ 155758.216095] __x64_sys_pread64+0x1d6/0x2c0
[ 155758.224601] do_syscall_64+0x3d/0x80
[ 155758.232058] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 155758.242473] RIP: 0033:0x7f62c29153b5
[ 155758.249938] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 b4 e6 f7 ff 41 89 c0 4c 8b 55 e0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 e7 e6 f7 ff 48 8b
[ 155758.288005] RSP: 002b:00007f6234c5ffd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[ 155758.303474] RAX: ffffffffffffffda RBX: 00007f628c4e70c0 RCX: 00007f62c29153b5
[ 155758.318075] RDX: 000000000003c041 RSI: 00007f61d2986000 RDI: 0000000000000076
[ 155758.332678] RBP: 00007f6234c5fff0 R08: 0000000000000000 R09: 0000000064d5230c
[ 155758.347452] R10: 000000000027d450 R11: 0000000000000293 R12: 000000000003c041
[ 155758.362044] R13: 00007f61d2986000 R14: 00007f629e11b060 R15: 000000000027d450
[ 155758.376661] </TASK>

This patch fixes the issue by moving the memcg's id publication from the
alloc stage to online stage, ensuring that any memcg acquired via id must
be connected to the memcg tree.

Link: https://lkml.kernel.org/r/20230823225430.166925-1-nphamcs@gmail.com
Fixes: f78dfc7b77d5 ("workingset: fix confusion around eviction vs refault container")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Co-developed-by: Nhat Pham <nphamcs(a)gmail.com>
Signed-off-by: Nhat Pham <nphamcs(a)gmail.com>
Acked-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Yosry Ahmed <yosryahmed(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/memcontrol.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

--- a/mm/memcontrol.c~memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up
+++ a/mm/memcontrol.c
@@ -5326,7 +5326,6 @@ static struct mem_cgroup *mem_cgroup_all
 	INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
 	memcg->deferred_split_queue.split_queue_len = 0;
 #endif
-	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	lru_gen_init_memcg(memcg);
 	return memcg;
 fail:
@@ -5398,14 +5397,27 @@ static int mem_cgroup_css_online(struct
 	if (alloc_shrinker_info(memcg))
 		goto offline_kmem;

-	/* Online state pins memcg ID, memcg ID pins CSS */
-	refcount_set(&memcg->id.ref, 1);
-	css_get(css);
-
 	if (unlikely(mem_cgroup_is_root(memcg)))
 		queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
 				   FLUSH_TIME);
 	lru_gen_online_memcg(memcg);
+
+	/* Online state pins memcg ID, memcg ID pins CSS */
+	refcount_set(&memcg->id.ref, 1);
+	css_get(css);
+
+	/*
+	 * Ensure mem_cgroup_from_id() works once we're fully online.
+	 *
+	 * We could do this earlier and require callers to filter with
+	 * css_tryget_online(). But right now there are no users that
+	 * need earlier access, and the workingset code relies on the
+	 * cgroup tree linkage (mem_cgroup_get_nr_swap_pages()). So
+	 * publish it here at the end of onlining. This matches the
+	 * regular ID destruction during offlining.
+	 */
+	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+
 	return 0;
 offline_kmem:
 	memcg_offline_kmem(memcg);
_

Patches currently in -mm which might be from hannes(a)cmpxchg.org are
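
The fix is an instance of a general pattern: reserve an ID early, but publish the object under that ID only once it is fully linked. The toy model below sketches the pattern in userspace. All `toy_*` names are hypothetical, and a fixed array stands in for the kernel's IDR, with a NULL slot meaning the ID is reserved but not yet visible to lookups.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ID 8

struct toy_memcg {
	int id;
	struct toy_memcg *parent;	/* hierarchy link, set while onlining */
};

/* Stand-in for mem_cgroup_idr: NULL means "ID reserved but not published". */
static struct toy_memcg *toy_idr[TOY_MAX_ID];

/* Allocation stage: claim the ID, but leave the slot empty. */
static void toy_alloc(struct toy_memcg *memcg, int id)
{
	assert(id >= 0 && id < TOY_MAX_ID);
	memcg->id = id;
	memcg->parent = NULL;
	toy_idr[id] = NULL;
}

/* Online stage: link into the hierarchy first, publish the ID last. */
static void toy_online(struct toy_memcg *memcg, struct toy_memcg *parent)
{
	memcg->parent = parent;
	toy_idr[memcg->id] = memcg;
}

/* Lookup by ID; never returns a half-initialised object in this model. */
static struct toy_memcg *toy_from_id(int id)
{
	if (id < 0 || id >= TOY_MAX_ID)
		return NULL;
	return toy_idr[id];
}
```

With this ordering, a lookup that races with creation either misses the object entirely or sees it with its parent link in place, which is the property the hierarchy walk in mem_cgroup_get_nr_swap_pages() depends on. The model is single-threaded and ignores the memory-ordering and locking the kernel needs to make the publication safe.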
