If the file has preallocated blocks and was fsync'ed, we should not truncate
them during roll-forward recovery; recovery will restore i_size correctly anyway.
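For reference, a minimal userspace sketch of the scenario being protected
(the mount point, size, and the use of FALLOC_FL_KEEP_SIZE are illustrative
assumptions, not taken from this patch):

  #define _GNU_SOURCE
  #include <fcntl.h>              /* open(), fallocate(), FALLOC_FL_KEEP_SIZE */
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/mnt/f2fs/prealloc.dat", O_CREAT | O_RDWR, 0644);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* Preallocate 1MiB beyond i_size without changing the file size. */
          if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20))
                  perror("fallocate");
          /* Persist the preallocation. After a power cut at this point,
           * roll-forward recovery must keep these blocks while restoring
           * i_size on the next mount. */
          if (fsync(fd))
                  perror("fsync");
          close(fd);
          return 0;
  }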
Fixes: d4dd19ec1ea0 ("f2fs: do not expose unwritten blocks to user by DIO")
Cc: <stable(a)vger.kernel.org> # 5.17+
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
---
fs/f2fs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 71f232dcf3c2..83639238a1fe 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -550,7 +550,8 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
}
f2fs_set_inode_flags(inode);
- if (file_should_truncate(inode)) {
+ if (file_should_truncate(inode) &&
+ !is_sbi_flag_set(sbi, SBI_POR_DOING)) {
ret = f2fs_truncate(inode);
if (ret)
goto bad_inode;
--
2.36.0.rc2.479.g8af0fa9b8e-goog
From: Alistair Popple <apopple(a)nvidia.com>
Subject: mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
In some cases it is possible for mmu_interval_notifier_remove() to race
with mn_tree_inv_end() allowing it to return while the notifier data
structure is still in use. Consider the following sequence:
CPU0 - mn_tree_inv_end()            CPU1 - mmu_interval_notifier_remove()
----------------------------------- ------------------------------------
                                    spin_lock(subscriptions->lock);
                                    seq = subscriptions->invalidate_seq;
spin_lock(subscriptions->lock);     spin_unlock(subscriptions->lock);
subscriptions->invalidate_seq++;
                                    wait_event(invalidate_seq != seq);
                                    return;
                                    interval_tree_remove(interval_sub);
                                    kfree(interval_sub);
spin_unlock(subscriptions->lock);
wake_up_all();
As the wait_event() condition is true, it will return immediately. This
can lead to use-after-free errors if the caller frees the data structure
containing the interval notifier subscription while it is still on a
deferred list. Fix this by taking the appropriate lock when reading
invalidate_seq, to ensure proper synchronisation.
I observed this whilst running stress tests during development. You do
have to be pretty unlucky, but it leads to the usual use-after-free
problems (memory corruption, kernel crash, difficult-to-diagnose
WARN_ONs, etc).
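The essence of the fix, as a userspace analog (a pthread mutex stands in
for the kernel spinlock and wait_event(); names are illustrative, not the
kernel API): reading the sequence under the same lock the updater holds
while it is still touching the deferred list means a "released"
observation really means released.

  #include <pthread.h>
  #include <stdbool.h>

  struct subs {
          pthread_mutex_t lock;
          unsigned long invalidate_seq;
  };

  /* Wait condition: can only become true after the updater has dropped
   * the lock, i.e. after it has finished with the subscription. An
   * unlocked READ_ONCE() could observe the incremented sequence while
   * the updater is still inside its critical section. */
  static bool seq_released(struct subs *s, unsigned long seq)
  {
          bool ret;

          pthread_mutex_lock(&s->lock);
          ret = s->invalidate_seq != seq;
          pthread_mutex_unlock(&s->lock);
          return ret;
  }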
Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com
Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier")
Signed-off-by: Alistair Popple <apopple(a)nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Christian König <christian.koenig(a)amd.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmu_notifier.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
--- a/mm/mmu_notifier.c~mm-mmu_notifierc-fix-race-in-mmu_interval_notifier_remove
+++ a/mm/mmu_notifier.c
@@ -1036,6 +1036,18 @@ int mmu_interval_notifier_insert_locked(
}
EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
+static bool
+mmu_interval_seq_released(struct mmu_notifier_subscriptions *subscriptions,
+ unsigned long seq)
+{
+ bool ret;
+
+ spin_lock(&subscriptions->lock);
+ ret = subscriptions->invalidate_seq != seq;
+ spin_unlock(&subscriptions->lock);
+ return ret;
+}
+
/**
* mmu_interval_notifier_remove - Remove a interval notifier
* @interval_sub: Interval subscription to unregister
@@ -1083,7 +1095,7 @@ void mmu_interval_notifier_remove(struct
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
if (seq)
wait_event(subscriptions->wq,
- READ_ONCE(subscriptions->invalidate_seq) != seq);
+ mmu_interval_seq_released(subscriptions, seq));
/* pairs with mmgrab in mmu_interval_notifier_insert() */
mmdrop(mm);
_
From: Nico Pache <npache(a)redhat.com>
Subject: oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1], which can
be targeted by the oom reaper. This mapping is used to store the futex
robust list head; the kernel does not keep a copy of the robust list and
instead references a userspace address to maintain robustness during
process death. A race can occur between exit_mm and the oom reaper that
allows the oom reaper to free the memory of the futex robust list before
the exit path has handled the futex death:
CPU1                                    CPU2
------------------------------------------------------------------------
page_fault
do_exit "signal"
wake_oom_reaper
                                        oom_reaper
                                        oom_reap_task_mm (invalidates mm)
exit_mm
exit_mm_release
  futex_exit_release
    futex_cleanup
      exit_robust_list
        get_user (EFAULT- can't access memory)
If the get_user() call returns -EFAULT, the kernel will be unable to
recover the waiters on the robust_list, leaving userspace mutexes hung
indefinitely.
Delay the OOM reaper, allowing more time for the exit path to perform the
futex cleanup.
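For background, the list head the exit path dereferences is ordinary user
memory registered with set_robust_list(2); a hedged sketch (the static
placement is illustrative - glibc actually embeds the head in the pthread
struct on the thread's stack mapping):

  #define _GNU_SOURCE
  #include <linux/futex.h>        /* struct robust_list_head */
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static struct robust_list_head head = {
          .list = { .next = &head.list }, /* empty circular list */
          .futex_offset = 0,
          .list_op_pending = NULL,
  };

  int main(void)
  {
          /* The kernel stores only this pointer; it dereferences it at
           * process death, so the backing page must still be mapped. */
          if (syscall(SYS_set_robust_list, &head, sizeof(head)))
                  perror("set_robust_list");
          printf("robust list head at %p (userspace memory)\n",
                 (void *)&head);
          return 0;
  }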
Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
Based on a patch by Michal Hocko.
[1] https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <jsavitz(a)redhat.com>
Signed-off-by: Nico Pache <npache(a)redhat.com>
Co-developed-by: Joel Savitz <jsavitz(a)redhat.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Waiman Long <longman(a)redhat.com>
Cc: Herton R. Krzesinski <herton(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Joel Savitz <jsavitz(a)redhat.com>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/sched.h | 1
mm/oom_kill.c | 54 +++++++++++++++++++++++++++++-----------
2 files changed, 41 insertions(+), 14 deletions(-)
--- a/include/linux/sched.h~oom_killc-futex-delay-the-oom-reaper-to-allow-time-for-proper-futex-cleanup
+++ a/include/linux/sched.h
@@ -1443,6 +1443,7 @@ struct task_struct {
int pagefault_disabled;
#ifdef CONFIG_MMU
struct task_struct *oom_reaper_list;
+ struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
--- a/mm/oom_kill.c~oom_killc-futex-delay-the-oom-reaper-to-allow-time-for-proper-futex-cleanup
+++ a/mm/oom_kill.c
@@ -632,7 +632,7 @@ done:
*/
set_bit(MMF_OOM_SKIP, &mm->flags);
- /* Drop a reference taken by wake_oom_reaper */
+ /* Drop a reference taken by queue_oom_reaper */
put_task_struct(tsk);
}
@@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
struct task_struct *tsk = NULL;
wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
- spin_lock(&oom_reaper_lock);
+ spin_lock_irq(&oom_reaper_lock);
if (oom_reaper_list != NULL) {
tsk = oom_reaper_list;
oom_reaper_list = tsk->oom_reaper_list;
}
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irq(&oom_reaper_lock);
if (tsk)
oom_reap_task(tsk);
@@ -658,22 +658,48 @@ static int oom_reaper(void *unused)
return 0;
}
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct timer_list *timer)
{
- /* mm is already queued? */
- if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+ struct task_struct *tsk = container_of(timer, struct task_struct,
+ oom_reaper_timer);
+ struct mm_struct *mm = tsk->signal->oom_mm;
+ unsigned long flags;
+
+ /* The victim managed to terminate on its own - see exit_mmap */
+ if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ put_task_struct(tsk);
return;
+ }
- get_task_struct(tsk);
-
- spin_lock(&oom_reaper_lock);
+ spin_lock_irqsave(&oom_reaper_lock, flags);
tsk->oom_reaper_list = oom_reaper_list;
oom_reaper_list = tsk;
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irqrestore(&oom_reaper_lock, flags);
trace_wake_reaper(tsk->pid);
wake_up(&oom_reaper_wait);
}
+/*
+ * Give the OOM victim time to exit naturally before invoking the oom_reaping.
+ * The timer's timeout is arbitrary... the longer it is, the longer the worst
+ * case scenario for the OOM can take. If it is too small, the oom_reaper can
+ * get in the way and release resources needed by the process exit path.
+ * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
+ * before the exit path is able to wake the futex waiters.
+ */
+#define OOM_REAPER_DELAY (2*HZ)
+static void queue_oom_reaper(struct task_struct *tsk)
+{
+ /* mm is already queued? */
+ if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+ return;
+
+ get_task_struct(tsk);
+ timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
+ tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+ add_timer(&tsk->oom_reaper_timer);
+}
+
static int __init oom_init(void)
{
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -681,7 +707,7 @@ static int __init oom_init(void)
}
subsys_initcall(oom_init)
#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk)
{
}
#endif /* CONFIG_MMU */
@@ -932,7 +958,7 @@ static void __oom_kill_process(struct ta
rcu_read_unlock();
if (can_oom_reap)
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
mmdrop(mm);
put_task_struct(victim);
@@ -968,7 +994,7 @@ static void oom_kill_process(struct oom_
task_lock(victim);
if (task_will_free_mem(victim)) {
mark_oom_victim(victim);
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
task_unlock(victim);
put_task_struct(victim);
return;
@@ -1067,7 +1093,7 @@ bool out_of_memory(struct oom_control *o
*/
if (task_will_free_mem(current)) {
mark_oom_victim(current);
- wake_oom_reaper(current);
+ queue_oom_reaper(current);
return true;
}
_
From: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Subject: mm, hugetlb: allow for "high" userspace addresses
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are optionally
supported on the system and have to be requested via a hint mechanism
("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of the hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area(). To allow that, move those two macros out of
mm/mmap.c into include/linux/sched/mm.h.
If these macros are not defined in architectural code then they default to
(TASK_SIZE) and (base), so they should not introduce any behavioural
changes to architectures that do not define them.
For the time being, only ARM64 is affected by this change.
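As a usage sketch: on an arm64 kernel with 52-bit VA support, user space
opts in by passing a hint above the default 48-bit limit, and after this
fix a hugetlb mapping honours the same convention as a normal mmap() (the
hint value and the MAP_HUGETLB usage are illustrative assumptions):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          void *hint = (void *)(1UL << 50);  /* above the 48-bit default */
          void *p = mmap(hint, 2UL << 20, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

          if (p == MAP_FAILED)
                  perror("mmap");
          else
                  printf("hugetlb mapping at %p\n", p);
          return 0;
  }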
Catalin (ARM64) said:
: We should have fixed hugetlb_get_unmapped_area() as well when we added
: support for 52-bit VA. The reason for commit f6795053dac8 was to prevent
: normal mmap() from returning addresses above 48-bit by default as some
: user-space had hard assumptions about this.
:
: It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
: but I doubt anyone would notice. It's more likely that the current
: behaviour would cause issues, so I'd rather have them consistent.
:
: Basically when arm64 gained support for 52-bit addresses we did not
: want user-space calling mmap() to suddenly get such high addresses,
: otherwise we could have inadvertently broken some programs (similar
: behaviour to x86 here). Hence we added commit f6795053dac8. But we
: missed hugetlbfs which could still get such high mmap() addresses. So
: in theory that's a potential regression that should have been addressed
: at the same time as commit f6795053dac8 (and before arm64 enabled
: 52-bit addresses).
Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.16500337…
Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses")
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Steve Capper <steve.capper(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: <stable(a)vger.kernel.org> [5.0.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 9 +++++----
include/linux/sched/mm.h | 8 ++++++++
mm/mmap.c | 8 --------
3 files changed, 13 insertions(+), 12 deletions(-)
--- a/fs/hugetlbfs/inode.c~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/fs/hugetlbfs/inode.c
@@ -206,7 +206,7 @@ hugetlb_get_unmapped_area_bottomup(struc
info.flags = 0;
info.length = len;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
@@ -222,7 +222,7 @@ hugetlb_get_unmapped_area_topdown(struct
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
- info.high_limit = current->mm->mmap_base;
+ info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -237,7 +237,7 @@ hugetlb_get_unmapped_area_topdown(struct
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
addr = vm_unmapped_area(&info);
}
@@ -251,6 +251,7 @@ hugetlb_get_unmapped_area(struct file *f
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct hstate *h = hstate_file(file);
+ const unsigned long mmap_end = arch_get_mmap_end(addr);
if (len & ~huge_page_mask(h))
return -EINVAL;
@@ -266,7 +267,7 @@ hugetlb_get_unmapped_area(struct file *f
if (addr) {
addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
+ if (mmap_end - len >= addr &&
(!vma || addr + len <= vm_start_gap(vma)))
return addr;
}
--- a/include/linux/sched/mm.h~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/include/linux/sched/mm.h
@@ -136,6 +136,14 @@ static inline void mm_update_next_owner(
#endif /* CONFIG_MEMCG */
#ifdef CONFIG_MMU
+#ifndef arch_get_mmap_end
+#define arch_get_mmap_end(addr) (TASK_SIZE)
+#endif
+
+#ifndef arch_get_mmap_base
+#define arch_get_mmap_base(addr, base) (base)
+#endif
+
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
extern unsigned long
--- a/mm/mmap.c~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/mm/mmap.c
@@ -2117,14 +2117,6 @@ unsigned long vm_unmapped_area(struct vm
return addr;
}
-#ifndef arch_get_mmap_end
-#define arch_get_mmap_end(addr) (TASK_SIZE)
-#endif
-
-#ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, base) (base)
-#endif
-
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
_
From: Shakeel Butt <shakeelb(a)google.com>
Subject: memcg: sync flush only if periodic flush is delayed
Daniel Dao has reported [1] a regression on workloads that may trigger a
lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems there are workloads which genuinely
do more stat updates than the batch value within a short amount of time.
Since the rstat flush can happen in performance-critical codepaths like
page faults, such workloads can suffer greatly.
This patch fixes this regression by making the rstat flushing conditional
in the performance-critical codepaths. More specifically, the kernel
relies on the async periodic rstat flusher to flush the stats, and only
if the periodic flusher is delayed by more than twice its normal time
window does the kernel allow rstat flushing from the performance-critical
codepaths.
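A userspace analog of the heuristic may make it concrete (the names and
the monotonic-clock plumbing are illustrative assumptions; the patch
itself uses jiffies_64, time_after64() and a 2*FLUSH_TIME slack):

  #include <stdatomic.h>
  #include <stdint.h>
  #include <time.h>

  #define FLUSH_PERIOD_NS (2ULL * 1000000000)     /* analog of 2UL*HZ */

  static _Atomic uint64_t flush_deadline_ns;

  static uint64_t now_ns(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
  }

  static void do_flush(void)      /* expensive; also run periodically */
  {
          /* Push the deadline out by two periods so the hot path
           * tolerates one missed run of the periodic worker before
           * paying for a synchronous flush itself. */
          atomic_store(&flush_deadline_ns, now_ns() + 2 * FLUSH_PERIOD_NS);
          /* ... aggregate the stats here ... */
  }

  void flush_if_delayed(void)     /* hot path, e.g. refault */
  {
          if (now_ns() > atomic_load(&flush_deadline_ns))
                  do_flush();
  }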
Now the question: what are the side-effects of this change? The worst
that can happen is that the refault codepath will see 4-second-old lruvec
stats and may cause false (or missed) activations of the refaulted page,
which may under- or overestimate the workingset size. That is not very
concerning, though, as the kernel can already miss or do false
activations.
There are two more codepaths whose flushing behavior is not changed by
this patch, and we may need to come back to them in the future. One is
the writeback stats used by dirty throttling and the second is the
deactivation heuristic in reclaim. For now we are keeping an eye on them;
if there are reports of regression due to these codepaths, we will
reevaluate then.
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndg… [1]
Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Reported-by: Daniel Dao <dqminh(a)cloudflare.com>
Tested-by: Ivan Babrou <ivan(a)cloudflare.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Koutný <mkoutny(a)suse.com>
Cc: Frank Hofmann <fhofmann(a)cloudflare.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 12 +++++++++++-
mm/workingset.c | 2 +-
3 files changed, 17 insertions(+), 2 deletions(-)
--- a/include/linux/memcontrol.h~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/include/linux/memcontrol.h
@@ -1012,6 +1012,7 @@ static inline unsigned long lruvec_page_
}
void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_delayed(void);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
@@ -1455,6 +1456,10 @@ static inline void mem_cgroup_flush_stat
{
}
+static inline void mem_cgroup_flush_stats_delayed(void)
+{
+}
+
static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
--- a/mm/memcontrol.c~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/mm/memcontrol.c
@@ -587,6 +587,9 @@ static DECLARE_DEFERRABLE_WORK(stats_flu
static DEFINE_SPINLOCK(stats_flush_lock);
static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+static u64 flush_next_time;
+
+#define FLUSH_TIME (2UL*HZ)
/*
* Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
@@ -637,6 +640,7 @@ static void __mem_cgroup_flush_stats(voi
if (!spin_trylock_irqsave(&stats_flush_lock, flag))
return;
+ flush_next_time = jiffies_64 + 2*FLUSH_TIME;
cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_threshold, 0);
spin_unlock_irqrestore(&stats_flush_lock, flag);
@@ -648,10 +652,16 @@ void mem_cgroup_flush_stats(void)
__mem_cgroup_flush_stats();
}
+void mem_cgroup_flush_stats_delayed(void)
+{
+ if (time_after64(jiffies_64, flush_next_time))
+ mem_cgroup_flush_stats();
+}
+
static void flush_memcg_stats_dwork(struct work_struct *w)
{
__mem_cgroup_flush_stats();
- queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+ queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}
/**
--- a/mm/workingset.c~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/mm/workingset.c
@@ -355,7 +355,7 @@ void workingset_refault(struct folio *fo
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_delayed();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
_
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Subject: mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion which can cause the PageHWPoison flag to be set on the
wrong page. One simple result is that the wrong process can be killed,
but another (more serious) one is that the actual error is left
unhandled, so nothing prevents later accesses to it, which might lead to
more serious results like consuming corrupted data.
Think about the below race window:
CPU 1                                   CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
                                        hugetlb page might be freed to
                                        buddy, or even changed to another
                                        compound page.

get_hwpoison_page -- page is not what we want now...
The current code does rough prechecks first and then reconfirms after
taking a refcount, but this has been found to make the code overly
complicated, so move the prechecks into a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so this should not be a big
problem.
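A condensed, hedged sketch of the caller-side protocol (kernel-style
pseudocode, not compilable standalone; the return codes are those
documented in the patch below, and the retry policy mirrors
try_memory_failure_hugetlb()):

  /* get_huge_page_for_hwpoison() does check + pin + TestSetPageHWPoison
   * inside one hugetlb_lock section, so the page cannot be freed to
   * buddy or demoted between the precheck and the refcount. */
  static int handle_hugetlb_poison(unsigned long pfn, int flags)
  {
          bool retry = true;
          int res;

  again:
          res = get_huge_page_for_hwpoison(pfn, flags);
          if (res == 2)           /* not a hugepage: normal handling */
                  return 0;
          if (res == -EHWPOISON)  /* already poisoned elsewhere */
                  return res;
          if (res == -EBUSY && retry) {   /* transient state: retry once */
                  retry = false;
                  goto again;
          }
          /* res == 0: free hugepage; res == 1: in-use, refcount held */
          return res;
  }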
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reported-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 6 +
include/linux/mm.h | 8 ++
mm/hugetlb.c | 10 ++
mm/memory-failure.c | 145 ++++++++++++++++++++++++++------------
4 files changed, 127 insertions(+), 42 deletions(-)
--- a/include/linux/hugetlb.h~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/include/linux/hugetlb.h
@@ -169,6 +169,7 @@ long hugetlb_unreserve_pages(struct inod
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
@@ -377,6 +378,11 @@ static inline int get_hwpoison_huge_page
{
return 0;
}
+
+static inline int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
static inline void putback_active_hugepage(struct page *page)
{
--- a/include/linux/mm.h~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/include/linux/mm.h
@@ -3197,6 +3197,14 @@ extern int sysctl_memory_failure_recover
extern void shake_page(struct page *p);
extern atomic_long_t num_poisoned_pages __read_mostly;
extern int soft_offline_page(unsigned long pfn, int flags);
+#ifdef CONFIG_MEMORY_FAILURE
+extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
+#else
+static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
+#endif
#ifndef arch_memory_failure
static inline int arch_memory_failure(unsigned long pfn, int flags)
--- a/mm/hugetlb.c~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/mm/hugetlb.c
@@ -6785,6 +6785,16 @@ int get_hwpoison_huge_page(struct page *
return ret;
}
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ int ret;
+
+ spin_lock_irq(&hugetlb_lock);
+ ret = __get_huge_page_for_hwpoison(pfn, flags);
+ spin_unlock_irq(&hugetlb_lock);
+ return ret;
+}
+
void putback_active_hugepage(struct page *page)
{
spin_lock_irq(&hugetlb_lock);
--- a/mm/memory-failure.c~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/mm/memory-failure.c
@@ -1498,50 +1498,113 @@ static int try_to_split_thp_page(struct
return 0;
}
-static int memory_failure_hugetlb(unsigned long pfn, int flags)
+/*
+ * Called from hugetlb code with hugetlb_lock held.
+ *
+ * Return values:
+ * 0 - free hugepage
+ * 1 - in-use hugepage
+ * 2 - not a hugepage
+ * -EBUSY - the hugepage is busy (try to retry)
+ * -EHWPOISON - the hugepage is already hwpoisoned
+ */
+int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct page *head = compound_head(page);
+ int ret = 2; /* fallback to normal page handling */
+ bool count_increased = false;
+
+ if (!PageHeadHuge(head))
+ goto out;
+
+ if (flags & MF_COUNT_INCREASED) {
+ ret = 1;
+ count_increased = true;
+ } else if (HPageFreed(head) || HPageMigratable(head)) {
+ ret = get_page_unless_zero(head);
+ if (ret)
+ count_increased = true;
+ } else {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (TestSetPageHWPoison(head)) {
+ ret = -EHWPOISON;
+ goto out;
+ }
+
+ return ret;
+out:
+ if (count_increased)
+ put_page(head);
+ return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Taking refcount of hugetlb pages needs extra care about race conditions
+ * with basic operations like hugepage allocation/free/demotion.
+ * So some of prechecks for hwpoison (pinning, and testing/setting
+ * PageHWPoison) should be done in single hugetlb_lock range.
+ */
+static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
{
- struct page *p = pfn_to_page(pfn);
- struct page *head = compound_head(p);
int res;
+ struct page *p = pfn_to_page(pfn);
+ struct page *head;
unsigned long page_flags;
+ bool retry = true;
- if (TestSetPageHWPoison(head)) {
- pr_err("Memory failure: %#lx: already hardware poisoned\n",
- pfn);
- res = -EHWPOISON;
- if (flags & MF_ACTION_REQUIRED)
+ *hugetlb = 1;
+retry:
+ res = get_huge_page_for_hwpoison(pfn, flags);
+ if (res == 2) { /* fallback to normal page handling */
+ *hugetlb = 0;
+ return 0;
+ } else if (res == -EHWPOISON) {
+ pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
+ if (flags & MF_ACTION_REQUIRED) {
+ head = compound_head(p);
res = kill_accessing_process(current, page_to_pfn(head), flags);
+ }
return res;
+ } else if (res == -EBUSY) {
+ if (retry) {
+ retry = false;
+ goto retry;
+ }
+ action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+ return res;
+ }
+
+ head = compound_head(p);
+ lock_page(head);
+
+ if (hwpoison_filter(p)) {
+ ClearPageHWPoison(head);
+ res = -EOPNOTSUPP;
+ goto out;
}
num_poisoned_pages_inc();
- if (!(flags & MF_COUNT_INCREASED)) {
- res = get_hwpoison_page(p, flags);
- if (!res) {
- lock_page(head);
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- unlock_page(head);
- return -EOPNOTSUPP;
- }
- unlock_page(head);
- res = MF_FAILED;
- if (__page_handle_poison(p)) {
- page_ref_inc(p);
- res = MF_RECOVERED;
- }
- action_result(pfn, MF_MSG_FREE_HUGE, res);
- return res == MF_RECOVERED ? 0 : -EBUSY;
- } else if (res < 0) {
- action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
- return -EBUSY;
+ /*
+ * Handling free hugepage. The possible race with hugepage allocation
+ * or demotion can be prevented by PageHWPoison flag.
+ */
+ if (res == 0) {
+ unlock_page(head);
+ res = MF_FAILED;
+ if (__page_handle_poison(p)) {
+ page_ref_inc(p);
+ res = MF_RECOVERED;
}
+ action_result(pfn, MF_MSG_FREE_HUGE, res);
+ return res == MF_RECOVERED ? 0 : -EBUSY;
}
- lock_page(head);
-
/*
* The page could have changed compound pages due to race window.
* If this happens just bail out.
@@ -1554,14 +1617,6 @@ static int memory_failure_hugetlb(unsign
page_flags = head->flags;
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- put_page(p);
- res = -EOPNOTSUPP;
- goto out;
- }
-
/*
* TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
* simply disable it. In order to make it work properly, we need
@@ -1588,6 +1643,12 @@ out:
unlock_page(head);
return res;
}
+#else
+static inline int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
+{
+ return 0;
+}
+#endif
static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
struct dev_pagemap *pgmap)
@@ -1712,6 +1773,7 @@ int memory_failure(unsigned long pfn, in
int res = 0;
unsigned long page_flags;
bool retry = true;
+ int hugetlb = 0;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1739,10 +1801,9 @@ int memory_failure(unsigned long pfn, in
}
try_again:
- if (PageHuge(p)) {
- res = memory_failure_hugetlb(pfn, flags);
+ res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
+ if (hugetlb)
goto unlock_mutex;
- }
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
_