From: Wyes Karny <wyes.karny@amd.com>
[ Upstream commit 8bcedb4ce04750e1ccc9a6b6433387f6a9166a56 ]
When the kernel is booted with idle=nomwait, do not use MWAIT as the default idle state.

If the user boots the kernel with idle=nomwait, it is a clear direction not to use MWAIT as the default idle state. However, the current code does not take this into consideration when selecting the default idle state on x86.
Fix it by checking for the idle=nomwait boot option in prefer_mwait_c1_over_halt().
Also update the documentation around idle=nomwait appropriately.
[ dhansen: tweak commit message ]
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lkml.kernel.org/r/fdc2dc2d0a1bc21c2f53d989ea2d2ee3ccbc0dbe.165453838...
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 Documentation/admin-guide/pm/cpuidle.rst | 15 +++++++++------
 arch/x86/kernel/process.c                |  9 ++++++---
 2 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index aec2cd2aaea7..19754beb5a4e 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -612,8 +612,8 @@ the ``menu`` governor to be used on the systems that use the ``ladder``
 governor by default this way, for example.
 
 The other kernel command line parameters controlling CPU idle time management
-described below are only relevant for the *x86* architecture and some of
-them affect Intel processors only.
+described below are only relevant for the *x86* architecture and references
+to ``intel_idle`` affect Intel processors only.
 
 The *x86* architecture support code recognizes three kernel command line
 options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
@@ -635,10 +635,13 @@ idle, so it very well may hurt single-thread computations performance as well
 as energy-efficiency.  Thus using it for performance reasons may not be a good
 idea at all.]
 
-The ``idle=nomwait`` option disables the ``intel_idle`` driver and causes
-``acpi_idle`` to be used (as long as all of the information needed by it is
-there in the system's ACPI tables), but it is not allowed to use the
-``MWAIT`` instruction of the CPUs to ask the hardware to enter idle states.
+The ``idle=nomwait`` option prevents the use of ``MWAIT`` instruction of
+the CPU to enter idle states. When this option is used, the ``acpi_idle``
+driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
+running Intel processors, this option disables the ``intel_idle`` driver
+and forces the use of the ``acpi_idle`` driver instead. Note that in either
+case, ``acpi_idle`` driver will function only if all the information needed
+by it is in the system's ACPI tables.
 
 In addition to the architecture-level kernel command line options affecting
 CPU idle time management, there are parameters affecting individual ``CPUIdle``
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 8d9d72fc27a2..707376453525 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -805,6 +805,10 @@ static void amd_e400_idle(void)
  */
 static int prefer_mwait_c1_over_halt(const struct cpuinfo_x86 *c)
 {
+	/* User has disallowed the use of MWAIT. Fallback to HALT */
+	if (boot_option_idle_override == IDLE_NOMWAIT)
+		return 0;
+
 	if (c->x86_vendor != X86_VENDOR_INTEL)
 		return 0;
 
@@ -913,9 +917,8 @@ static int __init idle_setup(char *str)
 	} else if (!strcmp(str, "nomwait")) {
 		/*
 		 * If the boot option of "idle=nomwait" is added,
-		 * it means that mwait will be disabled for CPU C2/C3
-		 * states. In such case it won't touch the variable
-		 * of boot_option_idle_override.
+		 * it means that mwait will be disabled for CPU C1/C2/C3
+		 * states.
 		 */
 		boot_option_idle_override = IDLE_NOMWAIT;
 	} else
From: Mark Rutland <mark.rutland@arm.com>
[ Upstream commit 4510bffb4d0246cdcc1f14c7367c026b807a862d ]
On most architectures, IRQ flag tracing is disabled in NMI context, and architectures need to define and select TRACE_IRQFLAGS_NMI_SUPPORT in order to enable this.
Commit:
859d069ee1ddd878 ("lockdep: Prepare for NMI IRQ state tracking")
... permitted IRQ flag tracing in NMI context, allowing lockdep to work in NMI context where an architecture had suitable entry logic. At the time, most architectures did not have such entry logic, and this broke lockdep on those architectures. Thus, this was partially disabled in commit:
ed00495333ccc80f ("locking/lockdep: Fix TRACE_IRQFLAGS vs. NMIs")
... with architectures needing to select TRACE_IRQFLAGS_NMI_SUPPORT to enable IRQ flag tracing in NMI context.
Currently TRACE_IRQFLAGS_NMI_SUPPORT is defined under arch/x86/Kconfig.debug. Move it to arch/Kconfig so architectures can select it without having to provide their own definition.
Since the regular TRACE_IRQFLAGS_SUPPORT is selected by arch/x86/Kconfig, the select of TRACE_IRQFLAGS_NMI_SUPPORT is moved there too.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20220511131733.4074499-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/Kconfig           | 3 +++
 arch/x86/Kconfig       | 1 +
 arch/x86/Kconfig.debug | 3 ---
 3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 191589f26b1a..5987363b41c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -200,6 +200,9 @@ config HAVE_NMI
 config TRACE_IRQFLAGS_SUPPORT
 	bool
 
+config TRACE_IRQFLAGS_NMI_SUPPORT
+	bool
+
 #
 # An arch should select this if it provides all these things:
 #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a170cfdae2a7..2db73635c115 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -260,6 +260,7 @@ config X86
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
 	select TRACE_IRQFLAGS_SUPPORT
+	select TRACE_IRQFLAGS_NMI_SUPPORT
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select HAVE_ARCH_KCSAN if X86_64
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index d3a6f74a94bd..d4d6db4dde22 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -1,8 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
 
-config TRACE_IRQFLAGS_NMI_SUPPORT
-	def_bool y
-
 config EARLY_PRINTK_USB
 	bool
From: Ard Biesheuvel <ardb@kernel.org>
[ Upstream commit 2e945851e26836c0f2d34be3763ddf55870e49fe ]
Some early boot code runs before the virtual placement of the kernel is finalized. We used to go back to the very start and recreate the ID map along with the page tables describing the virtual kernel mapping, which involved setting some global variables with the caches off.
In order to ensure that global state created by the KASLR code is not corrupted by the cache invalidation that occurs in that case, we needed to clean those global variables to the PoC explicitly.
This is no longer needed now that the ID map is created only once (and the associated global variable updates are no longer repeated). So drop the cache maintenance that is no longer necessary.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20220624150651.1358849-9-ardb@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/arm64/kernel/kaslr.c | 11 -----------
 1 file changed, 11 deletions(-)
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 418b2bba1521..d5542666182f 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -13,7 +13,6 @@
 #include <linux/pgtable.h>
 #include <linux/random.h>
 
-#include <asm/cacheflush.h>
 #include <asm/fixmap.h>
 #include <asm/kernel-pgtable.h>
 #include <asm/memory.h>
@@ -72,9 +71,6 @@ u64 __init kaslr_early_init(void)
 	 * we end up running with module randomization disabled.
 	 */
 	module_alloc_base = (u64)_etext - MODULES_VSIZE;
-	dcache_clean_inval_poc((unsigned long)&module_alloc_base,
-			       (unsigned long)&module_alloc_base +
-				       sizeof(module_alloc_base));
 
 	/*
 	 * Try to map the FDT early. If this fails, we simply bail,
@@ -174,13 +170,6 @@ u64 __init kaslr_early_init(void)
 	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
 	module_alloc_base &= PAGE_MASK;
 
-	dcache_clean_inval_poc((unsigned long)&module_alloc_base,
-			       (unsigned long)&module_alloc_base +
-				       sizeof(module_alloc_base));
-	dcache_clean_inval_poc((unsigned long)&memstart_offset_seed,
-			       (unsigned long)&memstart_offset_seed +
-				       sizeof(memstart_offset_seed));
-
 	return offset;
 }
From: Ard Biesheuvel <ardb@kernel.org>
[ Upstream commit 1682c45b920643cbde31d8a5b7ca7c2be92d6928 ]
In preparation for changing the way we initialize the permanent ID map, update cpu_replace_ttbr1() so we can use it with the initial ID map as well.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220624150651.1358849-11-ardb@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/arm64/include/asm/mmu_context.h | 13 +++++++++----
 arch/arm64/kernel/cpufeature.c       |  2 +-
 arch/arm64/kernel/suspend.c          |  2 +-
 arch/arm64/mm/kasan_init.c           |  4 ++--
 arch/arm64/mm/mmu.c                  |  2 +-
 5 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index f4ba93d4ffeb..24ed534bb417 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -106,20 +106,25 @@ static inline void cpu_uninstall_idmap(void)
 	cpu_switch_mm(mm->pgd, mm);
 }
 
-static inline void cpu_install_idmap(void)
+static inline void __cpu_install_idmap(pgd_t *idmap)
 {
 	cpu_set_reserved_ttbr0();
 	local_flush_tlb_all();
 	cpu_set_idmap_tcr_t0sz();
 
-	cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
+	cpu_switch_mm(lm_alias(idmap), &init_mm);
+}
+
+static inline void cpu_install_idmap(void)
+{
+	__cpu_install_idmap(idmap_pg_dir);
 }
 
 /*
  * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
  * avoiding the possibility of conflicting TLB entries being allocated.
  */
-static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp)
+static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
 {
 	typedef void (ttbr_replace_func)(phys_addr_t);
 	extern ttbr_replace_func idmap_cpu_replace_ttbr1;
@@ -142,7 +147,7 @@ static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp)
 
 	replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1));
 
-	cpu_install_idmap();
+	__cpu_install_idmap(idmap);
 	replace_phys(ttbr1);
 	cpu_uninstall_idmap();
 }
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index e71c9cfb46e8..e826509823a6 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -3013,7 +3013,7 @@ subsys_initcall_sync(init_32bit_el0_mask);
 
 static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
 {
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 }
 
 /*
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index 19ee7c33769d..40bf1551d1ad 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -52,7 +52,7 @@ void notrace __cpu_suspend_exit(void)
 
 	/* Restore CnP bit in TTBR1_EL1 */
 	if (system_supports_cnp())
-		cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+		cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 
 	/*
 	 * PSTATE was not saved over suspend/resume, re-enable any detected
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 61b52a92b8b6..674863348f67 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -235,7 +235,7 @@ static void __init kasan_init_shadow(void)
 	 */
 	memcpy(tmp_pg_dir, swapper_pg_dir, sizeof(tmp_pg_dir));
 	dsb(ishst);
-	cpu_replace_ttbr1(lm_alias(tmp_pg_dir));
+	cpu_replace_ttbr1(lm_alias(tmp_pg_dir), idmap_pg_dir);
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
@@ -279,7 +279,7 @@ static void __init kasan_init_shadow(void)
 				PAGE_KERNEL_RO));
 
 	memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE);
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 }
 
 static void __init kasan_init_depth(void)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6680689242df..984f1f503328 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -780,7 +780,7 @@ void __init paging_init(void)
 
 	pgd_clear_fixmap();
 
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 	init_mm.pgd = swapper_pg_dir;
 
 	memblock_free(__pa_symbol(init_pg_dir),
From: Ard Biesheuvel <ardb@kernel.org>
[ Upstream commit fc5a89f75d2aad3e566e030675ac420aee49729c ]
The early KASLR init code runs extremely early, and anything that could be deferred until later should be. So let's defer the randomization of the module region until much later - this also simplifies the arithmetic, given that we no longer have to reason about the link time vs load time placement of the core kernel explicitly. Also get rid of the global status variable, and infer the status reported by the diagnostic print from other KASLR related context.
While at it, get rid of the special case for KASAN without KASAN_VMALLOC, which never occurs in practice.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220624150651.1358849-20-ardb@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/arm64/kernel/kaslr.c | 95 +++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 55 deletions(-)
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index d5542666182f..3edee81d8ea7 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -20,14 +20,6 @@
 #include <asm/sections.h>
 #include <asm/setup.h>
 
-enum kaslr_status {
-	KASLR_ENABLED,
-	KASLR_DISABLED_CMDLINE,
-	KASLR_DISABLED_NO_SEED,
-	KASLR_DISABLED_FDT_REMAP,
-};
-
-static enum kaslr_status __initdata kaslr_status;
 u64 __ro_after_init module_alloc_base;
 u16 __initdata memstart_offset_seed;
 
@@ -63,15 +55,9 @@ struct arm64_ftr_override kaslr_feature_override __initdata;
 u64 __init kaslr_early_init(void)
 {
 	void *fdt;
-	u64 seed, offset, mask, module_range;
+	u64 seed, offset, mask;
 	unsigned long raw;
 
-	/*
-	 * Set a reasonable default for module_alloc_base in case
-	 * we end up running with module randomization disabled.
-	 */
-	module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
 	/*
 	 * Try to map the FDT early. If this fails, we simply bail,
 	 * and proceed with KASLR disabled. We will make another
@@ -79,7 +65,6 @@ u64 __init kaslr_early_init(void)
 	 */
 	fdt = get_early_fdt_ptr();
 	if (!fdt) {
-		kaslr_status = KASLR_DISABLED_FDT_REMAP;
 		return 0;
 	}
 
@@ -93,7 +78,6 @@ u64 __init kaslr_early_init(void)
 	 * return 0 if that is the case.
 	 */
 	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
-		kaslr_status = KASLR_DISABLED_CMDLINE;
 		return 0;
 	}
 
@@ -106,7 +90,6 @@ u64 __init kaslr_early_init(void)
 	seed ^= raw;
 
 	if (!seed) {
-		kaslr_status = KASLR_DISABLED_NO_SEED;
 		return 0;
 	}
 
@@ -126,19 +109,43 @@ u64 __init kaslr_early_init(void)
 	/* use the top 16 bits to randomize the linear region */
 	memstart_offset_seed = seed >> 48;
 
-	if (!IS_ENABLED(CONFIG_KASAN_VMALLOC) &&
-	    (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-	     IS_ENABLED(CONFIG_KASAN_SW_TAGS)))
-		/*
-		 * KASAN without KASAN_VMALLOC does not expect the module region
-		 * to intersect the vmalloc region, since shadow memory is
-		 * allocated for each module at load time, whereas the vmalloc
-		 * region is shadowed by KASAN zero pages. So keep modules
-		 * out of the vmalloc region if KASAN is enabled without
-		 * KASAN_VMALLOC, and put the kernel well within 4 GB of the
-		 * module region.
-		 */
-		return offset % SZ_2G;
+	return offset;
+}
+
+static int __init kaslr_init(void)
+{
+	u64 module_range;
+	u32 seed;
+
+	/*
+	 * Set a reasonable default for module_alloc_base in case
+	 * we end up running with module randomization disabled.
+	 */
+	module_alloc_base = (u64)_etext - MODULES_VSIZE;
+
+	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
+		pr_info("KASLR disabled on command line\n");
+		return 0;
+	}
+
+	if (!kaslr_offset()) {
+		pr_warn("KASLR disabled due to lack of seed\n");
+		return 0;
+	}
+
+	pr_info("KASLR enabled\n");
+
+	/*
+	 * KASAN without KASAN_VMALLOC does not expect the module region to
+	 * intersect the vmalloc region, since shadow memory is allocated for
+	 * each module at load time, whereas the vmalloc region will already be
+	 * shadowed by KASAN zero pages.
+	 */
+	BUILD_BUG_ON((IS_ENABLED(CONFIG_KASAN_GENERIC) ||
+		      IS_ENABLED(CONFIG_KASAN_SW_TAGS)) &&
+		     !IS_ENABLED(CONFIG_KASAN_VMALLOC));
+
+	seed = get_random_u32();
 
 	if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
 		/*
@@ -150,8 +157,7 @@ u64 __init kaslr_early_init(void)
 		 * resolved normally.)
 		 */
 		module_range = SZ_2G - (u64)(_end - _stext);
-		module_alloc_base = max((u64)_end + offset - SZ_2G,
-					(u64)MODULES_VADDR);
+		module_alloc_base = max((u64)_end - SZ_2G, (u64)MODULES_VADDR);
 	} else {
 		/*
 		 * Randomize the module region by setting module_alloc_base to
@@ -163,33 +169,12 @@ u64 __init kaslr_early_init(void)
 		 * when ARM64_MODULE_PLTS is enabled.
 		 */
 		module_range = MODULES_VSIZE - (u64)(_etext - _stext);
-		module_alloc_base = (u64)_etext + offset - MODULES_VSIZE;
 	}
 
 	/* use the lower 21 bits to randomize the base of the module region */
 	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
 	module_alloc_base &= PAGE_MASK;
 
-	return offset;
-}
-
-static int __init kaslr_init(void)
-{
-	switch (kaslr_status) {
-	case KASLR_ENABLED:
-		pr_info("KASLR enabled\n");
-		break;
-	case KASLR_DISABLED_CMDLINE:
-		pr_info("KASLR disabled on command line\n");
-		break;
-	case KASLR_DISABLED_NO_SEED:
-		pr_warn("KASLR disabled due to lack of seed\n");
-		break;
-	case KASLR_DISABLED_FDT_REMAP:
-		pr_warn("KASLR disabled due to FDT remapping failure\n");
-		break;
-	}
-
 	return 0;
 }
-core_initcall(kaslr_init)
+subsys_initcall(kaslr_init)
From: Francis Laniel <flaniel@linux.microsoft.com>
[ Upstream commit de6921856f99c11d3986c6702d851e1328d4f7f6 ]
Enable tracing of the execve*() system calls with the syscalls:sys_exit_execve tracepoint by removing the call to forget_syscall() when starting a new thread and preserving the value of regs->syscallno across exec.
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
Link: https://lore.kernel.org/r/20220608162447.666494-2-flaniel@linux.microsoft.co...
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/arm64/include/asm/processor.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 5e73d7f7d1e7..d9bf3d12a2b8 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -204,8 +204,9 @@ void tls_preserve_current_state(void);
 
 static inline void start_thread_common(struct pt_regs *regs, unsigned long pc)
 {
+	s32 previous_syscall = regs->syscallno;
 	memset(regs, 0, sizeof(*regs));
-	forget_syscall(regs);
+	regs->syscallno = previous_syscall;
 	regs->pc = pc;
 
 	if (system_uses_irq_prio_masking())
From: haibinzhang (张海斌) <haibinzhang@tencent.com>
[ Upstream commit af483947d472eccb79e42059276c4deed76f99a6 ]
emulation_proc_handler() changes table->data for proc_dointvec_minmax and can generate the following Oops if called concurrently with itself:
| Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
| Internal error: Oops: 96000006 [#1] SMP
| Call trace:
|  update_insn_emulation_mode+0xc0/0x148
|  emulation_proc_handler+0x64/0xb8
|  proc_sys_call_handler+0x9c/0xf8
|  proc_sys_write+0x18/0x20
|  __vfs_write+0x20/0x48
|  vfs_write+0xe4/0x1d0
|  ksys_write+0x70/0xf8
|  __arm64_sys_write+0x20/0x28
|  el0_svc_common.constprop.0+0x7c/0x1c0
|  el0_svc_handler+0x2c/0xa0
|  el0_svc+0x8/0x200
To fix this issue, keep table->data as &insn->current_mode and use container_of() to retrieve the insn pointer. A mutex is used to protect against concurrent updates of current_mode, but not for retrieving insn_emulation, since table->data no longer changes.
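For reference, the container_of() trick recovers the enclosing structure from the address of one of its members. A condensed user-space sketch of the idea (simplified types and a stand-in macro, not the kernel code):

  #include <stddef.h>
  #include <stdio.h>

  /* simplified stand-in for the kernel's container_of() macro */
  #define container_of(ptr, type, member) \
  	((type *)((char *)(ptr) - offsetof(type, member)))

  struct insn_emulation {
  	const char *name;
  	int current_mode;	/* what table->data now points to */
  };

  int main(void)
  {
  	struct insn_emulation insn = { .name = "swp", .current_mode = 1 };
  	void *data = &insn.current_mode;	/* like table->data */

  	/* recover the insn_emulation from the member pointer */
  	struct insn_emulation *p =
  		container_of(data, struct insn_emulation, current_mode);

  	printf("%s: mode=%d\n", p->name, p->current_mode);
  	return 0;
  }

Because the member pointer never changes once the sysctl table is registered, readers need no locking to recover the insn_emulation; only the mode update itself is serialised.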
Co-developed-by: hewenliang <hewenliang4@huawei.com>
Signed-off-by: hewenliang <hewenliang4@huawei.com>
Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220128090324.2727688-1-hewenliang4@huawei.com
Link: https://lore.kernel.org/r/9A004C03-250B-46C5-BF39-782D7551B00E@tencent.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/arm64/kernel/armv8_deprecated.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/armv8_deprecated.c b/arch/arm64/kernel/armv8_deprecated.c
index 0e86e8b9cedd..c5da9d1e954a 100644
--- a/arch/arm64/kernel/armv8_deprecated.c
+++ b/arch/arm64/kernel/armv8_deprecated.c
@@ -59,6 +59,7 @@ struct insn_emulation {
 static LIST_HEAD(insn_emulation);
 static int nr_insn_emulated __initdata;
 static DEFINE_RAW_SPINLOCK(insn_emulation_lock);
+static DEFINE_MUTEX(insn_emulation_mutex);
 
 static void register_emulation_hooks(struct insn_emulation_ops *ops)
 {
@@ -207,10 +208,10 @@ static int emulation_proc_handler(struct ctl_table *table, int write,
 				  loff_t *ppos)
 {
 	int ret = 0;
-	struct insn_emulation *insn = (struct insn_emulation *) table->data;
+	struct insn_emulation *insn = container_of(table->data, struct insn_emulation, current_mode);
 	enum insn_emulation_mode prev_mode = insn->current_mode;
 
-	table->data = &insn->current_mode;
+	mutex_lock(&insn_emulation_mutex);
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 
 	if (ret || !write || prev_mode == insn->current_mode)
@@ -223,7 +224,7 @@ static int emulation_proc_handler(struct ctl_table *table, int write,
 		update_insn_emulation_mode(insn, INSN_UNDEF);
 	}
 ret:
-	table->data = insn;
+	mutex_unlock(&insn_emulation_mutex);
 	return ret;
 }
 
@@ -247,7 +248,7 @@ static void __init register_insn_emulation_sysctl(void)
 		sysctl->maxlen = sizeof(int);
 
 		sysctl->procname = insn->ops->name;
-		sysctl->data = insn;
+		sysctl->data = &insn->current_mode;
 		sysctl->extra1 = &insn->min;
 		sysctl->extra2 = &insn->max;
 		sysctl->proc_handler = emulation_proc_handler;
From: Catalin Marinas <catalin.marinas@arm.com>
[ Upstream commit ed0a6d1d973e9763989b44913ae1bd2a5d5d5777 ]
__kasan_unpoison_pages() colours the memory with a random tag and stores it in page->flags in order to re-create the tagged pointer via page_to_virt() later. When the tag from the page->flags is read, ensure that the in-memory tags are already visible by re-ordering the page_kasan_tag_set() after kasan_unpoison(). The former already has barriers in place through try_cmpxchg(). On the reader side, the order is ensured by the address dependency between page->flags and the memory access.
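The pattern at work is publish-after-initialize: write the tagged memory first, then publish the tag that readers will use to build the pointer. A user-space analogue using C11 atomics and pthreads (illustrative only; in the kernel the writer-side ordering comes from the barriers in try_cmpxchg() and the reader side is ordered by the address dependency alone, for which acquire/release is the portable equivalent):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  char tagged_mem[2][16];		/* stand-in for the page's memory, indexed by tag */
  _Atomic int page_flags_tag;		/* stand-in for the tag bits in page->flags */

  static void *writer(void *arg)
  {
  	tagged_mem[1][0] = 42;	/* kasan_unpoison(): initialize the tags first */
  	/* page_kasan_tag_set(): publish the tag only afterwards */
  	atomic_store_explicit(&page_flags_tag, 1, memory_order_release);
  	return NULL;
  }

  static void *reader(void *arg)
  {
  	int tag = atomic_load_explicit(&page_flags_tag, memory_order_acquire);
  	/* page_to_virt(): the address is built from the tag just read, so
  	 * this access is ordered after the writer's initialization */
  	printf("%d\n", tagged_mem[tag][0]);
  	return NULL;
  }

  int main(void)
  {
  	pthread_t w, r;

  	pthread_create(&w, NULL, writer, NULL);
  	pthread_create(&r, NULL, reader, NULL);
  	pthread_join(w, NULL);
  	pthread_join(r, NULL);
  	return 0;
  }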
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20220610152141.2148929-2-catalin.marinas@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 mm/kasan/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 2baf121fb8c5..0c36d3df23f3 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -109,9 +109,10 @@ void __kasan_unpoison_pages(struct page *page, unsigned int order, bool init)
 		return;
 
 	tag = kasan_random_tag();
+	kasan_unpoison(set_tag(page_address(page), tag),
+		       PAGE_SIZE << order, init);
 	for (i = 0; i < (1 << order); i++)
 		page_kasan_tag_set(page + i, tag);
-	kasan_unpoison(page_address(page), PAGE_SIZE << order, init);
 }
 
 void __kasan_poison_pages(struct page *page, unsigned int order, bool init)
From: Catalin Marinas <catalin.marinas@arm.com>
[ Upstream commit 20794545c14692094a882d2221c251c4573e6adf ]
This reverts commit e5b8d9218951e59df986f627ec93569a0d22149b.
Pages mapped in user-space with PROT_MTE have the allocation tags either zeroed or copied/restored to some user values. In order for the kernel to access such pages via page_address(), resetting the tag in page->flags was necessary. This tag resetting was deferred to set_pte_at() -> mte_sync_page_tags() but it can race with another CPU reading the flags (via page_to_virt()):
P0 (mte_sync_page_tags):		P1 (memcpy from virt_to_page):
					  Rflags!=0xff
Wflags=0xff
DMB (doesn't help)
Wtags=0
					  Rtags=0 // fault
Since now the post_alloc_hook() function resets the page->flags tag when unpoisoning is skipped for user pages (including the __GFP_ZEROTAGS case), revert the arm64 commit calling page_kasan_tag_reset().
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Peter Collingbourne <pcc@google.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20220610152141.2148929-5-catalin.marinas@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/arm64/kernel/hibernate.c | 5 -----
 arch/arm64/kernel/mte.c       | 9 ---------
 arch/arm64/mm/copypage.c      | 9 ---------
 arch/arm64/mm/mteswap.c       | 9 ---------
 4 files changed, 32 deletions(-)
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 46a0b4d6e251..db93ce2b0113 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -326,11 +326,6 @@ static void swsusp_mte_restore_tags(void)
 		unsigned long pfn = xa_state.xa_index;
 		struct page *page = pfn_to_online_page(pfn);
 
-		/*
-		 * It is not required to invoke page_kasan_tag_reset(page)
-		 * at this point since the tags stored in page->flags are
-		 * already restored.
-		 */
 		mte_restore_page_tags(page_address(page), tags);
 
 		mte_free_tag_storage(tags);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 7c1c82c8115c..10207e3e5ae2 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -44,15 +44,6 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 	if (!pte_is_tagged)
 		return;
 
-	page_kasan_tag_reset(page);
-	/*
-	 * We need smp_wmb() in between setting the flags and clearing the
-	 * tags because if another thread reads page->flags and builds a
-	 * tagged address out of it, there is an actual dependency to the
-	 * memory access, but on the current thread we do not guarantee that
-	 * the new page->flags are visible before the tags were updated.
-	 */
-	smp_wmb();
 	mte_clear_page_tags(page_address(page));
 }
 
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index 0dea80bf6de4..24913271e898 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -23,15 +23,6 @@ void copy_highpage(struct page *to, struct page *from)
 
 	if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
 		set_bit(PG_mte_tagged, &to->flags);
-		page_kasan_tag_reset(to);
-		/*
-		 * We need smp_wmb() in between setting the flags and clearing the
-		 * tags because if another thread reads page->flags and builds a
-		 * tagged address out of it, there is an actual dependency to the
-		 * memory access, but on the current thread we do not guarantee that
-		 * the new page->flags are visible before the tags were updated.
-		 */
-		smp_wmb();
 		mte_copy_page_tags(kto, kfrom);
 	}
 }
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 7c4ef56265ee..c52c1847079c 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -53,15 +53,6 @@ bool mte_restore_tags(swp_entry_t entry, struct page *page)
 	if (!tags)
 		return false;
 
-	page_kasan_tag_reset(page);
-	/*
-	 * We need smp_wmb() in between setting the flags and clearing the
-	 * tags because if another thread reads page->flags and builds a
-	 * tagged address out of it, there is an actual dependency to the
-	 * memory access, but on the current thread we do not guarantee that
-	 * the new page->flags are visible before the tags were updated.
-	 */
-	smp_wmb();
 	mte_restore_page_tags(page_address(page), tags);
 
 	return true;
From: Jan Kara <jack@suse.cz>
[ Upstream commit fa78f336937240d1bc598db817d638086060e7e9 ]
Add checks verifying that the number of inodes stored in the superblock matches the number computed from the number of inodes per group. Also verify that we have at least one block's worth of inodes per group. This prevents crashes on corrupted filesystems.
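A condensed user-space sketch of the two checks, using hypothetical on-disk values (illustration only; the actual checks are in the diff below):

  #include <stdio.h>

  int main(void)
  {
  	/* hypothetical superblock values */
  	unsigned long blocksize = 1024, inode_size = 128;
  	unsigned long inodes_per_block = blocksize / inode_size;	/* 8 */
  	unsigned long inodes_per_group = 2048;
  	unsigned long groups_count = 8;
  	unsigned long sb_inodes_count = 16384;	/* like es->s_inodes_count */

  	/* at least one block worth of inodes, at most one bitmap block worth */
  	if (inodes_per_group < inodes_per_block ||
  	    inodes_per_group > blocksize * 8)
  		printf("invalid #inodes per group\n");
  	/* the stored total must equal groups * inodes-per-group */
  	else if ((unsigned long long)groups_count * inodes_per_group !=
  		 sb_inodes_count)
  		printf("invalid #inodes count\n");
  	else
  		printf("inode counts are consistent\n");
  	return 0;
  }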
Reported-by: syzbot+d273f7d7f58afd93be48@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/ext2/super.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3d21279fe2cb..fd855574ef09 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1058,9 +1058,10 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 			sbi->s_frags_per_group);
 		goto failed_mount;
 	}
-	if (sbi->s_inodes_per_group > sb->s_blocksize * 8) {
+	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
+	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
 		ext2_msg(sb, KERN_ERR,
-			"error: #inodes per group too big: %lu",
+			"error: invalid #inodes per group: %lu",
 			sbi->s_inodes_per_group);
 		goto failed_mount;
 	}
@@ -1070,6 +1071,13 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_groups_count = ((le32_to_cpu(es->s_blocks_count) -
 				le32_to_cpu(es->s_first_data_block) - 1)
 					/ EXT2_BLOCKS_PER_GROUP(sb)) + 1;
+	if ((u64)sbi->s_groups_count * sbi->s_inodes_per_group !=
+	    le32_to_cpu(es->s_inodes_count)) {
+		ext2_msg(sb, KERN_ERR, "error: invalid #inodes: %u vs computed %llu",
+			 le32_to_cpu(es->s_inodes_count),
+			 (u64)sbi->s_groups_count * sbi->s_inodes_per_group);
+		goto failed_mount;
+	}
 	db_count = (sbi->s_groups_count + EXT2_DESC_PER_BLOCK(sb) - 1) /
 		   EXT2_DESC_PER_BLOCK(sb);
 	sbi->s_group_desc = kmalloc_array(db_count,
From: Chen Yu <yu.c.chen@intel.com>
[ Upstream commit 70fb5ccf2ebb09a0c8ebba775041567812d45f86 ]
[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU when the system is overloaded.
The following histogram is the time spent in select_idle_cpu(), when running 224 instances of netperf on a system with 112 CPUs per LLC domain:
@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |
netperf latency usecs:
=======
case            load            Lat_99th    std%
TCP_RR          thread-224      257.39      (  0.21)
The time spent in select_idle_cpu() is visible to netperf and might have a negative impact.
[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency of select_idle_sibling. The indicators are copied here:
SIS Search Efficiency(se_eff%): A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found.
SIS Domain Search Efficiency(dom_eff%): Similar, except only for the slower SIS path.
SIS Fast Success Rate(fast_rate%): Percentage of SIS that used target, prev or recent CPUs.
SIS Success rate(success_rate%): Percentage of scans that found an idle CPU.
The test is based on Aubrey's schedtests tool, including netperf, hackbench, schbench and tbench.
Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case            load            se_eff%    dom_eff%   fast_rate%  success_rate%
TCP_RR          28 threads      99.978      18.535     99.995     100.000
TCP_RR          56 threads      99.397       5.671     99.964     100.000
TCP_RR          84 threads      21.721       6.818     73.632     100.000
TCP_RR          112 threads     12.500       5.533     59.000     100.000
TCP_RR          140 threads      8.524       4.535     49.020     100.000
TCP_RR          168 threads      6.438       3.945     40.309      99.999
TCP_RR          196 threads      5.397       3.718     32.320      99.982
TCP_RR          224 threads      4.874       3.661     25.775      99.767
UDP_RR          28 threads      99.988      17.704     99.997     100.000
UDP_RR          56 threads      99.528       5.977     99.970     100.000
UDP_RR          84 threads      24.219       6.992     76.479     100.000
UDP_RR          112 threads     13.907       5.706     62.538     100.000
UDP_RR          140 threads      9.408       4.699     52.519     100.000
UDP_RR          168 threads      7.095       4.077     44.352     100.000
UDP_RR          196 threads      5.757       3.775     35.764      99.991
UDP_RR          224 threads      5.124       3.704     28.748      99.860
schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case            load            se_eff%    dom_eff%   fast_rate%  success_rate%
normal          1   mthread     99.152       6.400     99.941     100.000
normal          2   mthreads    97.844       4.003     99.908     100.000
normal          3   mthreads    96.395       2.118     99.917      99.998
normal          4   mthreads    55.288       1.451     98.615      99.804
normal          5   mthreads     7.004       1.870     45.597      61.036
normal          6   mthreads     3.354       1.346     20.777      34.230
normal          7   mthreads     2.183       1.028     11.257      21.055
normal          8   mthreads     1.653       0.825      7.849      15.549
schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case                load         se_eff%    dom_eff%   fast_rate%  success_rate%
process-pipe        1 group      99.991       7.692     99.999     100.000
process-pipe        2 groups     99.934       4.615     99.997     100.000
process-pipe        3 groups     99.597       3.198     99.987     100.000
process-pipe        4 groups     98.378       2.464     99.958     100.000
process-pipe        5 groups     27.474       3.653     89.811      99.800
process-pipe        6 groups     20.201       4.098     82.763      99.570
process-pipe        7 groups     16.423       4.156     77.398      99.316
process-pipe        8 groups     13.165       3.920     72.232      98.828
process-sockets     1 group      99.977       5.882     99.999     100.000
process-sockets     2 groups     99.927       5.505     99.996     100.000
process-sockets     3 groups     99.397       3.250     99.980     100.000
process-sockets     4 groups     79.680       4.258     98.864      99.998
process-sockets     5 groups      7.673       2.503     63.659      92.115
process-sockets     6 groups      4.642       1.584     58.946      88.048
process-sockets     7 groups      3.493       1.379     49.816      81.164
process-sockets     8 groups      3.015       1.407     40.845      75.500
threads-pipe        1 group      99.997       0.000    100.000     100.000
threads-pipe        2 groups     99.894       2.932     99.997     100.000
threads-pipe        3 groups     99.611       4.117     99.983     100.000
threads-pipe        4 groups     97.703       2.624     99.937     100.000
threads-pipe        5 groups     22.919       3.623     87.150      99.764
threads-pipe        6 groups     18.016       4.038     80.491      99.557
threads-pipe        7 groups     14.663       3.991     75.239      99.247
threads-pipe        8 groups     12.242       3.808     70.651      98.644
threads-sockets     1 group      99.990       6.667     99.999     100.000
threads-sockets     2 groups     99.940       5.114     99.997     100.000
threads-sockets     3 groups     99.469       4.115     99.977     100.000
threads-sockets     4 groups     87.528       4.038     99.400     100.000
threads-sockets     5 groups      6.942       2.398     59.244      88.337
threads-sockets     6 groups      4.359       1.954     49.448      87.860
threads-sockets     7 groups      2.845       1.345     41.198      77.102
threads-sockets     8 groups      2.871       1.404     38.512      74.312
schedstat_parse.py -f tbench_vanilla.log
case            load            se_eff%    dom_eff%   fast_rate%  success_rate%
loopback        28 threads      99.976      18.369     99.995     100.000
loopback        56 threads      99.222       7.799     99.934     100.000
loopback        84 threads      19.723       6.819     70.215     100.000
loopback        112 threads     11.283       5.371     55.371      99.999
loopback        140 threads      0.000       0.000      0.000       0.000
loopback        168 threads      0.000       0.000      0.000       0.000
loopback        196 threads      0.000       0.000      0.000       0.000
loopback        224 threads      0.000       0.000      0.000       0.000
According to the test above, if the system becomes busy, the SIS Search Efficiency(se_eff%) drops significantly. Although some benchmarks would finally find an idle CPU(success_rate% = 100%), it is doubtful whether it is worth it to search the whole LLC domain.
[Proposal]
It would be ideal to have a crystal ball to answer this question: how many CPUs must a wakeup path walk down before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in the LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg helps predict the load of the LLC domain more precisely, whereas SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time.
In summary, the lower the util_avg is, the more select_idle_cpu() should scan for an idle CPU, and vice versa. When the sum of util_avg in the LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this corresponds to the imbalance_pct (117) at which an LLC sched group is considered overloaded (100 / 117 ≈ 85%).
Introduce the quadratic function:
y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)

y' is the ratio of CPUs to be scanned in the LLC domain, and the number of
CPUs to scan is calculated by:

nr_scan = llc_weight * y'
The quadratic function was chosen because:

[1] Compared to the linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned. Use a heuristic scan for now.
For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65  75  85  86 ...
scan_nr   112  111  108  102  93  81  65   47  25   1   0 ...
For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65  75  85  86 ...
scan_nr    16   15   15   14  13  11   9    6   3   0   0 ...
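As a sanity check of the two tables above, the fixed-point arithmetic can be reproduced in a small user-space program. The sketch below mirrors equations [1]-[2] with SCHED_CAPACITY_SCALE = 1024 and imbalance_pct = 117; it only illustrates the math and is not the kernel implementation (which lives in update_idle_cpu_scan() in the diff below):

  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024ULL

  /* mirrors the fixed-point math of equations [1]-[2] */
  static unsigned long long nr_scan(unsigned long long sum_util,
  				    unsigned long long llc_weight,
  				    unsigned long long pct)
  {
  	unsigned long long x = sum_util / llc_weight;	/* scaled ratio x' */
  	unsigned long long tmp = x * x * pct * pct;
  	unsigned long long y;

  	tmp /= 10000 * SCHED_CAPACITY_SCALE;		/* p * x^2 term */
  	if (tmp > SCHED_CAPACITY_SCALE)			/* clamp at overload */
  		tmp = SCHED_CAPACITY_SCALE;
  	y = SCHED_CAPACITY_SCALE - tmp;			/* equation [1] */

  	return y * llc_weight / SCHED_CAPACITY_SCALE;	/* equation [2] */
  }

  int main(void)
  {
  	unsigned long long utils[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86 };

  	for (int i = 0; i < 11; i++) {
  		/* sum_util at the given percentage of total LLC capacity */
  		unsigned long long su = utils[i] * 112 * SCHED_CAPACITY_SCALE / 100;

  		printf("sum_util%%=%llu scan_nr=%llu\n",
  		       utils[i], nr_scan(su, 112, 117));
  	}
  	return 0;
  }

For llc_weight = 112 this reproduces the 112/111/.../1/0 row above; substituting 16 reproduces the second table.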
Furthermore, to minimize the overhead of calculating the metrics in select_idle_cpu(), borrow the statistics from periodic load balance. As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util calculated by periodic load balance after 112 ms would decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the latest utilization. But it is a trade-off: checking the util_avg in newidle load balance would be more frequent, but it brings overhead, because multiple CPUs writing/reading the per-LLC shared variable introduces cache contention. Tim also mentioned that it is acceptable to be non-optimal in terms of scheduling for short-term variations, but if there is a long-term trend in the load behavior, the scheduler can adjust for it.
When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested, SIS_UTIL should be enabled by default.
This patch is based on util_avg, which is very sensitive to CPU frequency invariance. There is an issue that, when the max frequency has been clamped, util_avg decays insanely fast when the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo in frequency invariance") could be used to mitigate this symptom, by adjusting arch_max_freq_ratio when turbo is disabled. But this issue is still not thoroughly fixed, because the current code is unaware of the user-specified max CPU frequency.
[Test result]
netperf and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%, 175% and 200% of the CPU number respectively. hackbench and schbench were launched with 1, 2, 4 and 8 groups. Each test lasts for 100 seconds and repeats 3 times.
The following is the benchmark result comparison between baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare% indicates better performance.
Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100

netperf.throughput
=======
case            load            baseline(std%)  compare%( std%)
TCP_RR          28 threads       1.00 (  0.34)   -0.16 (  0.40)
TCP_RR          56 threads       1.00 (  0.19)   -0.02 (  0.20)
TCP_RR          84 threads       1.00 (  0.39)   -0.47 (  0.40)
TCP_RR          112 threads      1.00 (  0.21)   -0.66 (  0.22)
TCP_RR          140 threads      1.00 (  0.19)   -0.69 (  0.19)
TCP_RR          168 threads      1.00 (  0.18)   -0.48 (  0.18)
TCP_RR          196 threads      1.00 (  0.16) +194.70 ( 16.43)
TCP_RR          224 threads      1.00 (  0.16) +197.30 (  7.85)
UDP_RR          28 threads       1.00 (  0.37)   +0.35 (  0.33)
UDP_RR          56 threads       1.00 ( 11.18)   -0.32 (  0.21)
UDP_RR          84 threads       1.00 (  1.46)   -0.98 (  0.32)
UDP_RR          112 threads      1.00 ( 28.85)   -2.48 ( 19.61)
UDP_RR          140 threads      1.00 (  0.70)   -0.71 ( 14.04)
UDP_RR          168 threads      1.00 ( 14.33)   -0.26 ( 11.16)
UDP_RR          196 threads      1.00 ( 12.92) +186.92 ( 20.93)
UDP_RR          224 threads      1.00 ( 11.74) +196.79 ( 18.62)
Take the 224 threads as an example, the SIS search metrics changes are illustrated below:
      vanilla                    patched
    4544492        +237.5%     15338634   sched_debug.cpu.sis_domain_search.avg
      38539      +39686.8%     15333634   sched_debug.cpu.sis_failed.avg
  128300000         -87.9%     15551326   sched_debug.cpu.sis_scanned.avg
    5842896        +162.7%     15347978   sched_debug.cpu.sis_search.avg
There are -87.9% fewer CPU scans after the patch, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested.try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement: the patch allows the waking task to remain on its previous CPU rather than grabbing other CPUs' locks.
Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100

hackbench.throughput
=========
case                load          baseline(std%)  compare%( std%)
process-pipe        1 group        1.00 (  1.29)   +0.57 (  0.47)
process-pipe        2 groups       1.00 (  0.27)   +0.77 (  0.81)
process-pipe        4 groups       1.00 (  0.26)   +1.17 (  0.02)
process-pipe        8 groups       1.00 (  0.15)   -4.79 (  0.02)
process-sockets     1 group        1.00 (  0.63)   -0.92 (  0.13)
process-sockets     2 groups       1.00 (  0.03)   -0.83 (  0.14)
process-sockets     4 groups       1.00 (  0.40)   +5.20 (  0.26)
process-sockets     8 groups       1.00 (  0.04)   +3.52 (  0.03)
threads-pipe        1 group        1.00 (  1.28)   +0.07 (  0.14)
threads-pipe        2 groups       1.00 (  0.22)   -0.49 (  0.74)
threads-pipe        4 groups       1.00 (  0.05)   +1.88 (  0.13)
threads-pipe        8 groups       1.00 (  0.09)   -4.90 (  0.06)
threads-sockets     1 group        1.00 (  0.25)   -0.70 (  0.53)
threads-sockets     2 groups       1.00 (  0.10)   -0.63 (  0.26)
threads-sockets     4 groups       1.00 (  0.19)  +11.92 (  0.24)
threads-sockets     8 groups       1.00 (  0.08)   +4.31 (  0.11)
Each tbench test is a:
tbench -t 100 $job 127.0.0.1

tbench.throughput
======
case            load            baseline(std%)  compare%( std%)
loopback        28 threads       1.00 (  0.06)   -0.14 (  0.09)
loopback        56 threads       1.00 (  0.03)   -0.04 (  0.17)
loopback        84 threads       1.00 (  0.05)   +0.36 (  0.13)
loopback        112 threads      1.00 (  0.03)   +0.51 (  0.03)
loopback        140 threads      1.00 (  0.02)   -1.67 (  0.19)
loopback        168 threads      1.00 (  0.38)   +1.27 (  0.27)
loopback        196 threads      1.00 (  0.11)   +1.34 (  0.17)
loopback        224 threads      1.00 (  0.11)   +1.67 (  0.22)
Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000

schbench.latency_90%_us
========
case            load            baseline(std%)  compare%( std%)
normal          1 mthread        1.00 ( 31.22)   -7.36 ( 20.25)*
normal          2 mthreads       1.00 (  2.45)   -0.48 (  1.79)
normal          4 mthreads       1.00 (  1.69)   +0.45 (  0.64)
normal          8 mthreads       1.00 (  5.47)   +9.81 ( 14.28)
*Considering the standard deviation, this -7.36% regression might not be valid.
Also, an OLTP workload with a commercial RDBMS has been tested, and there is no significant change.
There were concerns that unbalanced tasks among CPUs would cause problems. For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle:
          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0
Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%, select_idle_cpu() will not scan, so CPU7 goes undetected during the scan. But according to Mel, it is unlikely that CPU7 will stay idle all the time, because it could pull some tasks via CPU_NEWLY_IDLE load balancing.
lkp (kernel test robot) has reported a regression of stress-ng.sock on a very busy system. According to the sched_debug statistics, it might be caused by SIS_UTIL terminating the scan and choosing a previous CPU earlier, which might introduce more context switches, especially involuntary preemption, and that impacts a busy stress-ng. This regression shows that not all benchmarks in every scenario benefit from the idle CPU scan limit, and it needs further investigation.
Besides, there is a slight regression in hackbench's 16-groups case when the LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively in an LLC domain with 16 CPUs, because the cost of searching for an idle CPU among 16 is negligible. The current patch aims to propose a generic solution and only considers util_avg. Something like the below could be applied on top of the current patch to fulfill the requirement:
	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;
For an LLC domain with 16 CPUs, nr_scan will be doubled. The smaller the number of CPUs the LLC domain has, the more nr_scan will be expanded. This needs further investigation.
There is also ongoing work [2] from Abel to filter out busy CPUs during wakeup, to further speed up the idle CPU scan. It could be a follow-up optimization on top of this change.
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 87 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  3 +-
 3 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..63a04a65e310 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,6 +74,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcbacc35d2b9..a853e4e9e3c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,6 +6280,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6319,6 +6320,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
+	if (sched_feat(SIS_UTIL)) {
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+		if (sd_share) {
+			/* because !--nr is the condition to stop scan */
+			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			/* overloaded LLC is unlikely to have idle cpu/core */
+			if (nr == 1)
+				return -1;
+		}
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -9166,6 +9178,77 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static void update_idle_cpu_scan(struct lb_env *env,
+				 unsigned long sum_util)
+{
+	struct sched_domain_shared *sd_share;
+	int llc_weight, pct;
+	u64 x, y, tmp;
+	/*
+	 * Update the number of CPUs to scan in LLC domain, which could
+	 * be used as a hint in select_idle_cpu(). The update of sd_share
+	 * could be expensive because it is within a shared cache line.
+	 * So the write of this hint only occurs during periodic load
+	 * balancing, rather than CPU_NEWLY_IDLE, because the latter
+	 * can fire way more frequently than the former.
+	 */
+	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
+	if (env->sd->span_weight != llc_weight)
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+	if (!sd_share)
+		return;
+
+	/*
+	 * The number of CPUs to search drops as sum_util increases, when
+	 * sum_util hits 85% or above, the scan stops.
+	 * The reason to choose 85% as the threshold is because this is the
+	 * imbalance_pct(117) when a LLC sched group is overloaded.
+	 *
+	 * let y = SCHED_CAPACITY_SCALE - p * x^2                       [1]
+	 * and y'= y / SCHED_CAPACITY_SCALE
+	 *
+	 * x is the ratio of sum_util compared to the CPU capacity:
+	 * x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
+	 * y' is the ratio of CPUs to be scanned in the LLC domain,
+	 * and the number of CPUs to scan is calculated by:
+	 *
+	 * nr_scan = llc_weight * y'                                    [2]
+	 *
+	 * When x hits the threshold of overloaded, AKA, when
+	 * x = 100 / pct, y drops to 0. According to [1],
+	 * p should be SCHED_CAPACITY_SCALE * pct^2 / 10000
+	 *
+	 * Scale x by SCHED_CAPACITY_SCALE:
+	 * x' = sum_util / llc_weight;                                  [3]
+	 *
+	 * and finally [1] becomes:
+	 * y = SCHED_CAPACITY_SCALE -
+	 *     x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE)            [4]
+	 *
+	 */
+	/* equation [3] */
+	x = sum_util;
+	do_div(x, llc_weight);
+
+	/* equation [4] */
+	pct = env->sd->imbalance_pct;
+	tmp = x * x * pct * pct;
+	do_div(tmp, 10000 * SCHED_CAPACITY_SCALE);
+	tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE);
+	y = SCHED_CAPACITY_SCALE - tmp;
+
+	/* equation [2] */
+	y *= llc_weight;
+	do_div(y, SCHED_CAPACITY_SCALE);
+	if ((int)y != sd_share->nr_idle_scan)
+		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -9178,6 +9261,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
+	unsigned long sum_util = 0;
 	int sg_status = 0;
 
 	do {
@@ -9210,6 +9294,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
+		sum_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
@@ -9235,6 +9320,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
+
+	update_idle_cpu_scan(env, sum_util);
 }
 
 #define NUMA_IMBALANCE_MIN 2
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..c4947c1b5edb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -55,7 +55,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_PROP, false)
+SCHED_FEAT(SIS_UTIL, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
From: Antonio Borneo <antonio.borneo@foss.st.com>
[ Upstream commit 95001b756467ecc9f5973eb5e74e97699d9bbdf1 ]
Function irq_chip::irq_request_resources() is reported as optional in the declaration of struct irq_chip. If the parent irq_chip does not implement it, we should ignore it and return.
Don't return an error if the function is missing.
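In other words, an optional callback in an ops structure is simply skipped when absent. A generic sketch of the pattern with hypothetical names (not the irqchip API itself):

  struct ops {
  	int (*request_resources)(void *ctx);	/* optional */
  };

  static int call_request_resources(const struct ops *o, void *ctx)
  {
  	if (o->request_resources)
  		return o->request_resources(ctx);

  	return 0;	/* optional and absent: nothing to do, not an error */
  }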
Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220512160544.13561-1-antonio.borneo@foss.st.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/irq/chip.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index a98bcfc4be7b..f3920374f71c 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -1516,7 +1516,8 @@ int irq_chip_request_resources_parent(struct irq_data *data)
 	if (data->chip->irq_request_resources)
 		return data->chip->irq_request_resources(data);
 
-	return -ENOSYS;
+	/* no error on missing optional irq_chip::irq_request_resources */
+	return 0;
 }
 EXPORT_SYMBOL_GPL(irq_chip_request_resources_parent);
From: Samuel Holland <samuel@sholland.org>
[ Upstream commit 8190cc572981f2f13b6ffc26c7cfa7899e5d3ccc ]
The MIPS GIC irqchip driver may be selected in a uniprocessor configuration, but it unconditionally registers an IPI domain.
Limit the part of the driver dealing with IPIs to only be compiled when GENERIC_IRQ_IPI is enabled, which corresponds to an SMP configuration.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Samuel Holland <samuel@sholland.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220701200056.46555-2-samuel@sholland.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/irqchip/Kconfig        |  3 +-
 drivers/irqchip/irq-mips-gic.c | 80 +++++++++++++++++++++++-----------
 2 files changed, 56 insertions(+), 27 deletions(-)
diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index aca7b595c4c7..8f9c52873338 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -304,7 +304,8 @@ config KEYSTONE_IRQ
 
 config MIPS_GIC
 	bool
-	select GENERIC_IRQ_IPI
+	select GENERIC_IRQ_IPI if SMP
+	select IRQ_DOMAIN_HIERARCHY
 	select MIPS_CM
 
 config INGENIC_IRQ
diff --git a/drivers/irqchip/irq-mips-gic.c b/drivers/irqchip/irq-mips-gic.c
index 54c7092cc61d..f03f47ffea1e 100644
--- a/drivers/irqchip/irq-mips-gic.c
+++ b/drivers/irqchip/irq-mips-gic.c
@@ -51,13 +51,15 @@ static DEFINE_PER_CPU_READ_MOSTLY(unsigned long[GIC_MAX_LONGS], pcpu_masks);
 
 static DEFINE_SPINLOCK(gic_lock);
 static struct irq_domain *gic_irq_domain;
-static struct irq_domain *gic_ipi_domain;
 static int gic_shared_intrs;
 static unsigned int gic_cpu_pin;
 static unsigned int timer_cpu_pin;
 static struct irq_chip gic_level_irq_controller, gic_edge_irq_controller;
+
+#ifdef CONFIG_GENERIC_IRQ_IPI
 static DECLARE_BITMAP(ipi_resrv, GIC_MAX_INTRS);
 static DECLARE_BITMAP(ipi_available, GIC_MAX_INTRS);
+#endif /* CONFIG_GENERIC_IRQ_IPI */
 
 static struct gic_all_vpes_chip_data {
 	u32	map;
@@ -460,9 +462,11 @@ static int gic_irq_domain_map(struct irq_domain *d, unsigned int virq,
 	u32 map;
 
 	if (hwirq >= GIC_SHARED_HWIRQ_BASE) {
+#ifdef CONFIG_GENERIC_IRQ_IPI
 		/* verify that shared irqs don't conflict with an IPI irq */
 		if (test_bit(GIC_HWIRQ_TO_SHARED(hwirq), ipi_resrv))
 			return -EBUSY;
+#endif /* CONFIG_GENERIC_IRQ_IPI */
 
 		err = irq_domain_set_hwirq_and_chip(d, virq, hwirq,
 						    &gic_level_irq_controller,
@@ -551,6 +555,8 @@ static const struct irq_domain_ops gic_irq_domain_ops = {
 	.map = gic_irq_domain_map,
 };
 
+#ifdef CONFIG_GENERIC_IRQ_IPI
+
 static int gic_ipi_domain_xlate(struct irq_domain *d, struct device_node *ctrlr,
 				const u32 *intspec, unsigned int intsize,
 				irq_hw_number_t *out_hwirq,
@@ -654,6 +660,48 @@ static const struct irq_domain_ops gic_ipi_domain_ops = {
 	.match = gic_ipi_domain_match,
 };
 
+static int gic_register_ipi_domain(struct device_node *node)
+{
+	struct irq_domain *gic_ipi_domain;
+	unsigned int v[2], num_ipis;
+
+	gic_ipi_domain = irq_domain_add_hierarchy(gic_irq_domain,
+						  IRQ_DOMAIN_FLAG_IPI_PER_CPU,
+						  GIC_NUM_LOCAL_INTRS + gic_shared_intrs,
+						  node, &gic_ipi_domain_ops, NULL);
+	if (!gic_ipi_domain) {
+		pr_err("Failed to add IPI domain");
+		return -ENXIO;
+	}
+
+	irq_domain_update_bus_token(gic_ipi_domain, DOMAIN_BUS_IPI);
+
+	if (node &&
+	    !of_property_read_u32_array(node, "mti,reserved-ipi-vectors", v, 2)) {
+		bitmap_set(ipi_resrv, v[0], v[1]);
+	} else {
+		/*
+		 * Reserve 2 interrupts per possible CPU/VP for use as IPIs,
+		 * meeting the requirements of arch/mips SMP.
+		 */
+		num_ipis = 2 * num_possible_cpus();
+		bitmap_set(ipi_resrv, gic_shared_intrs - num_ipis, num_ipis);
+	}
+
+	bitmap_copy(ipi_available, ipi_resrv, GIC_MAX_INTRS);
+
+	return 0;
+}
+
+#else /* !CONFIG_GENERIC_IRQ_IPI */
+
+static inline int gic_register_ipi_domain(struct device_node *node)
+{
+	return 0;
+}
+
+#endif /* !CONFIG_GENERIC_IRQ_IPI */
+
 static int gic_cpu_startup(unsigned int cpu)
 {
 	/* Enable or disable EIC */
@@ -672,11 +720,12 @@ static int gic_cpu_startup(unsigned int cpu)
 static int __init gic_of_init(struct device_node *node,
 			      struct device_node *parent)
 {
-	unsigned int cpu_vec, i, gicconfig, v[2], num_ipis;
+	unsigned int cpu_vec, i, gicconfig;
 	unsigned long reserved;
 	phys_addr_t gic_base;
 	struct resource res;
 	size_t gic_len;
+	int ret;
 
 	/* Find the first available CPU vector. */
 	i = 0;
@@ -765,30 +814,9 @@ static int __init gic_of_init(struct device_node *node,
 		return -ENXIO;
 	}
 
-	gic_ipi_domain = irq_domain_add_hierarchy(gic_irq_domain,
-						  IRQ_DOMAIN_FLAG_IPI_PER_CPU,
-						  GIC_NUM_LOCAL_INTRS + gic_shared_intrs,
-						  node, &gic_ipi_domain_ops, NULL);
-	if (!gic_ipi_domain) {
-		pr_err("Failed to add IPI domain");
-		return -ENXIO;
-	}
-
-	irq_domain_update_bus_token(gic_ipi_domain, DOMAIN_BUS_IPI);
-
-	if (node &&
-	    !of_property_read_u32_array(node, "mti,reserved-ipi-vectors", v, 2)) {
-		bitmap_set(ipi_resrv, v[0], v[1]);
-	} else {
-		/*
-		 * Reserve 2 interrupts per possible CPU/VP for use as IPIs,
-		 * meeting the requirements of arch/mips SMP.
-		 */
-		num_ipis = 2 * num_possible_cpus();
-		bitmap_set(ipi_resrv, gic_shared_intrs - num_ipis, num_ipis);
-	}
-
-	bitmap_copy(ipi_available, ipi_resrv, GIC_MAX_INTRS);
+	ret = gic_register_ipi_domain(node);
+	if (ret)
+		return ret;
 
 	board_bind_eic_interrupt = &gic_bind_eic_interrupt;
From: Samuel Holland samuel@sholland.org
[ Upstream commit 0f5209fee90b4544c58b4278d944425292789967 ]
The generic IPI code depends on the IRQ affinity mask being allocated and initialized. This will not be the case if SMP is disabled. Fix up the remaining driver that selected GENERIC_IRQ_IPI in a non-SMP config.
Reported-by: kernel test robot lkp@intel.com Signed-off-by: Samuel Holland samuel@sholland.org Signed-off-by: Marc Zyngier maz@kernel.org Link: https://lore.kernel.org/r/20220701200056.46555-3-samuel@sholland.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/irqchip/Kconfig | 2 +- kernel/irq/Kconfig | 1 + 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 8f9c52873338..ae1b9f59abc5 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -171,7 +171,7 @@ config MADERA_IRQ config IRQ_MIPS_CPU bool select GENERIC_IRQ_CHIP - select GENERIC_IRQ_IPI if SYS_SUPPORTS_MULTITHREADING + select GENERIC_IRQ_IPI if SMP && SYS_SUPPORTS_MULTITHREADING select IRQ_DOMAIN select GENERIC_IRQ_EFFECTIVE_AFF_MASK
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index fbc54c2a7f23..00d58588ea95 100644 --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -82,6 +82,7 @@ config IRQ_FASTEOI_HIERARCHY_HANDLERS # Generic IRQ IPI support config GENERIC_IRQ_IPI bool + depends on SMP select IRQ_DOMAIN_HIERARCHY
# Generic MSI interrupt support
From: John Keeping john@metanate.com
[ Upstream commit 401e4963bf45c800e3e9ea0d3a0289d738005fd4 ]
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two normal priority tasks (SCHED_OTHER, nice level zero):
INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
      Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>] (rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0:                                     00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
INFO: task tar:2083 blocked for more than 491 seconds.
      Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
Here the kworker is waiting on msdos_sb_info::s_lock, which is held by tar, which is in turn waiting for a buffer that is locked pending flush; but that flush operation is plugged in the kworker.
The lock is a normal struct mutex, so tsk_is_pi_blocked() will always return false on !RT and thus the behaviour changes for RT.
It seems that the intent here is to skip blk_flush_plug() in the case where a non-preemptible lock (such as a spinlock) has been converted to a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT schedule flag. But sched_submit_work() is only called from schedule() which is never called in this scenario, so the check can simply be deleted.
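For reference, the two scheduling entry points involved look roughly like this (a condensed sketch of kernel/sched/core.c around this kernel version, not a verbatim copy):

asmlinkage __visible void __sched schedule(void)
{
        struct task_struct *tsk = current;

        /* This is where plugged IO is flushed before the task blocks. */
        sched_submit_work(tsk);
        do {
                preempt_disable();
                __schedule(SM_NONE);
                sched_preempt_enable_no_resched();
        } while (need_resched());
        sched_update_worker(tsk);
}

#ifdef CONFIG_PREEMPT_RT
/*
 * Blocking on an RT-converted spinlock/rwlock enters here and passes
 * SM_RTLOCK_WAIT to __schedule() directly, without ever running
 * sched_submit_work(), so the tsk_is_pi_blocked() check there can
 * never cover this case.
 */
void __sched notrace schedule_rtlock(void)
{
        do {
                preempt_disable();
                __schedule(SM_RTLOCK_WAIT);
                sched_preempt_enable_no_resched();
        } while (need_resched());
}
#endif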
Looking at the history of the -rt patchset, in fact this change was present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b ("sched/core: Rework the __schedule() preempt argument").
As described in [1]:
The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock):

- rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section.
- spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section.
and with the tsk_is_pi_blocked() in the scheduler path, we hit the first issue.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/p...
Signed-off-by: John Keeping john@metanate.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Steven Rostedt (Google) rostedt@goodmis.org Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/sched/rt.h | 8 -------- kernel/sched/core.c | 8 ++++++-- 2 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h index e5af028c08b4..994c25640e15 100644 --- a/include/linux/sched/rt.h +++ b/include/linux/sched/rt.h @@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p) } extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task); extern void rt_mutex_adjust_pi(struct task_struct *p); -static inline bool tsk_is_pi_blocked(struct task_struct *tsk) -{ - return tsk->pi_blocked_on != NULL; -} #else static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task) { return NULL; } # define rt_mutex_adjust_pi(p) do { } while (0) -static inline bool tsk_is_pi_blocked(struct task_struct *tsk) -{ - return false; -} #endif
extern void normalize_rt_tasks(void); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b89ca5c83143..012c037da58a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6379,8 +6379,12 @@ static inline void sched_submit_work(struct task_struct *tsk) preempt_enable_no_resched(); }
- if (tsk_is_pi_blocked(tsk)) - return; + /* + * spinlock and rwlock must not flush block requests. This will + * deadlock if the callback attempts to acquire a lock which is + * already acquired. + */ + SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
/* * If we are going to sleep and we have plugged IO queued,
From: William Dean williamsukatube@163.com
[ Upstream commit 71349cc85e5930dce78ed87084dee098eba24b59 ]
The function ioremap() in gic_of_init() can fail, so its return value should be checked.
Reported-by: Hacash Robot hacashRobot@santino.com Signed-off-by: William Dean williamsukatube@163.com Signed-off-by: Marc Zyngier maz@kernel.org Link: https://lore.kernel.org/r/20220723100128.2964304-1-williamsukatube@163.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/irqchip/irq-mips-gic.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/irqchip/irq-mips-gic.c b/drivers/irqchip/irq-mips-gic.c index f03f47ffea1e..d815285f1efe 100644 --- a/drivers/irqchip/irq-mips-gic.c +++ b/drivers/irqchip/irq-mips-gic.c @@ -767,6 +767,10 @@ static int __init gic_of_init(struct device_node *node, }
mips_gic_base = ioremap(gic_base, gic_len); + if (!mips_gic_base) { + pr_err("Failed to ioremap gic_base\n"); + return -ENOMEM; + }
gicconfig = read_gic_config(); gic_shared_intrs = gicconfig & GIC_CONFIG_NUMINTERRUPTS;
From: Juri Lelli juri.lelli@redhat.com
[ Upstream commit cceeeb6a6d02e7b9a74ddd27a3225013b34174aa ]
Changes to hrtimer mode (potentially made by __hrtimer_init_sleeper on PREEMPT_RT) are not visible to hrtimer_start_range_ns, thus not accounted for by hrtimer_start_expires call paths. In particular, __wait_event_hrtimeout suffers from this problem as we have, for example:
fs/aio.c::read_events
  wait_event_interruptible_hrtimeout
    __wait_event_hrtimeout
      hrtimer_init_sleeper_on_stack <- this might "mode |= HRTIMER_MODE_HARD" on RT if task runs at RT/DL priority
      hrtimer_start_range_ns
        WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard) fires since the latter doesn't see the change of mode done by init_sleeper
Fix it by making __wait_event_hrtimeout call hrtimer_sleeper_start_expires, which is aware of the special RT/DL case, instead of hrtimer_start_range_ns.
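The mode adjustment that goes unseen happens in the sleeper initialization; a shortened sketch of __hrtimer_init_sleeper() from kernel/time/hrtimer.c (abbreviated, comments mine) shows where the caller's local mode value goes stale:

static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
                                   clockid_t clock_id, enum hrtimer_mode mode)
{
        /*
         * On PREEMPT_RT, sleepers for RT/DL tasks must expire in hard
         * interrupt context, so the mode is silently upgraded here.
         * The caller's local copy of 'mode' never sees this change.
         */
        if (IS_ENABLED(CONFIG_PREEMPT_RT) && task_is_realtime(current))
                mode |= HRTIMER_MODE_HARD;

        __hrtimer_init(&sl->timer, clock_id, mode);
        sl->timer.function = hrtimer_wakeup;
        sl->task = current;
}

hrtimer_sleeper_start_expires() re-checks the sleeper's is_hard flag and forces HRTIMER_MODE_HARD accordingly, which is why routing the start through it keeps the two sides consistent.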
Reported-by: Bruno Goncalves bgoncalv@redhat.com Signed-off-by: Juri Lelli juri.lelli@redhat.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Daniel Bristot de Oliveira bristot@kernel.org Reviewed-by: Valentin Schneider vschneid@redhat.com Link: https://lore.kernel.org/r/20220627095051.42470-1-juri.lelli@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/wait.h | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/linux/wait.h b/include/linux/wait.h index d22cf2985b8f..21044562aab7 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -544,10 +544,11 @@ do { \ \ hrtimer_init_sleeper_on_stack(&__t, CLOCK_MONOTONIC, \ HRTIMER_MODE_REL); \ - if ((timeout) != KTIME_MAX) \ - hrtimer_start_range_ns(&__t.timer, timeout, \ - current->timer_slack_ns, \ - HRTIMER_MODE_REL); \ + if ((timeout) != KTIME_MAX) { \ + hrtimer_set_expires_range_ns(&__t.timer, timeout, \ + current->timer_slack_ns); \ + hrtimer_sleeper_start_expires(&__t, HRTIMER_MODE_REL); \ + } \ \ __ret = ___wait_event(wq_head, condition, state, 0, 0, \ if (!__t.task) { \
From: Alexander Stein alexander.stein@ew.tq-group.com
[ Upstream commit 5655699cf5cff9f4c4ee703792156bdd05d1addf ]
All 3 properties are required by sram.yaml. Fixes the dtbs_check warning:

sram@900000: '#address-cells' is a required property
sram@900000: '#size-cells' is a required property
sram@900000: 'ranges' is a required property
Signed-off-by: Alexander Stein alexander.stein@ew.tq-group.com Signed-off-by: Shawn Guo shawnguo@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/imx6ul.dtsi | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi index afeec01f6522..1d435a46fc5c 100644 --- a/arch/arm/boot/dts/imx6ul.dtsi +++ b/arch/arm/boot/dts/imx6ul.dtsi @@ -149,6 +149,9 @@ soc { ocram: sram@900000 { compatible = "mmio-sram"; reg = <0x00900000 0x20000>; + ranges = <0 0x00900000 0x20000>; + #address-cells = <1>; + #size-cells = <1>; };
intc: interrupt-controller@a01000 {
From: Alexander Stein alexander.stein@ew.tq-group.com
[ Upstream commit edb67843983bbdf61b4c8c3c50618003d38bb4ae ]
operating-points is a uint32-matrix as per opp-v1.yaml. Change it accordingly. While at it, change fsl,soc-operating-points as well, although there is no bindings file (yet). But they should have the same format. Fixes the dt_binding_check warning:

cpu@0: operating-points:0: [696000, 1275000, 528000, 1175000, 396000, 1025000, 198000, 950000] is too long
cpu@0: operating-points:0: Additional items are not allowed (528000, 1175000, 396000, 1025000, 198000, 950000 were unexpected)
Signed-off-by: Alexander Stein alexander.stein@ew.tq-group.com Signed-off-by: Shawn Guo shawnguo@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/imx6ul.dtsi | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi index 1d435a46fc5c..2fcbd9d91521 100644 --- a/arch/arm/boot/dts/imx6ul.dtsi +++ b/arch/arm/boot/dts/imx6ul.dtsi @@ -64,20 +64,18 @@ cpu0: cpu@0 { clock-frequency = <696000000>; clock-latency = <61036>; /* two CLK32 periods */ #cooling-cells = <2>; - operating-points = < + operating-points = /* kHz uV */ - 696000 1275000 - 528000 1175000 - 396000 1025000 - 198000 950000 - >; - fsl,soc-operating-points = < + <696000 1275000>, + <528000 1175000>, + <396000 1025000>, + <198000 950000>; + fsl,soc-operating-points = /* KHz uV */ - 696000 1275000 - 528000 1175000 - 396000 1175000 - 198000 1175000 - >; + <696000 1275000>, + <528000 1175000>, + <396000 1175000>, + <198000 1175000>; clocks = <&clks IMX6UL_CLK_ARM>, <&clks IMX6UL_CLK_PLL2_BUS>, <&clks IMX6UL_CLK_PLL2_PFD2>,
From: Alexander Stein alexander.stein@ew.tq-group.com
[ Upstream commit 7d15e0c9a515494af2e3199741cdac7002928a0e ]
According to the binding, the compatible shall only contain the imx6ul and imx21 compatibles. Fixes the dt_binding_check warning:

keypad@20b8000: compatible: 'oneOf' conditional failed, one must be fixed:
        ['fsl,imx6ul-kpp', 'fsl,imx6q-kpp', 'fsl,imx21-kpp'] is too long
        Additional items are not allowed ('fsl,imx6q-kpp', 'fsl,imx21-kpp' were unexpected)
        Additional items are not allowed ('fsl,imx21-kpp' was unexpected)
        'fsl,imx21-kpp' was expected
Signed-off-by: Alexander Stein alexander.stein@ew.tq-group.com Signed-off-by: Shawn Guo shawnguo@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/imx6ul.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi index 2fcbd9d91521..df8b4ad62418 100644 --- a/arch/arm/boot/dts/imx6ul.dtsi +++ b/arch/arm/boot/dts/imx6ul.dtsi @@ -544,7 +544,7 @@ fec2: ethernet@20b4000 { };
kpp: keypad@20b8000 { - compatible = "fsl,imx6ul-kpp", "fsl,imx6q-kpp", "fsl,imx21-kpp"; + compatible = "fsl,imx6ul-kpp", "fsl,imx21-kpp"; reg = <0x020b8000 0x4000>; interrupts = <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>; clocks = <&clks IMX6UL_CLK_KPP>;
From: Alexander Stein alexander.stein@ew.tq-group.com
[ Upstream commit e0aca931a2c7c29c88ebf37f9c3cd045e083483d ]
"fsl,imx6ul-csi" was never listed as compatible to "fsl,imx7-csi", neither in yaml bindings, nor previous txt binding. Remove the imx7 part. Fixes the dt schema check warning: csi@21c4000: compatible: 'oneOf' conditional failed, one must be fixed: ['fsl,imx6ul-csi', 'fsl,imx7-csi'] is too long Additional items are not allowed ('fsl,imx7-csi' was unexpected) 'fsl,imx8mm-csi' was expected
Signed-off-by: Alexander Stein alexander.stein@ew.tq-group.com Signed-off-by: Shawn Guo shawnguo@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/imx6ul.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi index df8b4ad62418..367657a9a99f 100644 --- a/arch/arm/boot/dts/imx6ul.dtsi +++ b/arch/arm/boot/dts/imx6ul.dtsi @@ -999,7 +999,7 @@ cpu_speed_grade: speed-grade@10 { };
csi: csi@21c4000 { - compatible = "fsl,imx6ul-csi", "fsl,imx7-csi"; + compatible = "fsl,imx6ul-csi"; reg = <0x021c4000 0x4000>; interrupts = <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>; clocks = <&clks IMX6UL_CLK_CSI>;
From: Alexander Stein alexander.stein@ew.tq-group.com
[ Upstream commit 1a884d17ca324531634cce82e9f64c0302bdf7de ]
In the yaml binding "fsl,imx6ul-lcdif" is listed as compatible to imx6sx-lcdif, but not imx28-lcdif. Change the list accordingly. Fixes the dt_binding_check warning:

lcdif@21c8000: compatible: 'oneOf' conditional failed, one must be fixed:
        ['fsl,imx6ul-lcdif', 'fsl,imx28-lcdif'] is too long
        Additional items are not allowed ('fsl,imx28-lcdif' was unexpected)
        'fsl,imx6ul-lcdif' is not one of ['fsl,imx23-lcdif', 'fsl,imx28-lcdif', 'fsl,imx6sx-lcdif']
        'fsl,imx6sx-lcdif' was expected
Signed-off-by: Alexander Stein alexander.stein@ew.tq-group.com Signed-off-by: Shawn Guo shawnguo@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/imx6ul.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi index 367657a9a99f..bc6548058d8c 100644 --- a/arch/arm/boot/dts/imx6ul.dtsi +++ b/arch/arm/boot/dts/imx6ul.dtsi @@ -1008,7 +1008,7 @@ csi: csi@21c4000 { };
lcdif: lcdif@21c8000 { - compatible = "fsl,imx6ul-lcdif", "fsl,imx28-lcdif"; + compatible = "fsl,imx6ul-lcdif", "fsl,imx6sx-lcdif"; reg = <0x021c8000 0x4000>; interrupts = <GIC_SPI 5 IRQ_TYPE_LEVEL_HIGH>; clocks = <&clks IMX6UL_CLK_LCDIF_PIX>,
From: Alexander Stein alexander.stein@ew.tq-group.com
[ Upstream commit 0c6cf86e1ab433b2d421880fdd9c6e954f404948 ]
imx6ul is not compatible with imx6sx; the two have different errata. Fixes the dt_binding_check warning:

spi@21e0000: compatible: 'oneOf' conditional failed, one must be fixed:
        ['fsl,imx6ul-qspi', 'fsl,imx6sx-qspi'] is too long
        Additional items are not allowed ('fsl,imx6sx-qspi' was unexpected)
        'fsl,imx6ul-qspi' is not one of ['fsl,ls1043a-qspi']
        'fsl,imx6ul-qspi' is not one of ['fsl,imx8mq-qspi']
        'fsl,ls1021a-qspi' was expected
        'fsl,imx7d-qspi' was expected
Signed-off-by: Alexander Stein alexander.stein@ew.tq-group.com Signed-off-by: Shawn Guo shawnguo@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/imx6ul.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi index bc6548058d8c..eca8bf89ab88 100644 --- a/arch/arm/boot/dts/imx6ul.dtsi +++ b/arch/arm/boot/dts/imx6ul.dtsi @@ -1029,7 +1029,7 @@ pxp: pxp@21cc000 { qspi: spi@21e0000 { #address-cells = <1>; #size-cells = <0>; - compatible = "fsl,imx6ul-qspi", "fsl,imx6sx-qspi"; + compatible = "fsl,imx6ul-qspi"; reg = <0x021e0000 0x4000>, <0x60000000 0x10000000>; reg-names = "QuadSPI", "QuadSPI-memory"; interrupts = <GIC_SPI 107 IRQ_TYPE_LEVEL_HIGH>;
From: Linus Walleij linus.walleij@linaro.org
[ Upstream commit 0b2152e428ab91533a02888ff24e52e788dc4637 ]
This was fixed incorrectly before, so fix it again. Now verified using the iio-sensor-proxy monitor-sensor test program.
Link: https://lore.kernel.org/r/20220611204249.472250-1-linus.walleij@linaro.org Signed-off-by: Linus Walleij linus.walleij@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/ste-ux500-samsung-codina.dts | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/boot/dts/ste-ux500-samsung-codina.dts b/arch/arm/boot/dts/ste-ux500-samsung-codina.dts index 952606e607ed..ce62ba877da1 100644 --- a/arch/arm/boot/dts/ste-ux500-samsung-codina.dts +++ b/arch/arm/boot/dts/ste-ux500-samsung-codina.dts @@ -544,8 +544,8 @@ lisd3dh@19 { reg = <0x19>; vdd-supply = <&ab8500_ldo_aux1_reg>; // 3V vddio-supply = <&ab8500_ldo_aux2_reg>; // 1.8V - mount-matrix = "0", "-1", "0", - "1", "0", "0", + mount-matrix = "0", "1", "0", + "-1", "0", "0", "0", "0", "1"; }; };
From: Linus Walleij linus.walleij@linaro.org
[ Upstream commit e24c75f02a81d6ddac0072cbd7a03e799c19d558 ]
This was fixed incorrectly before, so fix it. Now verified using the iio-sensor-proxy monitor-sensor test program.
Link: https://lore.kernel.org/r/20220611205138.491513-1-linus.walleij@linaro.org Signed-off-by: Linus Walleij linus.walleij@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/ste-ux500-samsung-gavini.dts | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/boot/dts/ste-ux500-samsung-gavini.dts b/arch/arm/boot/dts/ste-ux500-samsung-gavini.dts index fabc390ccb0c..6c9e812ef03f 100644 --- a/arch/arm/boot/dts/ste-ux500-samsung-gavini.dts +++ b/arch/arm/boot/dts/ste-ux500-samsung-gavini.dts @@ -502,8 +502,8 @@ i2c-gate { accelerometer@18 { compatible = "bosch,bma222e"; reg = <0x18>; - mount-matrix = "0", "1", "0", - "-1", "0", "0", + mount-matrix = "0", "-1", "0", + "1", "0", "0", "0", "0", "1"; vddio-supply = <&ab8500_ldo_aux2_reg>; // 1.8V vdd-supply = <&ab8500_ldo_aux1_reg>; // 3V
From: Guo Mengqi guomengqi3@huawei.com
[ Upstream commit 917e43de2a56d9b82576f1cc94748261f1988458 ]
Add missing clk_disable_unprepare() in synquacer_spi_resume().
Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Guo Mengqi guomengqi3@huawei.com Link: https://lore.kernel.org/r/20220624005614.49434-1-guomengqi3@huawei.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/spi/spi-synquacer.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/spi/spi-synquacer.c b/drivers/spi/spi-synquacer.c index ea706d9629cb..47cbe73137c2 100644 --- a/drivers/spi/spi-synquacer.c +++ b/drivers/spi/spi-synquacer.c @@ -783,6 +783,7 @@ static int __maybe_unused synquacer_spi_resume(struct device *dev)
ret = synquacer_spi_enable(master); if (ret) { + clk_disable_unprepare(sspi->clk); dev_err(dev, "failed to enable spi (%d)\n", ret); return ret; }
From: Liang He windhl@126.com
[ Upstream commit 50b87a32a79bca6e275918a711fb8cc55e16d739 ]
In omapdss_init_fbdev(), of_find_node_by_name() returns a node pointer with its refcount incremented. We should call of_node_put() on it once it is no longer needed.
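As a minimal illustration of the pattern (the node name and consumer below are hypothetical, not taken from the patch):

struct device_node *node;

node = of_find_node_by_name(NULL, "example-node"); /* refcount +1 on success */
if (node)
        use_example_node(node); /* hypothetical consumer */

/* of_node_put() accepts NULL, so an unconditional put is safe. */
of_node_put(node);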
Signed-off-by: Liang He windhl@126.com Message-Id: 20220617145803.4050918-1-windhl@126.com Signed-off-by: Tony Lindgren tony@atomide.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/mach-omap2/display.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/arm/mach-omap2/display.c b/arch/arm/mach-omap2/display.c index 21413a9b7b6c..eb09a25e3b45 100644 --- a/arch/arm/mach-omap2/display.c +++ b/arch/arm/mach-omap2/display.c @@ -211,6 +211,7 @@ static int __init omapdss_init_fbdev(void) node = of_find_node_by_name(NULL, "omap4_padconf_global"); if (node) omap4_dsi_mux_syscon = syscon_node_to_regmap(node); + of_node_put(node);
return 0; }
From: Liang He windhl@126.com
[ Upstream commit 5cdbab96bab314c6f2f5e4e8b8a019181328bf5f ]
In pdata_quirks_init_clocks(), the loop calls of_find_node_by_name() without a corresponding of_node_put().
Signed-off-by: Liang He windhl@126.com Message-Id: 20220618020603.4055792-1-windhl@126.com Signed-off-by: Tony Lindgren tony@atomide.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/mach-omap2/pdata-quirks.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/arm/mach-omap2/pdata-quirks.c b/arch/arm/mach-omap2/pdata-quirks.c index 765809b214e7..bf50acd6b8a3 100644 --- a/arch/arm/mach-omap2/pdata-quirks.c +++ b/arch/arm/mach-omap2/pdata-quirks.c @@ -587,6 +587,8 @@ pdata_quirks_init_clocks(const struct of_device_id *omap_dt_match_table)
of_platform_populate(np, omap_dt_match_table, omap_auxdata_lookup, NULL); + + of_node_put(np); } }
From: Hans de Goede hdegoede@redhat.com
[ Upstream commit 0dd6db359e5f206cbf1dd1fd40dd211588cd2725 ]
Somehow the "ThinkPad X1 Carbon 6th" entry ended up twice in the struct dmi_system_id acpi_ec_no_wakeup[] array. Remove one of the entries.
Signed-off-by: Hans de Goede hdegoede@redhat.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/acpi/ec.c | 7 ------- 1 file changed, 7 deletions(-)
diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c index 9b859ff976e8..d41fa0f3a1ac 100644 --- a/drivers/acpi/ec.c +++ b/drivers/acpi/ec.c @@ -2167,13 +2167,6 @@ static const struct dmi_system_id acpi_ec_no_wakeup[] = { DMI_MATCH(DMI_PRODUCT_FAMILY, "Thinkpad X1 Carbon 6th"), }, }, - { - .ident = "ThinkPad X1 Carbon 6th", - .matches = { - DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), - DMI_MATCH(DMI_PRODUCT_FAMILY, "ThinkPad X1 Carbon 6th"), - }, - }, { .ident = "ThinkPad X1 Yoga 3rd", .matches = {
From: Hans de Goede hdegoede@redhat.com
[ Upstream commit f7090e0ef360d674f08a22fab90e4e209fb1f658 ]
It seems that these quirks are no longer necessary since commit 69b957c26b32 ("ACPI: EC: Fix possible issues related to EC initialization order"), which has fixed this in a generic manner.
There are 3 commits adding DMI entries with this quirk (multiple DMI entries per commit). Two of the three commits are from before the generic fix.
Which leaves commit 6306f0431914 ("ACPI: EC: Make more Asus laptops use ECDT _GPE"), which was committed way after the generic fix. But this was just due to slow upstreaming of it. This commit stems from Endless from 15 Aug 2017 (committed upstream 20 May 2021): https://github.com/endlessm/linux/pull/288
The current code should work fine without this:
1. The EC_FLAGS_IGNORE_DSDT_GPE flag is only checked in ec_parse_device(), like this:
if (boot_ec && boot_ec_is_ecdt && EC_FLAGS_IGNORE_DSDT_GPE) {
        ec->gpe = boot_ec->gpe;
} else {
        /* parse GPE */
}
2. ec_parse_device() is only called from acpi_ec_add() and acpi_ec_dsdt_probe()
3. acpi_ec_dsdt_probe() starts with:
if (boot_ec) return;
so it only calls ec_parse_device() when boot_ec == NULL, meaning that the quirk never triggers for this call. So only the call in acpi_ec_add() matters.
4. acpi_ec_add() does the following after the ec_parse_device() call:
if (boot_ec && ec->command_addr == boot_ec->command_addr &&
    ec->data_addr == boot_ec->data_addr &&
    !EC_FLAGS_TRUST_DSDT_GPE) {
        /*
         * Trust PNP0C09 namespace location rather than
         * ECDT ID. But trust ECDT GPE rather than _GPE
         * because of ASUS quirks, so do not change
         * boot_ec->gpe to ec->gpe.
         */
        boot_ec->handle = ec->handle;
        acpi_handle_debug(ec->handle, "duplicated.\n");
        acpi_ec_free(ec);
        ec = boot_ec;
}
The quirk only matters if boot_ec != NULL and EC_FLAGS_TRUST_DSDT_GPE is never set at the same time as EC_FLAGS_IGNORE_DSDT_GPE.
That means that if the addresses match, we always enter this if block. Only the ec->handle part of the data stored in ec by ec_parse_device() is then used and the rest is thrown away, after which ec is made to point to boot_ec. At that point ec->gpe == boot_ec->gpe, so the result is the same as with the quirk set, independent of the value of the quirk.
Also note the comment in this block which indicates that the gpe result from ec_parse_device() is deliberately not taken to deal with buggy Asus laptops and all DMI quirks setting EC_FLAGS_IGNORE_DSDT_GPE are for Asus laptops.
Based on the above, I believe we can drop the quirk, unless the ECDT and DSDT EC addresses do not match on some quirked laptops.
I've checked dmesg output to ensure the ECDT and DSDT EC addresses match for quirked models using https://linux-hardware.org hw-probe reports.
I've been able to confirm that the addresses match for the following models this way: GL702VMK, X505BA, X505BP, X550VXK, X580VD. Whereas for the following models I could not find any dmesg output: FX502VD, FX502VE, X542BA, X542BP.
Note that the models without dmesg output were all submitted in patches covering a batch of models, and the other models from the same batch check out OK.
This, combined with that all the code adding the quirks was written before the generic fix makes me believe that it is safe to remove this quirk now.
Signed-off-by: Hans de Goede hdegoede@redhat.com Reviewed-by: Daniel Drake drake@endlessos.org Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/acpi/ec.c | 75 ++++++----------------------------------------- 1 file changed, 9 insertions(+), 66 deletions(-)
diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c index d41fa0f3a1ac..4e583a8cb562 100644 --- a/drivers/acpi/ec.c +++ b/drivers/acpi/ec.c @@ -183,7 +183,6 @@ static struct workqueue_struct *ec_wq; static struct workqueue_struct *ec_query_wq;
static int EC_FLAGS_CORRECT_ECDT; /* Needs ECDT port address correction */ -static int EC_FLAGS_IGNORE_DSDT_GPE; /* Needs ECDT GPE as correction setting */ static int EC_FLAGS_TRUST_DSDT_GPE; /* Needs DSDT GPE as correction setting */ static int EC_FLAGS_CLEAR_ON_RESUME; /* Needs acpi_ec_clear() on boot/resume */
@@ -1392,24 +1391,16 @@ ec_parse_device(acpi_handle handle, u32 Level, void *context, void **retval) if (ec->data_addr == 0 || ec->command_addr == 0) return AE_OK;
- if (boot_ec && boot_ec_is_ecdt && EC_FLAGS_IGNORE_DSDT_GPE) { - /* - * Always inherit the GPE number setting from the ECDT - * EC. - */ - ec->gpe = boot_ec->gpe; - } else { - /* Get GPE bit assignment (EC events). */ - /* TODO: Add support for _GPE returning a package */ - status = acpi_evaluate_integer(handle, "_GPE", NULL, &tmp); - if (ACPI_SUCCESS(status)) - ec->gpe = tmp; + /* Get GPE bit assignment (EC events). */ + /* TODO: Add support for _GPE returning a package */ + status = acpi_evaluate_integer(handle, "_GPE", NULL, &tmp); + if (ACPI_SUCCESS(status)) + ec->gpe = tmp; + /* + * Errors are non-fatal, allowing for ACPI Reduced Hardware + * platforms which use GpioInt instead of GPE. + */
- /* - * Errors are non-fatal, allowing for ACPI Reduced Hardware - * platforms which use GpioInt instead of GPE. - */ - } /* Use the global lock for all EC transactions? */ tmp = 0; acpi_evaluate_integer(handle, "_GLK", NULL, &tmp); @@ -1847,60 +1838,12 @@ static int ec_honor_dsdt_gpe(const struct dmi_system_id *id) return 0; }
-/* - * Some DSDTs contain wrong GPE setting. - * Asus FX502VD/VE, GL702VMK, X550VXK, X580VD - * https://bugzilla.kernel.org/show_bug.cgi?id=195651 - */ -static int ec_honor_ecdt_gpe(const struct dmi_system_id *id) -{ - pr_debug("Detected system needing ignore DSDT GPE setting.\n"); - EC_FLAGS_IGNORE_DSDT_GPE = 1; - return 0; -} - static const struct dmi_system_id ec_dmi_table[] __initconst = { { ec_correct_ecdt, "MSI MS-171F", { DMI_MATCH(DMI_SYS_VENDOR, "Micro-Star"), DMI_MATCH(DMI_PRODUCT_NAME, "MS-171F"),}, NULL}, { - ec_honor_ecdt_gpe, "ASUS FX502VD", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "FX502VD"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUS FX502VE", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "FX502VE"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUS GL702VMK", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "GL702VMK"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUSTeK COMPUTER INC. X505BA", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "X505BA"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUSTeK COMPUTER INC. X505BP", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "X505BP"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUSTeK COMPUTER INC. X542BA", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "X542BA"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUSTeK COMPUTER INC. X542BP", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "X542BP"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUS X550VXK", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "X550VXK"),}, NULL}, - { - ec_honor_ecdt_gpe, "ASUS X580VD", { - DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."), - DMI_MATCH(DMI_PRODUCT_NAME, "X580VD"),}, NULL}, - { /* https://bugzilla.kernel.org/show_bug.cgi?id=209989 */ ec_honor_dsdt_gpe, "HP Pavilion Gaming Laptop 15-cx0xxx", { DMI_MATCH(DMI_SYS_VENDOR, "HP"),
From: Manyi Li limanyi@uniontech.com
[ Upstream commit 4b7ef7b05afcde44142225c184bf43a0cd9e2178 ]
Commit 821d6f0359b0614792ab8e2fb93b503e25a65079 makes machines produced from 2012 onwards skip saving the NVS region in order to accelerate S3.
However, the Lenovo G40-45, a platform released in 2015, still needs NVS memory saving during S3. Introduce a quirk for this platform.
Signed-off-by: Manyi Li limanyi@uniontech.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/acpi/sleep.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c index 07515139141e..d7194047d256 100644 --- a/drivers/acpi/sleep.c +++ b/drivers/acpi/sleep.c @@ -361,6 +361,14 @@ static const struct dmi_system_id acpisleep_dmi_table[] __initconst = { DMI_MATCH(DMI_PRODUCT_NAME, "80E3"), }, }, + { + .callback = init_nvs_save_s3, + .ident = "Lenovo G40-45", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), + DMI_MATCH(DMI_PRODUCT_NAME, "80E1"), + }, + }, /* * ThinkPad X1 Tablet(2016) cannot do suspend-to-idle using * the Low Power S0 Idle firmware interface (see
From: huhai huhai@kylinos.cn
[ Upstream commit b4f1f61ed5928b1128e60e38d0dffa16966f06dc ]
register_device_clock() misses a check for platform_device_register_simple(). Add a check to fix it.
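The key detail is that platform_device_register_simple() reports failure with ERR_PTR(), not NULL, so the result must be tested with IS_ERR(). A minimal sketch of the pattern (hypothetical device name, not from the patch):

struct platform_device *pdev;

pdev = platform_device_register_simple("example-dev", PLATFORM_DEVID_NONE,
                                       NULL, 0);
if (IS_ERR(pdev))
        return PTR_ERR(pdev); /* a NULL check would miss this failure */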
Signed-off-by: huhai huhai@kylinos.cn Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/acpi/acpi_lpss.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index 30b1f511c2af..f609f9d62efd 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -403,6 +403,9 @@ static int register_device_clock(struct acpi_device *adev, if (!lpss_clk_dev) lpt_register_clock_device();
+ if (IS_ERR(lpss_clk_dev)) + return PTR_ERR(lpss_clk_dev); + clk_data = platform_get_drvdata(lpss_clk_dev); if (!clk_data) return -ENODEV;
From: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
[ Upstream commit ae500b351ab0006d933d804a2b7507fe1e98cecc ]
The trigger type should be LEVEL_HIGH. So fix it!
Signed-off-by: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org Signed-off-by: Bjorn Andersson bjorn.andersson@linaro.org Link: https://lore.kernel.org/r/20220530080842.37024-2-manivannan.sadhasivam@linar... Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/boot/dts/qcom-sdx55.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/boot/dts/qcom-sdx55.dtsi b/arch/arm/boot/dts/qcom-sdx55.dtsi index b5b784c5c65e..0e76d03087fe 100644 --- a/arch/arm/boot/dts/qcom-sdx55.dtsi +++ b/arch/arm/boot/dts/qcom-sdx55.dtsi @@ -205,7 +205,7 @@ gcc: clock-controller@100000 { blsp1_uart3: serial@831000 { compatible = "qcom,msm-uartdm-v1.4", "qcom,msm-uartdm"; reg = <0x00831000 0x200>; - interrupts = <GIC_SPI 26 IRQ_TYPE_LEVEL_LOW>; + interrupts = <GIC_SPI 26 IRQ_TYPE_LEVEL_HIGH>; clocks = <&gcc 30>, <&gcc 9>; clock-names = "core", "iface";
From: Robert Marko robimarko@gmail.com
[ Upstream commit b39961659ffc3c3a9e3d0d43b0476547b5f35d49 ]
Per schema it should be nand-controller@79b0000 instead of nand@79b0000. Fix it to match nand-controller.yaml requirements.
Signed-off-by: Robert Marko robimarko@gmail.com Reviewed-by: Krzysztof Kozlowski krzysztof.kozlowski@linaro.org Signed-off-by: Bjorn Andersson bjorn.andersson@linaro.org Link: https://lore.kernel.org/r/20220621120642.518575-1-robimarko@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm64/boot/dts/qcom/ipq8074.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/qcom/ipq8074.dtsi b/arch/arm64/boot/dts/qcom/ipq8074.dtsi index fea415302408..6b9ac0550490 100644 --- a/arch/arm64/boot/dts/qcom/ipq8074.dtsi +++ b/arch/arm64/boot/dts/qcom/ipq8074.dtsi @@ -437,7 +437,7 @@ qpic_bam: dma-controller@7984000 { status = "disabled"; };
- qpic_nand: nand@79b0000 { + qpic_nand: nand-controller@79b0000 { compatible = "qcom,ipq8074-nand"; reg = <0x079b0000 0x10000>; #address-cells = <1>;
From: Samuel Holland samuel@sholland.org
[ Upstream commit b8eb2df19fbf97aa1e950cf491232c2e3bef8357 ]
"status" does not match any pattern in the gpio-leds binding. Rename the node to the preferred pattern. This fixes a `make dtbs_check` error.
Signed-off-by: Samuel Holland samuel@sholland.org Reviewed-by: Jernej Skrabec jernej.skrabec@gmail.com Signed-off-by: Jernej Skrabec jernej.skrabec@gmail.com Link: https://lore.kernel.org/r/20220702132816.46456-1-samuel@sholland.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts index 097a5511523a..09eee653d5ca 100644 --- a/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts @@ -40,7 +40,7 @@ hdmi_con_in: endpoint { leds { compatible = "gpio-leds";
- status { + led-0 { label = "orangepi:green:status"; gpios = <&pio 7 11 GPIO_ACTIVE_HIGH>; /* PH11 */ };
From: Liang He windhl@126.com
[ Upstream commit 75a185fb92e58ccd3670258d8d3b826bd2fa6d29 ]
In rcar_gen2_regulator_quirk(), for_each_matching_node_and_match() automatically increments and decrements the refcount as it iterates. However, we should call of_node_get() for the new reference stored in 'quirk->np', and correspondingly call of_node_put() before 'quirk' is freed.
Signed-off-by: Liang He windhl@126.com Link: https://lore.kernel.org/r/20220701121804.234223-1-windhl@126.com Signed-off-by: Geert Uytterhoeven geert+renesas@glider.be Signed-off-by: Sasha Levin sashal@kernel.org --- arch/arm/mach-shmobile/regulator-quirk-rcar-gen2.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/arm/mach-shmobile/regulator-quirk-rcar-gen2.c b/arch/arm/mach-shmobile/regulator-quirk-rcar-gen2.c index 09ef73b99dd8..ba44cec5e59a 100644 --- a/arch/arm/mach-shmobile/regulator-quirk-rcar-gen2.c +++ b/arch/arm/mach-shmobile/regulator-quirk-rcar-gen2.c @@ -125,6 +125,7 @@ static int regulator_quirk_notify(struct notifier_block *nb,
list_for_each_entry_safe(pos, tmp, &quirk_list, list) { list_del(&pos->list); + of_node_put(pos->np); kfree(pos); }
@@ -174,11 +175,12 @@ static int __init rcar_gen2_regulator_quirk(void) memcpy(&quirk->i2c_msg, id->data, sizeof(quirk->i2c_msg));
quirk->id = id; - quirk->np = np; + quirk->np = of_node_get(np); quirk->i2c_msg.addr = addr;
ret = of_irq_parse_one(np, 0, argsa); if (ret) { /* Skip invalid entry and continue */ + of_node_put(np); kfree(quirk); continue; } @@ -225,6 +227,7 @@ static int __init rcar_gen2_regulator_quirk(void) err_mem: list_for_each_entry_safe(pos, tmp, &quirk_list, list) { list_del(&pos->list); + of_node_put(pos->np); kfree(pos); }
From: Lv Ruyi lv.ruyi@zte.com.cn
[ Upstream commit afcdb8e55c91c6ff0700ab272fd0f74e899ab884 ]
If an error occurs, debugfs_create_file() will return ERR_PTR(-ERROR), so use IS_ERR() to check it.
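The same applies to debugfs_create_dir(); on current kernels neither helper returns NULL on failure. A minimal sketch of the check the patch switches to (hypothetical directory name):

struct dentry *dir;

dir = debugfs_create_dir("example", NULL);
if (IS_ERR(dir))
        return PTR_ERR(dir); /* a NULL check would never fire here */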
Reported-by: Zeal Robot zealci@zte.com.cn Signed-off-by: Lv Ruyi lv.ruyi@zte.com.cn Signed-off-by: Thierry Reding treding@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/firmware/tegra/bpmp-debugfs.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/firmware/tegra/bpmp-debugfs.c b/drivers/firmware/tegra/bpmp-debugfs.c index 3e9fa4b54358..1ed881a567d5 100644 --- a/drivers/firmware/tegra/bpmp-debugfs.c +++ b/drivers/firmware/tegra/bpmp-debugfs.c @@ -465,7 +465,7 @@ static int bpmp_populate_debugfs_inband(struct tegra_bpmp *bpmp, mode |= attrs & DEBUGFS_S_IWUSR ? 0200 : 0; dentry = debugfs_create_file(name, mode, parent, bpmp, &bpmp_debug_fops); - if (!dentry) { + if (IS_ERR(dentry)) { err = -ENOMEM; goto out; } @@ -716,7 +716,7 @@ static int bpmp_populate_dir(struct tegra_bpmp *bpmp, struct seqbuf *seqbuf,
if (t & DEBUGFS_S_ISDIR) { dentry = debugfs_create_dir(name, parent); - if (!dentry) + if (IS_ERR(dentry)) return -ENOMEM; err = bpmp_populate_dir(bpmp, seqbuf, dentry, depth+1); if (err < 0) @@ -729,7 +729,7 @@ static int bpmp_populate_dir(struct tegra_bpmp *bpmp, struct seqbuf *seqbuf, dentry = debugfs_create_file(name, mode, parent, bpmp, &debugfs_fops); - if (!dentry) + if (IS_ERR(dentry)) return -ENOMEM; } } @@ -779,11 +779,11 @@ int tegra_bpmp_init_debugfs(struct tegra_bpmp *bpmp) return 0;
root = debugfs_create_dir("bpmp", NULL); - if (!root) + if (IS_ERR(root)) return -ENOMEM;
bpmp->debugfs_mirror = debugfs_create_dir("debug", root); - if (!bpmp->debugfs_mirror) { + if (IS_ERR(bpmp->debugfs_mirror)) { err = -ENOMEM; goto out; }
From: Armin Wolf W_Armin@gmx.de
[ Upstream commit 385e5f57053ff293282fea84c1c27186d53f66e1 ]
A user reported that the program dell-bios-fan-control worked on his Dell XPS 13 7390 to switch off automatic fan control. Since it uses the same mechanism as the dell_smm_hwmon module, add this model to the fan control whitelist.
Compile-tested only.
Signed-off-by: Armin Wolf W_Armin@gmx.de Acked-by: Pali Rohár pali@kernel.org Link: https://lore.kernel.org/r/20220612041806.11367-1-W_Armin@gmx.de Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hwmon/dell-smm-hwmon.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/hwmon/dell-smm-hwmon.c b/drivers/hwmon/dell-smm-hwmon.c index 9cb1c3588038..597cbb4391bd 100644 --- a/drivers/hwmon/dell-smm-hwmon.c +++ b/drivers/hwmon/dell-smm-hwmon.c @@ -1198,6 +1198,14 @@ static const struct dmi_system_id i8k_whitelist_fan_control[] __initconst = { }, .driver_data = (void *)&i8k_fan_control_data[I8K_FAN_34A3_35A3], }, + { + .ident = "Dell XPS 13 7390", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "XPS 13 7390"), + }, + .driver_data = (void *)&i8k_fan_control_data[I8K_FAN_34A3_35A3], + }, { } };
From: Uwe Kleine-König u.kleine-koenig@pengutronix.de
[ Upstream commit 7d4edccc9bbfe1dcdff641343f7b0c6763fbe774 ]
Taking a lock at the beginning of .remove() doesn't prevent new readers. With the existing approach it can happen that a read starts just as the lock is taken; the reader is then blocked until the lock is released at the end of the remove callback, and afterwards accesses *data that has already been freed.

To actually fix this problem the hwmon core needs some adaptation. Until that is implemented, take the optimistic approach of assuming that all readers are gone after hwmon_device_unregister() and sysfs_remove_group(), as most other drivers do. (And once the core implements that, taking the lock would deadlock.)

So drop the lock, and move the reset to after device unregistration to keep the device in a workable state until it is deregistered. Also add an error message in case the reset fails, and return 0 anyhow. (Returning an error code doesn't stop the platform device unregistration and only results in a not very helpful error message before the devm cleanup handlers are called.)
Signed-off-by: Uwe Kleine-König u.kleine-koenig@pengutronix.de Link: https://lore.kernel.org/r/20220725194344.150098-1-u.kleine-koenig@pengutroni... Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hwmon/sht15.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-)
diff --git a/drivers/hwmon/sht15.c b/drivers/hwmon/sht15.c index 7f4a63959730..ae4d14257a11 100644 --- a/drivers/hwmon/sht15.c +++ b/drivers/hwmon/sht15.c @@ -1020,25 +1020,20 @@ static int sht15_probe(struct platform_device *pdev) static int sht15_remove(struct platform_device *pdev) { struct sht15_data *data = platform_get_drvdata(pdev); + int ret;
- /* - * Make sure any reads from the device are done and - * prevent new ones beginning - */ - mutex_lock(&data->read_lock); - if (sht15_soft_reset(data)) { - mutex_unlock(&data->read_lock); - return -EFAULT; - } hwmon_device_unregister(data->hwmon_dev); sysfs_remove_group(&pdev->dev.kobj, &sht15_attr_group); + + ret = sht15_soft_reset(data); + if (ret) + dev_err(&pdev->dev, "Failed to reset device (%pe)\n", ERR_PTR(ret)); + if (!IS_ERR(data->reg)) { regulator_unregister_notifier(data->reg, &data->nb); regulator_disable(data->reg); }
- mutex_unlock(&data->read_lock); - return 0; }
From: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
[ Upstream commit 8386c414e27caba8501119948e9551e52b527f59 ]
syzbot is reporting a hung task at misc_open() [1]: there is a race window for an AB-BA deadlock involving the probe_count variable. Currently wait_for_device_probe(), called from snapshot_open() from misc_open(), can sleep forever with misc_mtx held if probe_count cannot become 0.
When a device is probed by hub_event() work function, probe_count is incremented before the probe function starts, and probe_count is decremented after the probe function completed.
There are three cases that can prevent probe_count from dropping to 0.
(a) A device being probed stopped responding (i.e. broken/malicious hardware).
(b) A process emulating a USB device using /dev/raw-gadget interface stopped responding for some reason.
(c) New device probe requests keeps coming in before existing device probe requests complete.
The phenomenon syzbot is reporting is (b). A process which is holding system_transition_mutex and misc_mtx is waiting for probe_count to become 0 inside wait_for_device_probe(), but the probe function which is called from hub_event() work function is waiting for the processes which are blocked at mutex_lock(&misc_mtx) to respond via /dev/raw-gadget interface.
This patch mitigates (b) by deferring wait_for_device_probe() from snapshot_open() to snapshot_write() and snapshot_ioctl(). Please note that the possibility of (b) remains as long as any thread which is emulating a USB device via /dev/raw-gadget interface can be blocked by uninterruptible blocking operations (e.g. mutex_lock()).
Please also note that (a) and (c) are not addressed. Regarding (c), we should change the code to wait for only one device which contains the image for resuming from hibernation. I don't know how to address (a), for use of timeout for wait_for_device_probe() might result in loss of user data in the image. Maybe we should require the userland to wait for the image device before opening /dev/snapshot interface.
Link: https://syzkaller.appspot.com/bug?extid=358c9ab4c93da7b7238c [1] Reported-by: syzbot syzbot+358c9ab4c93da7b7238c@syzkaller.appspotmail.com Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Tested-by: syzbot syzbot+358c9ab4c93da7b7238c@syzkaller.appspotmail.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/power/user.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/kernel/power/user.c b/kernel/power/user.c index 740723bb3885..13cca2e2c2bc 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -26,6 +26,7 @@
#include "power.h"
+static bool need_wait;
static struct snapshot_data { struct snapshot_handle handle; @@ -78,7 +79,7 @@ static int snapshot_open(struct inode *inode, struct file *filp) * Resuming. We may need to wait for the image device to * appear. */ - wait_for_device_probe(); + need_wait = true;
data->swap = -1; data->mode = O_WRONLY; @@ -168,6 +169,11 @@ static ssize_t snapshot_write(struct file *filp, const char __user *buf, ssize_t res; loff_t pg_offp = *offp & ~PAGE_MASK;
+ if (need_wait) { + wait_for_device_probe(); + need_wait = false; + } + lock_system_sleep();
data = filp->private_data; @@ -244,6 +250,11 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd, loff_t size; sector_t offset;
+ if (need_wait) { + wait_for_device_probe(); + need_wait = false; + } + if (_IOC_TYPE(cmd) != SNAPSHOT_IOC_MAGIC) return -ENOTTY; if (_IOC_NR(cmd) > SNAPSHOT_IOC_MAXNR)
From: Xiu Jianfeng xiujianfeng@huawei.com
[ Upstream commit 73de1befcc53a7c68b0c5e76b9b5ac41c517760f ]
security_read_state_kernel() directly returns the result of __security_read_policy() without freeing the memory allocated for *data, which causes a memory leak. Free the memory if __security_read_policy() fails.
Signed-off-by: Xiu Jianfeng xiujianfeng@huawei.com [PM: subject line tweak] Signed-off-by: Paul Moore paul@paul-moore.com Signed-off-by: Sasha Levin sashal@kernel.org --- security/selinux/ss/services.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c index c4931bf6f92a..e8035e4876df 100644 --- a/security/selinux/ss/services.c +++ b/security/selinux/ss/services.c @@ -4045,6 +4045,7 @@ int security_read_policy(struct selinux_state *state, int security_read_state_kernel(struct selinux_state *state, void **data, size_t *len) { + int err; struct selinux_policy *policy;
policy = rcu_dereference_protected( @@ -4057,5 +4058,11 @@ int security_read_state_kernel(struct selinux_state *state, if (!*data) return -ENOMEM;
- return __security_read_policy(policy, *data, len); + err = __security_read_policy(policy, *data, len); + if (err) { + vfree(*data); + *data = NULL; + *len = 0; + } + return err; }
From: Xiu Jianfeng xiujianfeng@huawei.com
[ Upstream commit 15ec76fb29be31df2bccb30fc09875274cba2776 ]
Just like in next_entry(), a boundary check is necessary here to prevent out-of-bounds memory access.
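For comparison, next_entry() on the read side already guards its copy the same way; a sketch of its shape (as found in security/selinux/ss/policydb.c, possibly differing slightly between kernel versions):

int next_entry(void *buf, struct policy_file *fp, size_t bytes)
{
        if (bytes > fp->len)
                return -EINVAL;

        memcpy(buf, fp->data, bytes);
        fp->data += bytes;
        fp->len -= bytes;
        return 0;
}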
Signed-off-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Paul Moore paul@paul-moore.com Signed-off-by: Sasha Levin sashal@kernel.org --- security/selinux/ss/policydb.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/security/selinux/ss/policydb.h b/security/selinux/ss/policydb.h index c24d4e1063ea..ffc4e7bad205 100644 --- a/security/selinux/ss/policydb.h +++ b/security/selinux/ss/policydb.h @@ -370,6 +370,8 @@ static inline int put_entry(const void *buf, size_t bytes, int num, struct polic { size_t len = bytes * num;
+ if (len > fp->len) + return -EINVAL; memcpy(fp->data, buf, len); fp->data += len; fp->len -= len;
From: Pavel Begunkov asml.silence@gmail.com
[ Upstream commit 1b4b2b09d4fb451029b112f17d34792e0277aeb2 ]
We should not append MSG_ZEROCOPY requests to an skbuff with a non-MSG_ZEROCOPY ubuf_info; they might not be compatible.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/skbuff.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5ebef94e14dc..891db074981e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1213,6 +1213,10 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size, const u32 byte_limit = 1 << 19; /* limit to a few TSO */ u32 bytelen, next;
+ /* there might be non MSG_ZEROCOPY users */ + if (uarg->callback != msg_zerocopy_callback) + return NULL; + /* realloc only when socket is locked (TCP, UDP cork), * so uarg->len and sk_zckey access is serialized */
On Sun, 7 Aug 2022 21:35:48 -0400 Sasha Levin wrote:
From: Pavel Begunkov asml.silence@gmail.com
[ Upstream commit 1b4b2b09d4fb451029b112f17d34792e0277aeb2 ]
We should not append MSG_ZEROCOPY requests to an skbuff with a non-MSG_ZEROCOPY ubuf_info; they might not be compatible.
This is prep for later patches which add a different type of zerocopy. No need to backport.
From: Kees Cook keescook@chromium.org
[ Upstream commit aaf50b1969d7933a51ea421b11432a7fb90974e3 ]
GCC 12 continues to get smarter about array accesses. The KASAN tests are expecting to explicitly test out-of-bounds conditions at run-time, so hide the variable from GCC, to avoid warnings like:
../lib/test_kasan.c: In function 'ksize_uaf':
../lib/test_kasan.c:790:61: warning: array subscript 120 is outside array bounds of 'void[120]' [-Warray-bounds]
  790 |         KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size]);
      |                                       ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
../lib/test_kasan.c:97:9: note: in definition of macro 'KUNIT_EXPECT_KASAN_FAIL'
   97 |         expression; \
      |         ^~~~~~~~~~
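OPTIMIZER_HIDE_VAR() launders the variable through an empty inline asm statement so the compiler can no longer track where the pointer came from; its definition (from include/linux/compiler.h, shown here for reference) is simply:

#define OPTIMIZER_HIDE_VAR(var)                                         \
        __asm__ ("" : "=r" (var) : "0" (var))

After the pointer passes through the asm, GCC cannot prove which allocation it refers to, so the deliberately out-of-bounds accesses in the tests no longer trigger -Warray-bounds at compile time while still behaving identically at run time.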
Cc: Andrey Ryabinin ryabinin.a.a@gmail.com Cc: Alexander Potapenko glider@google.com Cc: Andrey Konovalov andreyknvl@gmail.com Cc: Dmitry Vyukov dvyukov@google.com Cc: Vincenzo Frascino vincenzo.frascino@arm.com Cc: kasan-dev@googlegroups.com Signed-off-by: Kees Cook keescook@chromium.org Link: https://lore.kernel.org/r/20220608214024.1068451-1-keescook@chromium.org Signed-off-by: Sasha Levin sashal@kernel.org --- lib/test_kasan.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/lib/test_kasan.c b/lib/test_kasan.c index 8835e0784578..89f444cabd4a 100644 --- a/lib/test_kasan.c +++ b/lib/test_kasan.c @@ -125,6 +125,7 @@ static void kmalloc_oob_right(struct kunit *test) ptr = kmalloc(size, GFP_KERNEL); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+ OPTIMIZER_HIDE_VAR(ptr); /* * An unaligned access past the requested kmalloc size. * Only generic KASAN can precisely detect these. @@ -153,6 +154,7 @@ static void kmalloc_oob_left(struct kunit *test) ptr = kmalloc(size, GFP_KERNEL); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+ OPTIMIZER_HIDE_VAR(ptr); KUNIT_EXPECT_KASAN_FAIL(test, *ptr = *(ptr - 1)); kfree(ptr); } @@ -165,6 +167,7 @@ static void kmalloc_node_oob_right(struct kunit *test) ptr = kmalloc_node(size, GFP_KERNEL, 0); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+ OPTIMIZER_HIDE_VAR(ptr); KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = ptr[size]); kfree(ptr); } @@ -185,6 +188,7 @@ static void kmalloc_pagealloc_oob_right(struct kunit *test) ptr = kmalloc(size, GFP_KERNEL); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+ OPTIMIZER_HIDE_VAR(ptr); KUNIT_EXPECT_KASAN_FAIL(test, ptr[size + OOB_TAG_OFF] = 0);
kfree(ptr); @@ -265,6 +269,7 @@ static void kmalloc_large_oob_right(struct kunit *test) ptr = kmalloc(size, GFP_KERNEL); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+ OPTIMIZER_HIDE_VAR(ptr); KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 0); kfree(ptr); } @@ -404,6 +409,8 @@ static void kmalloc_oob_16(struct kunit *test) ptr2 = kmalloc(sizeof(*ptr2), GFP_KERNEL); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
+ OPTIMIZER_HIDE_VAR(ptr1); + OPTIMIZER_HIDE_VAR(ptr2); KUNIT_EXPECT_KASAN_FAIL(test, *ptr1 = *ptr2); kfree(ptr1); kfree(ptr2); @@ -712,6 +719,8 @@ static void ksize_unpoisons_memory(struct kunit *test) KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); real_size = ksize(ptr);
+ OPTIMIZER_HIDE_VAR(ptr); + /* This access shouldn't trigger a KASAN report. */ ptr[size] = 'x';
@@ -734,6 +743,7 @@ static void ksize_uaf(struct kunit *test) KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); kfree(ptr);
+ OPTIMIZER_HIDE_VAR(ptr); KUNIT_EXPECT_KASAN_FAIL(test, ksize(ptr)); KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[0]); KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size]);