The patch titled
Subject: mm: resolve faulty mmap_region() error path behaviour
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: resolve faulty mmap_region() error path behaviour
Date: Tue, 29 Oct 2024 18:11:48 +0000
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().
Doing this allows us to perform a number of checks up front before we do
any real work, and allows us to unwind the writable unmap check
unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
validation unconditionally also.
We move a number of things here:
1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.
2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.
We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
doing so does not invalidate the seal check we just performed, and we in
any case always decrement the counter in the wrapper.
We perform a debug assert to ensure a driver does not attempt to do the
opposite.
3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.
With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.
This eliminates an entire class of errors, makes the code easier to reason
about and more robust.
Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.17302246…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Tested-by: Mark Brown <broonie(a)kernel.org>
Cc: Andreas Larsson <andreas(a)gaisler.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Helge Deller <deller(a)gmx.de>
Cc: James E.J. Bottomley <James.Bottomley(a)HansenPartnership.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 121 ++++++++++++++++++++++++++++------------------------
1 file changed, 66 insertions(+), 55 deletions(-)
--- a/mm/mmap.c~mm-resolve-faulty-mmap_region-error-path-behaviour
+++ a/mm/mmap.c
@@ -1358,20 +1358,18 @@ int do_munmap(struct mm_struct *mm, unsi
return do_vmi_munmap(&vmi, mm, start, len, uf, false);
}
-unsigned long mmap_region(struct file *file, unsigned long addr,
+static unsigned long __mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
pgoff_t pglen = PHYS_PFN(len);
- struct vm_area_struct *merge;
unsigned long charged = 0;
struct vma_munmap_struct vms;
struct ma_state mas_detach;
struct maple_tree mt_detach;
unsigned long end = addr + len;
- bool writable_file_mapping = false;
int error;
VMA_ITERATOR(vmi, mm, addr);
VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
@@ -1445,28 +1443,26 @@ unsigned long mmap_region(struct file *f
vm_flags_init(vma, vm_flags);
vma->vm_page_prot = vm_get_page_prot(vm_flags);
+ if (vma_iter_prealloc(&vmi, vma)) {
+ error = -ENOMEM;
+ goto free_vma;
+ }
+
if (file) {
vma->vm_file = get_file(file);
error = mmap_file(file, vma);
if (error)
- goto unmap_and_free_vma;
-
- if (vma_is_shared_maywrite(vma)) {
- error = mapping_map_writable(file->f_mapping);
- if (error)
- goto close_and_free_vma;
-
- writable_file_mapping = true;
- }
+ goto unmap_and_free_file_vma;
+ /* Drivers cannot alter the address of the VMA. */
+ WARN_ON_ONCE(addr != vma->vm_start);
/*
- * Expansion is handled above, merging is handled below.
- * Drivers should not alter the address of the VMA.
+ * Drivers should not permit writability when previously it was
+ * disallowed.
*/
- if (WARN_ON((addr != vma->vm_start))) {
- error = -EINVAL;
- goto close_and_free_vma;
- }
+ VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
+ !(vm_flags & VM_MAYWRITE) &&
+ (vma->vm_flags & VM_MAYWRITE));
vma_iter_config(&vmi, addr, end);
/*
@@ -1474,6 +1470,8 @@ unsigned long mmap_region(struct file *f
* vma again as we may succeed this time.
*/
if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
+ struct vm_area_struct *merge;
+
vmg.flags = vma->vm_flags;
/* If this fails, state is reset ready for a reattempt. */
merge = vma_merge_new_range(&vmg);
@@ -1491,7 +1489,7 @@ unsigned long mmap_region(struct file *f
vma = merge;
/* Update vm_flags to pick up the change. */
vm_flags = vma->vm_flags;
- goto unmap_writable;
+ goto file_expanded;
}
vma_iter_config(&vmi, addr, end);
}
@@ -1500,26 +1498,15 @@ unsigned long mmap_region(struct file *f
} else if (vm_flags & VM_SHARED) {
error = shmem_zero_setup(vma);
if (error)
- goto free_vma;
+ goto free_iter_vma;
} else {
vma_set_anonymous(vma);
}
- if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
- error = -EACCES;
- goto close_and_free_vma;
- }
-
- /* Allow architectures to sanity-check the vm_flags */
- if (!arch_validate_flags(vma->vm_flags)) {
- error = -EINVAL;
- goto close_and_free_vma;
- }
-
- if (vma_iter_prealloc(&vmi, vma)) {
- error = -ENOMEM;
- goto close_and_free_vma;
- }
+#ifdef CONFIG_SPARC64
+ /* TODO: Fix SPARC ADI! */
+ WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+#endif
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
@@ -1533,10 +1520,7 @@ unsigned long mmap_region(struct file *f
*/
khugepaged_enter_vma(vma, vma->vm_flags);
- /* Once vma denies write, undo our temporary denial count */
-unmap_writable:
- if (writable_file_mapping)
- mapping_unmap_writable(file->f_mapping);
+file_expanded:
file = vma->vm_file;
ksm_add_vma(vma);
expanded:
@@ -1569,23 +1553,17 @@ expanded:
vma_set_page_prot(vma);
- validate_mm(mm);
return addr;
-close_and_free_vma:
- vma_close(vma);
-
- if (file || vma->vm_file) {
-unmap_and_free_vma:
- fput(vma->vm_file);
- vma->vm_file = NULL;
-
- vma_iter_set(&vmi, vma->vm_end);
- /* Undo any partial mapping done by a device driver. */
- unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
- }
- if (writable_file_mapping)
- mapping_unmap_writable(file->f_mapping);
+unmap_and_free_file_vma:
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+
+ vma_iter_set(&vmi, vma->vm_end);
+ /* Undo any partial mapping done by a device driver. */
+ unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+free_iter_vma:
+ vma_iter_free(&vmi);
free_vma:
vm_area_free(vma);
unacct_error:
@@ -1595,10 +1573,43 @@ unacct_error:
abort_munmap:
vms_abort_munmap_vmas(&vms, &mas_detach);
gather_failed:
- validate_mm(mm);
return error;
}
+unsigned long mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf)
+{
+ unsigned long ret;
+ bool writable_file_mapping = false;
+
+ /* Check to see if MDWE is applicable. */
+ if (map_deny_write_exec(vm_flags, vm_flags))
+ return -EACCES;
+
+ /* Allow architectures to sanity-check the vm_flags. */
+ if (!arch_validate_flags(vm_flags))
+ return -EINVAL;
+
+ /* Map writable and ensure this isn't a sealed memfd. */
+ if (file && is_shared_maywrite(vm_flags)) {
+ int error = mapping_map_writable(file->f_mapping);
+
+ if (error)
+ return error;
+ writable_file_mapping = true;
+ }
+
+ ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+
+ /* Clear our write mapping regardless of error. */
+ if (writable_file_mapping)
+ mapping_unmap_writable(file->f_mapping);
+
+ validate_mm(current->mm);
+ return ret;
+}
+
static int __vm_munmap(unsigned long start, size_t len, bool unlock)
{
int ret;
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems-fix.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-remove-unnecessary-reset-state-logic-on-merge-new-vma.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
mm-defer-second-attempt-at-merge-on-mmap-fix.patch
mm-pagewalk-add-the-ability-to-install-ptes.patch
mm-add-pte_marker_guard-pte-marker.patch
mm-madvise-implement-lightweight-guard-page-mechanism.patch
tools-testing-update-tools-uapi-header-for-mman-commonh.patch
selftests-mm-add-self-tests-for-guard-page-feature.patch
The patch titled
Subject: mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
Date: Tue, 29 Oct 2024 18:11:47 +0000
Currently MTE is permitted in two circumstances (desiring to use MTE
having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is
specified, as checked by arch_calc_vm_flag_bits() and actualised by
setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is
shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap
hook is activated in mmap_region().
The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also
set is the arm64 implementation of arch_validate_flags().
Unfortunately, we intend to refactor mmap_region() to perform this check
earlier, meaning that in the case of a shmem backing we will not have
invoked shmem_mmap() yet, causing the mapping to fail spuriously.
It is inappropriate to set this architecture-specific flag in general mm
code anyway, so a sensible resolution of this issue is to instead move the
check somewhere else.
We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via
the arch_calc_vm_flag_bits() call.
This is an appropriate place to do this as we already check for the
MAP_ANONYMOUS case here, and the shmem file case is simply a variant of
the same idea - we permit RAM-backed memory.
This requires a modification to the arch_calc_vm_flag_bits() signature to
pass in a pointer to the struct file associated with the mapping, however
this is not too egregious as this is only used by two architectures anyway
- arm64 and parisc.
So this patch performs this adjustment and removes the unnecessary
assignment of VM_MTE_ALLOWED in shmem_mmap().
Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.17302246…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Suggested-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: Jann Horn <jannh(a)google.com>
Cc: Andreas Larsson <andreas(a)gaisler.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Helge Deller <deller(a)gmx.de>
Cc: James E.J. Bottomley <James.Bottomley(a)HansenPartnership.com>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/arm64/include/asm/mman.h | 10 +++++++---
arch/parisc/include/asm/mman.h | 5 +++--
include/linux/mman.h | 7 ++++---
mm/mmap.c | 2 +-
mm/nommu.c | 2 +-
mm/shmem.c | 3 ---
6 files changed, 16 insertions(+), 13 deletions(-)
--- a/arch/arm64/include/asm/mman.h~mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling
+++ a/arch/arm64/include/asm/mman.h
@@ -6,6 +6,8 @@
#ifndef BUILD_VDSO
#include <linux/compiler.h>
+#include <linux/fs.h>
+#include <linux/shmem_fs.h>
#include <linux/types.h>
static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
@@ -31,19 +33,21 @@ static inline unsigned long arch_calc_vm
}
#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
-static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
+static inline unsigned long arch_calc_vm_flag_bits(struct file *file,
+ unsigned long flags)
{
/*
* Only allow MTE on anonymous mappings as these are guaranteed to be
* backed by tags-capable memory. The vm_flags may be overridden by a
* filesystem supporting MTE (RAM-based).
*/
- if (system_supports_mte() && (flags & MAP_ANONYMOUS))
+ if (system_supports_mte() &&
+ ((flags & MAP_ANONYMOUS) || shmem_file(file)))
return VM_MTE_ALLOWED;
return 0;
}
-#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
+#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags)
static inline bool arch_validate_prot(unsigned long prot,
unsigned long addr __always_unused)
--- a/arch/parisc/include/asm/mman.h~mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling
+++ a/arch/parisc/include/asm/mman.h
@@ -2,6 +2,7 @@
#ifndef __ASM_MMAN_H__
#define __ASM_MMAN_H__
+#include <linux/fs.h>
#include <uapi/asm/mman.h>
/* PARISC cannot allow mdwe as it needs writable stacks */
@@ -11,7 +12,7 @@ static inline bool arch_memory_deny_writ
}
#define arch_memory_deny_write_exec_supported arch_memory_deny_write_exec_supported
-static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
+static inline unsigned long arch_calc_vm_flag_bits(struct file *file, unsigned long flags)
{
/*
* The stack on parisc grows upwards, so if userspace requests memory
@@ -23,6 +24,6 @@ static inline unsigned long arch_calc_vm
return 0;
}
-#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
+#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags)
#endif /* __ASM_MMAN_H__ */
--- a/include/linux/mman.h~mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling
+++ a/include/linux/mman.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_MMAN_H
#define _LINUX_MMAN_H
+#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/percpu_counter.h>
@@ -94,7 +95,7 @@ static inline void vm_unacct_memory(long
#endif
#ifndef arch_calc_vm_flag_bits
-#define arch_calc_vm_flag_bits(flags) 0
+#define arch_calc_vm_flag_bits(file, flags) 0
#endif
#ifndef arch_validate_prot
@@ -151,13 +152,13 @@ calc_vm_prot_bits(unsigned long prot, un
* Combine the mmap "flags" argument into "vm_flags" used internally.
*/
static inline unsigned long
-calc_vm_flag_bits(unsigned long flags)
+calc_vm_flag_bits(struct file *file, unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
_calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) |
- arch_calc_vm_flag_bits(flags);
+ arch_calc_vm_flag_bits(file, flags);
}
unsigned long vm_commit_limit(void);
--- a/mm/mmap.c~mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling
+++ a/mm/mmap.c
@@ -344,7 +344,7 @@ unsigned long do_mmap(struct file *file,
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
- vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(file, flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
/* Obtain the address to map to. we verify (or select) it and ensure
--- a/mm/nommu.c~mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling
+++ a/mm/nommu.c
@@ -842,7 +842,7 @@ static unsigned long determine_vm_flags(
{
unsigned long vm_flags;
- vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
+ vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(file, flags);
if (!file) {
/*
--- a/mm/shmem.c~mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling
+++ a/mm/shmem.c
@@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file,
if (ret)
return ret;
- /* arm64 - allow memory tagging on RAM-based files */
- vm_flags_set(vma, VM_MTE_ALLOWED);
-
file_accessed(file);
/* This is anonymous shared memory if it is unlinked at the time of mmap */
if (inode->i_nlink)
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems-fix.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-remove-unnecessary-reset-state-logic-on-merge-new-vma.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
mm-defer-second-attempt-at-merge-on-mmap-fix.patch
mm-pagewalk-add-the-ability-to-install-ptes.patch
mm-add-pte_marker_guard-pte-marker.patch
mm-madvise-implement-lightweight-guard-page-mechanism.patch
tools-testing-update-tools-uapi-header-for-mman-commonh.patch
selftests-mm-add-self-tests-for-guard-page-feature.patch
The patch titled
Subject: mm: refactor map_deny_write_exec()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-refactor-map_deny_write_exec.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: refactor map_deny_write_exec()
Date: Tue, 29 Oct 2024 18:11:46 +0000
Refactor the map_deny_write_exec() to not unnecessarily require a VMA
parameter but rather to accept VMA flags parameters, which allows us to
use this function early in mmap_region() in a subsequent commit.
While we're here, we refactor the function to be more readable and add
some additional documentation.
Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.17302246…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Jann Horn <jannh(a)google.com>
Cc: Andreas Larsson <andreas(a)gaisler.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Helge Deller <deller(a)gmx.de>
Cc: James E.J. Bottomley <James.Bottomley(a)HansenPartnership.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mman.h | 21 ++++++++++++++++++---
mm/mmap.c | 2 +-
mm/mprotect.c | 2 +-
mm/vma.h | 2 +-
4 files changed, 21 insertions(+), 6 deletions(-)
--- a/include/linux/mman.h~mm-refactor-map_deny_write_exec
+++ a/include/linux/mman.h
@@ -188,16 +188,31 @@ static inline bool arch_memory_deny_writ
*
* d) mmap(PROT_READ | PROT_EXEC)
* mmap(PROT_READ | PROT_EXEC | PROT_BTI)
+ *
+ * This is only applicable if the user has set the Memory-Deny-Write-Execute
+ * (MDWE) protection mask for the current process.
+ *
+ * @old specifies the VMA flags the VMA originally possessed, and @new the ones
+ * we propose to set.
+ *
+ * Return: false if proposed change is OK, true if not ok and should be denied.
*/
-static inline bool map_deny_write_exec(struct vm_area_struct *vma, unsigned long vm_flags)
+static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
{
+ /* If MDWE is disabled, we have nothing to deny. */
if (!test_bit(MMF_HAS_MDWE, ¤t->mm->flags))
return false;
- if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
+ /* If the new VMA is not executable, we have nothing to deny. */
+ if (!(new & VM_EXEC))
+ return false;
+
+ /* Under MDWE we do not accept newly writably executable VMAs... */
+ if (new & VM_WRITE)
return true;
- if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
+ /* ...nor previously non-executable VMAs becoming executable. */
+ if (!(old & VM_EXEC))
return true;
return false;
--- a/mm/mmap.c~mm-refactor-map_deny_write_exec
+++ a/mm/mmap.c
@@ -1505,7 +1505,7 @@ unsigned long mmap_region(struct file *f
vma_set_anonymous(vma);
}
- if (map_deny_write_exec(vma, vma->vm_flags)) {
+ if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
error = -EACCES;
goto close_and_free_vma;
}
--- a/mm/mprotect.c~mm-refactor-map_deny_write_exec
+++ a/mm/mprotect.c
@@ -810,7 +810,7 @@ static int do_mprotect_pkey(unsigned lon
break;
}
- if (map_deny_write_exec(vma, newflags)) {
+ if (map_deny_write_exec(vma->vm_flags, newflags)) {
error = -EACCES;
break;
}
--- a/mm/vma.h~mm-refactor-map_deny_write_exec
+++ a/mm/vma.h
@@ -42,7 +42,7 @@ struct vma_munmap_struct {
int vma_count; /* Number of vmas that will be removed */
bool unlock; /* Unlock after the munmap */
bool clear_ptes; /* If there are outstanding PTE to be cleared */
- /* 1 byte hole */
+ /* 2 byte hole */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems-fix.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-remove-unnecessary-reset-state-logic-on-merge-new-vma.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
mm-defer-second-attempt-at-merge-on-mmap-fix.patch
mm-pagewalk-add-the-ability-to-install-ptes.patch
mm-add-pte_marker_guard-pte-marker.patch
mm-madvise-implement-lightweight-guard-page-mechanism.patch
tools-testing-update-tools-uapi-header-for-mman-commonh.patch
selftests-mm-add-self-tests-for-guard-page-feature.patch
The patch titled
Subject: mm: unconditionally close VMAs on error
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-unconditionally-close-vmas-on-error.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: unconditionally close VMAs on error
Date: Tue, 29 Oct 2024 18:11:45 +0000
Incorrect invocation of VMA callbacks when the VMA is no longer in a
consistent state is bug prone and risky to perform.
With regards to the important vm_ops->close() callback We have gone to
great lengths to try to track whether or not we ought to close VMAs.
Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.
We introduce a new function to do so - vma_close() - and simplify existing
vms logic which tracked whether we needed to close or not.
This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.17302246…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Jann Horn <jannh(a)google.com>
Cc: Andreas Larsson <andreas(a)gaisler.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Helge Deller <deller(a)gmx.de>
Cc: James E.J. Bottomley <James.Bottomley(a)HansenPartnership.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/internal.h | 18 ++++++++++++++++++
mm/mmap.c | 5 ++---
mm/nommu.c | 3 +--
mm/vma.c | 14 +++++---------
mm/vma.h | 4 +---
5 files changed, 27 insertions(+), 17 deletions(-)
--- a/mm/internal.h~mm-unconditionally-close-vmas-on-error
+++ a/mm/internal.h
@@ -135,6 +135,24 @@ static inline int mmap_file(struct file
return err;
}
+/*
+ * If the VMA has a close hook then close it, and since closing it might leave
+ * it in an inconsistent state which makes the use of any hooks suspect, clear
+ * them down by installing dummy empty hooks.
+ */
+static inline void vma_close(struct vm_area_struct *vma)
+{
+ if (vma->vm_ops && vma->vm_ops->close) {
+ vma->vm_ops->close(vma);
+
+ /*
+ * The mapping is in an inconsistent state, and no further hooks
+ * may be invoked upon it.
+ */
+ vma->vm_ops = &vma_dummy_vm_ops;
+ }
+}
+
#ifdef CONFIG_MMU
/* Flags for folio_pte_batch(). */
--- a/mm/mmap.c~mm-unconditionally-close-vmas-on-error
+++ a/mm/mmap.c
@@ -1573,8 +1573,7 @@ expanded:
return addr;
close_and_free_vma:
- if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (file || vma->vm_file) {
unmap_and_free_vma:
@@ -1934,7 +1933,7 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
+ remove_vma(vma, /* unreachable = */ true);
count++;
cond_resched();
vma = vma_next(&vmi);
--- a/mm/nommu.c~mm-unconditionally-close-vmas-on-error
+++ a/mm/nommu.c
@@ -589,8 +589,7 @@ static int delete_vma_from_mm(struct vm_
*/
static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
{
- if (vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
put_nommu_region(vma->vm_region);
--- a/mm/vma.c~mm-unconditionally-close-vmas-on-error
+++ a/mm/vma.c
@@ -323,11 +323,10 @@ static bool can_vma_merge_right(struct v
/*
* Close a vm structure and free it.
*/
-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
+void remove_vma(struct vm_area_struct *vma, bool unreachable)
{
might_sleep();
- if (!closed && vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
@@ -1115,9 +1114,7 @@ void vms_clean_up_area(struct vma_munmap
vms_clear_ptes(vms, mas_detach, true);
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- if (vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
- vms->closed_vm_ops = true;
+ vma_close(vma);
}
/*
@@ -1160,7 +1157,7 @@ void vms_complete_munmap_vmas(struct vma
/* Remove and clean up vmas */
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- remove_vma(vma, /* = */ false, vms->closed_vm_ops);
+ remove_vma(vma, /* unreachable = */ false);
vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
@@ -1684,8 +1681,7 @@ struct vm_area_struct *copy_vma(struct v
return new_vma;
out_vma_link:
- if (new_vma->vm_ops && new_vma->vm_ops->close)
- new_vma->vm_ops->close(new_vma);
+ vma_close(new_vma);
if (new_vma->vm_file)
fput(new_vma->vm_file);
--- a/mm/vma.h~mm-unconditionally-close-vmas-on-error
+++ a/mm/vma.h
@@ -42,7 +42,6 @@ struct vma_munmap_struct {
int vma_count; /* Number of vmas that will be removed */
bool unlock; /* Unlock after the munmap */
bool clear_ptes; /* If there are outstanding PTE to be cleared */
- bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */
/* 1 byte hole */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
@@ -198,7 +197,6 @@ static inline void init_vma_munmap(struc
vms->unmap_start = FIRST_USER_ADDRESS;
vms->unmap_end = USER_PGTABLES_CEILING;
vms->clear_ptes = false;
- vms->closed_vm_ops = false;
}
#endif
@@ -269,7 +267,7 @@ int do_vmi_munmap(struct vma_iterator *v
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
+void remove_vma(struct vm_area_struct *vma, bool unreachable);
void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct vm_area_struct *next);
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems-fix.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-remove-unnecessary-reset-state-logic-on-merge-new-vma.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
mm-defer-second-attempt-at-merge-on-mmap-fix.patch
mm-pagewalk-add-the-ability-to-install-ptes.patch
mm-add-pte_marker_guard-pte-marker.patch
mm-madvise-implement-lightweight-guard-page-mechanism.patch
tools-testing-update-tools-uapi-header-for-mman-commonh.patch
selftests-mm-add-self-tests-for-guard-page-feature.patch
The patch titled
Subject: mm: avoid unsafe VMA hook invocation when error arises on mmap hook
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: avoid unsafe VMA hook invocation when error arises on mmap hook
Date: Tue, 29 Oct 2024 18:11:44 +0000
Patch series "fix error handling in mmap_region() and refactor
(hotfixes)", v4.
mmap_region() is somewhat terrifying, with spaghetti-like control flow and
numerous means by which issues can arise and incomplete state, memory
leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the
VMA is in an inconsistent state.
The patches in this series comprise the minimal changes required to
resolve existing issues in mmap_region() error handling, in order that
they can be hotfixed and backported. There is additionally a follow up
series which goes further, separated out from the v1 series and sent and
updated separately.
This patch (of 5):
After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced,
meaning that we need complicated handling to ensure we do not incorrectly
call these hooks.
We can avoid the whole issue by treating the VMA as suspect the moment
that the file->f_ops->mmap() function reports an error by replacing
whatever VMA operations were installed with a dummy empty set of VMA
operations.
We do so through a new helper function internal to mm - mmap_file() -
which is both more logically named than the existing call_mmap() function
and correctly isolates handling of the vm_op reassignment to mm.
All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.
It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:
ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
coda_file_operations -> mmap -> coda_file_mmap()
shm_file_operations -> shm_mmap()
shm_file_operations_huge -> shm_mmap()
dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
-> i915_gem_dmabuf_mmap()
None of these callers interact with vm_ops or mappings in a problematic
way on error, quickly exiting out.
Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.17302246…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Jann Horn <jannh(a)google.com>
Cc: Andreas Larsson <andreas(a)gaisler.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Helge Deller <deller(a)gmx.de>
Cc: James E.J. Bottomley <James.Bottomley(a)HansenPartnership.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/internal.h | 27 +++++++++++++++++++++++++++
mm/mmap.c | 6 +++---
mm/nommu.c | 4 ++--
3 files changed, 32 insertions(+), 5 deletions(-)
--- a/mm/internal.h~mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook
+++ a/mm/internal.h
@@ -108,6 +108,33 @@ static inline void *folio_raw_mapping(co
return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
}
+/*
+ * This is a file-backed mapping, and is about to be memory mapped - invoke its
+ * mmap hook and safely handle error conditions. On error, VMA hooks will be
+ * mutated.
+ *
+ * @file: File which backs the mapping.
+ * @vma: VMA which we are mapping.
+ *
+ * Returns: 0 if success, error otherwise.
+ */
+static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
+{
+ int err = call_mmap(file, vma);
+
+ if (likely(!err))
+ return 0;
+
+ /*
+ * OK, we tried to call the file hook for mmap(), but an error
+ * arose. The mapping is in an inconsistent state and we most not invoke
+ * any further hooks on it.
+ */
+ vma->vm_ops = &vma_dummy_vm_ops;
+
+ return err;
+}
+
#ifdef CONFIG_MMU
/* Flags for folio_pte_batch(). */
--- a/mm/mmap.c~mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook
+++ a/mm/mmap.c
@@ -1422,7 +1422,7 @@ unsigned long mmap_region(struct file *f
/*
* clear PTEs while the vma is still in the tree so that rmap
* cannot race with the freeing later in the truncate scenario.
- * This is also needed for call_mmap(), which is why vm_ops
+ * This is also needed for mmap_file(), which is why vm_ops
* close function is called.
*/
vms_clean_up_area(&vms, &mas_detach);
@@ -1447,7 +1447,7 @@ unsigned long mmap_region(struct file *f
if (file) {
vma->vm_file = get_file(file);
- error = call_mmap(file, vma);
+ error = mmap_file(file, vma);
if (error)
goto unmap_and_free_vma;
@@ -1470,7 +1470,7 @@ unsigned long mmap_region(struct file *f
vma_iter_config(&vmi, addr, end);
/*
- * If vm_flags changed after call_mmap(), we should try merge
+ * If vm_flags changed after mmap_file(), we should try merge
* vma again as we may succeed this time.
*/
if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
--- a/mm/nommu.c~mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook
+++ a/mm/nommu.c
@@ -885,7 +885,7 @@ static int do_mmap_shared_file(struct vm
{
int ret;
- ret = call_mmap(vma->vm_file, vma);
+ ret = mmap_file(vma->vm_file, vma);
if (ret == 0) {
vma->vm_region->vm_top = vma->vm_region->vm_end;
return 0;
@@ -918,7 +918,7 @@ static int do_mmap_private(struct vm_are
* happy.
*/
if (capabilities & NOMMU_MAP_DIRECT) {
- ret = call_mmap(vma->vm_file, vma);
+ ret = mmap_file(vma->vm_file, vma);
/* shouldn't return success if we're not sharing */
if (WARN_ON_ONCE(!is_nommu_shared_mapping(vma->vm_flags)))
ret = -ENOSYS;
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-refactor-arch_calc_vm_flag_bits-and-arm64-mte-handling.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems-fix.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-remove-unnecessary-reset-state-logic-on-merge-new-vma.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
mm-defer-second-attempt-at-merge-on-mmap-fix.patch
mm-pagewalk-add-the-ability-to-install-ptes.patch
mm-add-pte_marker_guard-pte-marker.patch
mm-madvise-implement-lightweight-guard-page-mechanism.patch
tools-testing-update-tools-uapi-header-for-mman-commonh.patch
selftests-mm-add-self-tests-for-guard-page-feature.patch
Operating stable without reduced chip life at 1Ghz needs several
technologies working: The technologies involve
- SmartReflex
- DVFS
As this cannot directly specified in the OPP table as dependecies in the
devicetree yet, use the turbo flag again to mark this OPP as something
special to have some kind of opt-in.
So revert commit
5f1bf7ae8481 ("ARM: dts: omap36xx: Remove turbo mode for 1GHz variants")
Practical reasoning:
At least the GTA04A5 (DM3730) has become unstable with that OPP enabled.
Furthermore nothing enforces the availability of said technologies,
even in the kernel configuration, so allow users to rather opt-in.
Cc: Stable(a)vger.kernel.org
Fixes: 5f1bf7ae8481 ("ARM: dts: omap36xx: Remove turbo mode for 1GHz variants")
Signed-off-by: Andreas Kemnade <andreas(a)kemnade.info>
---
arch/arm/boot/dts/ti/omap/omap36xx.dtsi | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm/boot/dts/ti/omap/omap36xx.dtsi b/arch/arm/boot/dts/ti/omap/omap36xx.dtsi
index c3d79ecd56e39..c217094b50abc 100644
--- a/arch/arm/boot/dts/ti/omap/omap36xx.dtsi
+++ b/arch/arm/boot/dts/ti/omap/omap36xx.dtsi
@@ -72,6 +72,7 @@ opp-1000000000 {
<1375000 1375000 1375000>;
/* only on am/dm37x with speed-binned bit set */
opp-supported-hw = <0xffffffff 2>;
+ turbo-mode;
};
};
--
2.39.2
When sealing or unsealing a key blob we currently do not wait for
the AEAD cipher operation to finish and simply return after submitting
the request. If there is some load on the system we can exit before
the cipher operation is done and the buffer we read from/write to
is already removed from the stack. This will e.g. result in NULL
pointer dereference errors in the DCP driver during blob creation.
Fix this by waiting for the AEAD cipher operation to finish before
resuming the seal and unseal calls.
Cc: stable(a)vger.kernel.org # v6.10+
Fixes: 0e28bf61a5f9 ("KEYS: trusted: dcp: fix leak of blob encryption key")
Reported-by: Parthiban N <parthiban(a)linumiz.com>
Closes: https://lore.kernel.org/keyrings/254d3bb1-6dbc-48b4-9c08-77df04baee2f@linum…
Signed-off-by: David Gstir <david(a)sigma-star.at>
---
security/keys/trusted-keys/trusted_dcp.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/security/keys/trusted-keys/trusted_dcp.c b/security/keys/trusted-keys/trusted_dcp.c
index 4edc5bbbcda3..e908c53a803c 100644
--- a/security/keys/trusted-keys/trusted_dcp.c
+++ b/security/keys/trusted-keys/trusted_dcp.c
@@ -133,6 +133,7 @@ static int do_aead_crypto(u8 *in, u8 *out, size_t len, u8 *key, u8 *nonce,
struct scatterlist src_sg, dst_sg;
struct crypto_aead *aead;
int ret;
+ DECLARE_CRYPTO_WAIT(wait);
aead = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(aead)) {
@@ -163,8 +164,8 @@ static int do_aead_crypto(u8 *in, u8 *out, size_t len, u8 *key, u8 *nonce,
}
aead_request_set_crypt(aead_req, &src_sg, &dst_sg, len, nonce);
- aead_request_set_callback(aead_req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL,
- NULL);
+ aead_request_set_callback(aead_req, CRYPTO_TFM_REQ_MAY_SLEEP,
+ crypto_req_done, &wait);
aead_request_set_ad(aead_req, 0);
if (crypto_aead_setkey(aead, key, AES_KEYSIZE_128)) {
@@ -174,9 +175,9 @@ static int do_aead_crypto(u8 *in, u8 *out, size_t len, u8 *key, u8 *nonce,
}
if (do_encrypt)
- ret = crypto_aead_encrypt(aead_req);
+ ret = crypto_wait_req(crypto_aead_encrypt(aead_req), &wait);
else
- ret = crypto_aead_decrypt(aead_req);
+ ret = crypto_wait_req(crypto_aead_decrypt(aead_req), &wait);
free_req:
aead_request_free(aead_req);
--
2.47.0
Commit 7c55b78818cf ("jfs: xattr: fix buffer overflow for invalid xattr")
also addresses this issue but it only fixes it for positive values, while
ea_size is an integer type and can take negative values, e.g. in case of
a corrupted filesystem. This still breaks validation and would overflow
because of implicit conversion from int to size_t in print_hex_dump().
Fix this issue by clamping the ea_size value instead.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Cc: stable(a)vger.kernel.org
Signed-off-by: Artem Sadovnikov <ancowi69(a)gmail.com>
---
fs/jfs/xattr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/jfs/xattr.c b/fs/jfs/xattr.c
index 8ef8dfc3c194..95bcbbf7359c 100644
--- a/fs/jfs/xattr.c
+++ b/fs/jfs/xattr.c
@@ -557,7 +557,7 @@ static int ea_get(struct inode *inode, struct ea_buffer *ea_buf, int min_size)
size_check:
if (EALIST_SIZE(ea_buf->xattr) != ea_size) {
- int size = min_t(int, EALIST_SIZE(ea_buf->xattr), ea_size);
+ int size = clamp_t(int, ea_size, 0, EALIST_SIZE(ea_buf->xattr));
printk(KERN_ERR "ea_get: invalid extended attribute\n");
print_hex_dump(KERN_ERR, "", DUMP_PREFIX_ADDRESS, 16, 1,
--
2.43.0
Hey. Need optimization ZRAM - on 1 or 2 user on machine ZRAM has low CPU usage - but on 175 or more user on shared machine has a HIGH CPU usage.
Can anyone test ZRAM on high number of user - and add some patch to low CPU usage around of 10% to 15 % ?
CPU frequency cannot be boosted when using the amd_pstate driver in
passive or guided mode.
On a host that has an AMD EPYC 7662 processor, while running with
amd-pstate configured for passive mode on full CPU load, the processor
only reaches 2.0 GHz. On later kernels the CPU can reach 3.3GHz.
The CPU frequency is dependent on a setting called highest_perf which is
the multiplier used to compute it. The highest_perf value comes from
cppc_init_perf when the driver is built-in and from pstate_init_perf when
it is a loaded module. Both of these calls have the following condition:
highest_perf = amd_get_highest_perf();
if (highest_perf > __cppc_highest_perf_)
highest_perf = __cppc_highest_perf;
Where again __cppc_highest_perf is either the return from
cppc_get_perf_caps in the built-in case or AMD_CPPC_HIGHEST_PERF in the
module case. Both of these functions actually return the nominal value,
whereas the call to amd_get_highest_perf returns the correct boost value,
so the condition tests true and highest_perf always ends up being the
nominal value, therefore never having the ability to boost CPU frequency.
Since amd_get_highest_perf already returns the boost value, we have
eliminated this check.
The issue was introduced in v6.1 via commit bedadcfb011f ("cpufreq:
amd-pstate: Fix initial highest_perf value"), and exists in stable v6.1
kernels. This has been fixed in v6.6.y and newer but due to refactoring that
change isn't feasible to bring back to v6.1.y. Thus, v6.1 kernels are
affected by this significant performance issue, and cannot be easily
remediated.
Signed-off-by: Nabil S. Alramli <dev(a)nalramli.com>
Reviewed-by: Joe Damato <jdamato(a)fastly.com>
Reviewed-by: Kyle Hubert <khubert(a)fastly.com>
Fixes: bedadcfb011f ("cpufreq: amd-pstate: Fix initial highest_perf value")
See-also: 1ec40a175a48 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
Cc: mario.limonciello(a)amd.com
Cc: Perry.Yuan(a)amd.com
Cc: li.meng(a)amd.com
Cc: stable(a)vger.kernel.org # v6.1
---
v2:
- Omit cover letter
- Converted from RFC to PATCH
- Expand commit message based on feedback from Mario Limonciello
- Added Reviewed-by tags
- No functional/code changes
rfc:
https://lore.kernel.org/lkml/20241025010527.491605-1-dev@nalramli.com/
---
drivers/cpufreq/amd-pstate.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c
index 90dcf26f0973..c66086ae624a 100644
--- a/drivers/cpufreq/amd-pstate.c
+++ b/drivers/cpufreq/amd-pstate.c
@@ -102,9 +102,7 @@ static int pstate_init_perf(struct amd_cpudata *cpudata)
*
* CPPC entry doesn't indicate the highest performance in some ASICs.
*/
- highest_perf = amd_get_highest_perf();
- if (highest_perf > AMD_CPPC_HIGHEST_PERF(cap1))
- highest_perf = AMD_CPPC_HIGHEST_PERF(cap1);
+ highest_perf = max(amd_get_highest_perf(), AMD_CPPC_HIGHEST_PERF(cap1));
WRITE_ONCE(cpudata->highest_perf, highest_perf);
@@ -124,9 +122,7 @@ static int cppc_init_perf(struct amd_cpudata *cpudata)
if (ret)
return ret;
- highest_perf = amd_get_highest_perf();
- if (highest_perf > cppc_perf.highest_perf)
- highest_perf = cppc_perf.highest_perf;
+ highest_perf = max(amd_get_highest_perf(), cppc_perf.highest_perf);
WRITE_ONCE(cpudata->highest_perf, highest_perf);
--
2.35.1
This problem reported by Clement LE GOFFIC manifest when
using CONFIG_KASAN_IN_VMALLOC and VMAP_STACK:
https://lore.kernel.org/linux-arm-kernel/a1a1d062-f3a2-4d05-9836-3b098de9db…
After some analysis it seems we are missing to sync the
VMALLOC shadow memory in top level PGD to all CPUs.
Add some code to perform this sync, and the bug appears
to go away.
As suggested by Ard, also perform a dummy read from the
shadow memory of the new VMAP_STACK in the low level
assembly.
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
Changes in v3:
- Collect Mark Rutlands ACK on patch 1
- Change the simplified assembly add r2, ip, lsr #n to the canonical
add r2, r2, ip, lsr #n in patch 2.
- Link to v2: https://lore.kernel.org/r/20241016-arm-kasan-vmalloc-crash-v2-0-0a52fd086ee…
Changes in v2:
- Implement the two helper functions suggested by Russell
making the KASAN PGD copying less messy.
- Link to v1: https://lore.kernel.org/r/20241015-arm-kasan-vmalloc-crash-v1-0-dbb23592ca8…
---
Linus Walleij (2):
ARM: ioremap: Sync PGDs for VMALLOC shadow
ARM: entry: Do a dummy read from VMAP shadow
arch/arm/kernel/entry-armv.S | 8 ++++++++
arch/arm/mm/ioremap.c | 25 +++++++++++++++++++++----
2 files changed, 29 insertions(+), 4 deletions(-)
---
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
change-id: 20241015-arm-kasan-vmalloc-crash-fcbd51416457
Best regards,
--
Linus Walleij <linus.walleij(a)linaro.org>
The 2nd patch fixes CVE-2024-38538, but it requires the helper function
pskb_may_pull_reason which is defined in the 1st patch. Backport both
together.
Eric Dumazet (1):
net: add pskb_may_pull_reason() helper
Nikolay Aleksandrov (1):
net: bridge: xmit: make sure we have at least eth header len bytes
include/linux/skbuff.h | 19 +++++++++++++++----
net/bridge/br_device.c | 6 ++++++
2 files changed, 21 insertions(+), 4 deletions(-)
--
2.46.0
Upstream commit 7f9ab862e05c ("leds: spi-byte: Call of_node_put() on error path")
after being backported to 5.10/5.15/6.1/6.6 stable kernels introduced an
access-before-initialization error - it will most likely lead to a crash
in the probe function of the driver if there is no default zero
initialization for the stack values.
The commit moved the initialization of `struct device_node *child` lower
in code, while in stable kernels its value is used in between those places.
Git context of the patch does not cover the situation so it was applied
without any failures.
Upstream commit which removed that intermediate access to the variable is
ccc35ff2fd29 ("leds: spi-byte: Use devm_led_classdev_register_ext()").
I think it's worth a backport, too. The patches for the corresponding
stable trees follow in this thread.
Judging by Documentation/devicetree/bindings/leds/common.yaml, "label"
leds property is deprecated at least since the start of 2020. So there
should be no problem with switching from "label" to "function"+"color"
device name generation in kernels down to 5.10.y.
--
Fedor
The quilt patch titled
Subject: sched/numa: fix the potential null pointer dereference in task_numa_work()
has been removed from the -mm tree. Its filename was
sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Shawn Wang <shawnwang(a)linux.alibaba.com>
Subject: sched/numa: fix the potential null pointer dereference in task_numa_work()
Date: Fri, 25 Oct 2024 10:22:08 +0800
When running stress-ng-vm-segv test, we found a null pointer dereference
error in task_numa_work(). Here is the backtrace:
[323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
......
[323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se
......
[323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[323676.067115] pc : vma_migratable+0x1c/0xd0
[323676.067122] lr : task_numa_work+0x1ec/0x4e0
[323676.067127] sp : ffff8000ada73d20
[323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010
[323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000
[323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000
[323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8
[323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035
[323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8
[323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4
[323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001
[323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000
[323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000
[323676.067152] Call trace:
[323676.067153] vma_migratable+0x1c/0xd0
[323676.067155] task_numa_work+0x1ec/0x4e0
[323676.067157] task_work_run+0x78/0xd8
[323676.067161] do_notify_resume+0x1ec/0x290
[323676.067163] el0_svc+0x150/0x160
[323676.067167] el0t_64_sync_handler+0xf8/0x128
[323676.067170] el0t_64_sync+0x17c/0x180
[323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000)
[323676.067177] SMP: stopping secondary CPUs
[323676.070184] Starting crashdump kernel...
stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error
handling function of the system, which tries to cause a SIGSEGV error on
return from unmapping the whole address space of the child process.
Normally this program will not cause kernel crashes. But before the
munmap system call returns to user mode, a potential task_numa_work() for
numa balancing could be added and executed. In this scenario, since the
child process has no vma after munmap, the vma_next() in task_numa_work()
will return a null pointer even if the vma iterator restarts from 0.
Recheck the vma pointer before dereferencing it in task_numa_work().
Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.c…
Fixes: 214dbc428137 ("sched: convert to vma iterator")
Signed-off-by: Shawn Wang <shawnwang(a)linux.alibaba.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Cc: Valentin Schneider <vschneid(a)redhat.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: <stable(a)vger.kernel.org> [6.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c~sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work
+++ a/kernel/sched/fair.c
@@ -3369,7 +3369,7 @@ retry_pids:
vma = vma_next(&vmi);
}
- do {
+ for (; vma; vma = vma_next(&vmi)) {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
@@ -3491,7 +3491,7 @@ retry_pids:
*/
if (vma_pids_forced)
break;
- } for_each_vma(vmi, vma);
+ }
/*
* If no VMAs are remaining and VMAs were skipped due to the PID
_
Patches currently in -mm which might be from shawnwang(a)linux.alibaba.com are
daddr can be NULL if there is no neighbour table entry present,
in that case the tx packet should be dropped.
saddr will usually be set by MCTP core, but check for NULL in case a
packet is transmitted by a different protocol.
Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver")
Cc: stable(a)vger.kernel.org
Reported-by: Dung Cao <dung(a)os.amperecomputing.com>
Signed-off-by: Matt Johnston <matt(a)codeconstruct.com.au>
---
Changes in v3:
- Revert to simpler saddr check of v1, mention in commit message
- Revert whitespace change from v2
- Link to v2: https://lore.kernel.org/r/20241021-mctp-i2c-null-dest-v2-1-4503e478517c@cod…
Changes in v2:
- Set saddr to device address if NULL, mention in commit message
- Fix patch prefix formatting
- Link to v1: https://lore.kernel.org/r/20241018-mctp-i2c-null-dest-v1-1-ba1ab52966e9@cod…
---
drivers/net/mctp/mctp-i2c.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/mctp/mctp-i2c.c b/drivers/net/mctp/mctp-i2c.c
index 4dc057c121f5d0fb9c9c48bf16b6933ae2f7b2ac..e70fb66879941f3937b7ffc5bc1e20a8a435a441 100644
--- a/drivers/net/mctp/mctp-i2c.c
+++ b/drivers/net/mctp/mctp-i2c.c
@@ -588,6 +588,9 @@ static int mctp_i2c_header_create(struct sk_buff *skb, struct net_device *dev,
if (len > MCTP_I2C_MAXMTU)
return -EMSGSIZE;
+ if (!daddr || !saddr)
+ return -EINVAL;
+
lldst = *((u8 *)daddr);
llsrc = *((u8 *)saddr);
---
base-commit: cb560795c8c2ceca1d36a95f0d1b2eafc4074e37
change-id: 20241018-mctp-i2c-null-dest-a0ba271e0c48
Best regards,
--
Matt Johnston <matt(a)codeconstruct.com.au>
This patch series is to fix bugs for below APIs:
devm_phy_put()
devm_of_phy_provider_unregister()
devm_phy_destroy()
phy_get()
of_phy_get()
devm_phy_get()
devm_of_phy_get()
devm_of_phy_get_by_index()
And simplify below API:
of_phy_simple_xlate().
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
Changes in v2:
- Correct title, commit message, and inline comments.
- Link to v1: https://lore.kernel.org/r/20241020-phy_core_fix-v1-0-078062f7da71@quicinc.c…
---
Zijun Hu (6):
phy: core: Fix that API devm_phy_put() fails to release the phy
phy: core: Fix that API devm_of_phy_provider_unregister() fails to unregister the phy provider
phy: core: Fix that API devm_phy_destroy() fails to destroy the phy
phy: core: Fix an OF node refcount leakage in _of_phy_get()
phy: core: Fix an OF node refcount leakage in of_phy_provider_lookup()
phy: core: Simplify API of_phy_simple_xlate() implementation
drivers/phy/phy-core.c | 39 +++++++++++++++++++--------------------
1 file changed, 19 insertions(+), 20 deletions(-)
---
base-commit: e70d2677ef4088d59158739d72b67ac36d1b132b
change-id: 20241020-phy_core_fix-e3ad65db98f7
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>
From: Zijun Hu <quic_zijuhu(a)quicinc.com>
Commit 6ed05c68cbca ("usb: musb: sunxi: Explicitly release USB PHY on
exit") will cause that usb phy @glue->xceiv is accessed after released.
1) register platform driver @sunxi_musb_driver
// get the usb phy @glue->xceiv
sunxi_musb_probe() -> devm_usb_get_phy().
2) register and unregister platform driver @musb_driver
musb_probe() -> sunxi_musb_init()
use the phy here
//the phy is released here
musb_remove() -> sunxi_musb_exit() -> devm_usb_put_phy()
3) register @musb_driver again
musb_probe() -> sunxi_musb_init()
use the phy here but the phy has been released at 2).
...
Fixed by reverting the commit, namely, removing devm_usb_put_phy()
from sunxi_musb_exit().
Fixes: 6ed05c68cbca ("usb: musb: sunxi: Explicitly release USB PHY on exit")
Cc: stable(a)vger.kernel.org
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
drivers/usb/musb/sunxi.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/usb/musb/sunxi.c b/drivers/usb/musb/sunxi.c
index d54283fd026b..05b6e7e52e02 100644
--- a/drivers/usb/musb/sunxi.c
+++ b/drivers/usb/musb/sunxi.c
@@ -293,8 +293,6 @@ static int sunxi_musb_exit(struct musb *musb)
if (test_bit(SUNXI_MUSB_FL_HAS_SRAM, &glue->flags))
sunxi_sram_release(musb->controller->parent);
- devm_usb_put_phy(glue->dev, glue->xceiv);
-
return 0;
}
---
base-commit: afb92ad8733ef0a2843cc229e4d96aead80bc429
change-id: 20241029-sunxi_fix-07fe18228733
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. Successful demotions will
cause node vmstat numbers to double-decrement, leading to an
imbalanced page count. The result is dmesg output like such:
$ cat /proc/sys/vm/stat_refresh
[77383.088417] vmstat_refresh: nr_isolated_anon -103212
[77383.088417] vmstat_refresh: nr_isolated_file -899642
This negative value may impact compaction and reclaim throttling.
The double-decrement occurs in the migrate_pages path:
caller to shrink_folio_list decrements the count
shrink_folio_list
demote_folio_list
migrate_pages
migrate_pages_batch
migrate_folio_move
migrate_folio_done
mod_node_page_state(-ve) <- second decrement
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count
when the migration reason is MR_DEMOTION. As of v6.11, this demotion
logic is the only source of MR_DEMOTION.
Signed-off-by: Gregory Price <gourry(a)gourry.net>
Fixes: 26aa2d199d6f2 ("mm/migrate: demote pages during reclaim")
Cc: stable(a)vger.kernel.org
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 923ea80ba744..e3aac274cf16 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1099,7 +1099,7 @@ static void migrate_folio_done(struct folio *src,
* not accounted to NR_ISOLATED_*. They can be recognized
* as __folio_test_movable
*/
- if (likely(!__folio_test_movable(src)))
+ if (likely(!__folio_test_movable(src)) && reason != MR_DEMOTION)
mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
folio_is_file_lru(src), -folio_nr_pages(src));
--
2.43.0
SVACE reports a potential NULL pointer dereference in 5.10, 5.15 and 6.1
stable releases since the commit 4c9f8d114660 ("ath10k: enable TDLS
peer inactivity detection") that caused this report was appeared.
The problem has been fixed by the following upstream patch that was adapted
to 5.10, 5.15 and 6.1. All of the changes made to the patch in order to adapt it
are described at the end of commit message.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Peter Kosyh (1):
wifi: ath10k: Check return value of ath10k_get_arvif() in ath10k_wmi_event_tdls_peer()
drivers/net/wireless/ath/ath10k/wmi-tlv.c | 7 +++++++
1 file changed, 7 insertions(+)
--
2.43.5
Move LNL scheduling WA to xe_device.h so this can be used in other
places without needing keep the same comment about removal of this WA
in the future. The WA, which flushes work or workqueues, is now wrapped
in macros and can be reused wherever needed.
Cc: Badal Nilawar <badal.nilawar(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
cc: <stable(a)vger.kernel.org> # v6.11+
Suggested-by: John Harrison <John.C.Harrison(a)Intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das(a)intel.com>
---
drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
drivers/gpu/drm/xe/xe_guc_ct.c | 11 +----------
2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 4c3f0ebe78a9..f1fbfe916867 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
struct xe_file *xe_file_get(struct xe_file *xef);
void xe_file_put(struct xe_file *xef);
+/*
+ * Occasionally it is seen that the G2H worker starts running after a delay of more than
+ * a second even after being queued and activated by the Linux workqueue subsystem. This
+ * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+ * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+ * and this is beyond xe kmd.
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+ */
+#define LNL_FLUSH_WORKQUEUE(wq__) \
+ flush_workqueue(wq__)
+#define LNL_FLUSH_WORK(wrk__) \
+ flush_work(wrk__)
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 1b5d8fb1033a..703b44b257a7 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
- /*
- * Occasionally it is seen that the G2H worker starts running after a delay of more than
- * a second even after being queued and activated by the Linux workqueue subsystem. This
- * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
- * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
- * and this is beyond xe kmd.
- *
- * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
- */
if (!ret) {
- flush_work(&ct->g2h_worker);
+ LNL_FLUSH_WORK(&ct->g2h_worker);
if (g2h_fence.done) {
xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
g2h_fence.seqno, action[0]);
--
2.46.0
In a commit 24b7f8e5cd65 ("firewire: core: use helper functions for self
ID sequence"), the enumeration over self ID sequence was refactored with
some helper functions with KUnit tests. These helper functions are
guaranteed to work expectedly by the KUnit tests, however their application
includes a mistake to assign invalid value to the index of port connected
to parent device.
This bug affects the case that any extra node devices which has three or
more ports are connected to 1394 OHCI controller. In the case, the path
to update the tree cache could hits WARN_ON(), and gets general protection
fault due to the access to invalid address computed by the invalid value.
This commit fixes the bug to assign correct port index.
Cc: stable(a)vger.kernel.org
Reported-by: Edmund Raile <edmund.raile(a)proton.me>
Closes: https://lore.kernel.org/lkml/8a9902a4ece9329af1e1e42f5fea76861f0bf0e8.camel…
Fixes: 24b7f8e5cd65 ("firewire: core: use helper functions for self ID sequence")
Signed-off-by: Takashi Sakamoto <o-takashi(a)sakamocchi.jp>
---
drivers/firewire/core-topology.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/firewire/core-topology.c b/drivers/firewire/core-topology.c
index 6adadb11962e..892b94cfd626 100644
--- a/drivers/firewire/core-topology.c
+++ b/drivers/firewire/core-topology.c
@@ -204,7 +204,7 @@ static struct fw_node *build_tree(struct fw_card *card, const u32 *sid, int self
// the node->ports array where the parent node should be. Later,
// when we handle the parent node, we fix up the reference.
++parent_count;
- node->color = i;
+ node->color = port_index;
break;
case PHY_PACKET_SELF_ID_PORT_STATUS_CHILD:
--
2.45.2
Greetings,
This is a RFC for a maintenance patch to an issue in the amd_pstate driver
where CPU frequency cannot be boosted in passive or guided modes. Without
this patch, AMD machines using stable kernels are unable to get their CPU
frequency boosted, which is a significant performance issue.
For example, on a host that has AMD EPYC 7662 64-Core processor without
this patch running at full CPU load:
$ for i in $(cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq); \
do ni=$(echo "scale=1; $i/1000000" | bc -l); echo "$ni GHz"; done | \
sort | uniq -c
128 2.0 GHz
And with this patch:
$ for i in $(cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq); \
do ni=$(echo "scale=1; $i/1000000" | bc -l); echo "$ni GHz"; done | \
sort | uniq -c
128 3.3 GHz
I am not sure what the correct process is for submitting patches which
affect only stable trees but not the current code base, and do not apply to
the current tree. As such, I am submitting this directly to stable@, but
please let me know if I should be submitting this elsewhere.
The issue was introduced in v6.1 via commit bedadcfb011f ("cpufreq:
amd-pstate: Fix initial highest_perf value"), and exists in stable kernels
up until v6.6.51.
In v6.6.51, a large change, commit 1ec40a175a48 ("cpufreq: amd-pstate:
Enable amd-pstate preferred core support"), was introduced which
significantly refactored the code. This commit cannot be ported back on its
own, and would require reviewing and cherry picking at least a few dozen of
commits in cpufreq, amd-pstate, ACPI, CPPC.
This means kernels v6.1 up until v6.6.51 are affected by this significant
performance issue, and cannot be easily remediated.
Thank you for your attention and I look forward to your response in regards
to what the best way to proceed is for getting this important performance
fix merged.
Best Regards,
Nabil S. Alramli (1):
cpufreq: amd-pstate: Enable CPU boost in passive and guided modes
drivers/cpufreq/amd-pstate.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
--
2.35.1
From: Michael Walle <mwalle(a)kernel.org>
Commit 83e824a4a595 ("mtd: spi-nor: Correct flags for Winbond w25q128")
removed the flags for non-SFDP devices. It was assumed that it wasn't in
use anymore. This wasn't true. Add the no_sfdp_flags as well as the size
again.
We add the additional flags for dual and quad read because they have
been reported to work properly by Hartmut using both older and newer
versions of this flash, the similar flashes with 64Mbit and 256Mbit
already have these flags and because it will (luckily) trigger our
legacy SFDP parsing, so newer versions with SFDP support will still get
the parameters from the SFDP tables.
This was applied to mainline as
commit e49b2731c396 ("mtd: spi-nor: winbond: fix w25q128 regression")
however the code has changed a lot after v6.6 so the patch did
not apply to v6.6 or v6.1 which still has the problem.
This patch fixes the issue in the way of the old API and has
been tested on hardware. Please apply it for v6.1 and v6.6.
Reported-by: Hartmut Birr <e9hack(a)gmail.com>
Closes: https://lore.kernel.org/r/CALxbwRo_-9CaJmt7r7ELgu+vOcgk=xZcGHobnKf=oT2=u4d4…
Fixes: 83e824a4a595 ("mtd: spi-nor: Correct flags for Winbond w25q128")
Reviewed-by: Linus Walleij <linus.walleij(a)linaro.org>
Signed-off-by: Michael Walle <mwalle(a)kernel.org>
Acked-by: Tudor Ambarus <tudor.ambarus(a)linaro.org>
Reviewed-by: Esben Haabendal <esben(a)geanix.com>
Reviewed-by: Pratyush Yadav <pratyush(a)kernel.org>
Signed-off-by: Pratyush Yadav <pratyush(a)kernel.org>
Link: https://lore.kernel.org/r/20240621120929.2670185-1-mwalle@kernel.org
[Backported to v6.6 - vastly different due to upstream changes]
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
This fix backported to stable v6.6 and v6.1 after reports
from OpenWrt users:
https://github.com/openwrt/openwrt/issues/16796
---
drivers/mtd/spi-nor/winbond.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/mtd/spi-nor/winbond.c b/drivers/mtd/spi-nor/winbond.c
index cd99c9a1c568..95dd28b9bf14 100644
--- a/drivers/mtd/spi-nor/winbond.c
+++ b/drivers/mtd/spi-nor/winbond.c
@@ -120,9 +120,10 @@ static const struct flash_info winbond_nor_parts[] = {
NO_SFDP_FLAGS(SECT_4K) },
{ "w25q80bl", INFO(0xef4014, 0, 64 * 1024, 16)
NO_SFDP_FLAGS(SECT_4K) },
- { "w25q128", INFO(0xef4018, 0, 0, 0)
- PARSE_SFDP
- FLAGS(SPI_NOR_HAS_LOCK | SPI_NOR_HAS_TB) },
+ { "w25q128", INFO(0xef4018, 0, 64 * 1024, 256)
+ FLAGS(SPI_NOR_HAS_LOCK | SPI_NOR_HAS_TB)
+ NO_SFDP_FLAGS(SECT_4K | SPI_NOR_DUAL_READ |
+ SPI_NOR_QUAD_READ) },
{ "w25q256", INFO(0xef4019, 0, 64 * 1024, 512)
NO_SFDP_FLAGS(SECT_4K | SPI_NOR_DUAL_READ | SPI_NOR_QUAD_READ)
.fixups = &w25q256_fixups },
---
base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
change-id: 20241027-v6-6-7ed05eaccb22
Best regards,
--
Linus Walleij <linus.walleij(a)linaro.org>
The patch titled
Subject: mm/thp: fix deferred split unqueue naming and locking
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-thp-fix-deferred-split-unqueue-naming-and-locking.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: fix deferred split unqueue naming and locking
Date: Sun, 27 Oct 2024 13:02:13 -0700 (PDT)
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does. But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
of whose callers now call it beforehand (and if any forget then bad_page()
will tell) - except for its caller put_pages_list(), which itself no
longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list. __remove_mapping() has frozen refcount to 0
here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache. That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails. Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a deferred
list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included. There
is not a strong case for backports, but they can fix cornercases.
Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Usama Arif <usamaarif642(a)gmail.com>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 35 ++++++++++++++++++++++++++---------
mm/internal.h | 10 +++++-----
mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
mm/memcontrol.c | 8 +++++---
mm/migrate.c | 4 ++--
mm/page_alloc.c | 1 -
mm/swap.c | 4 ++--
mm/vmscan.c | 4 ++--
8 files changed, 67 insertions(+), 24 deletions(-)
--- a/mm/huge_memory.c~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/huge_memory.c
@@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *fo
return split_huge_page_to_list_to_order(&folio->page, list, ret);
}
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
{
struct deferred_split *ds_queue;
unsigned long flags;
+ bool unqueued = false;
+
+ WARN_ON_ONCE(folio_ref_count(folio));
+ WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
ds_queue = get_deferred_split_queue(folio);
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
list_del_init(&folio->_deferred_list);
+ unqueued = true;
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+ return unqueued; /* useful for debug warnings */
}
/* partially_mapped=false won't clear PG_partially_mapped folio flag */
@@ -3627,14 +3647,11 @@ void deferred_split_folio(struct folio *
return;
/*
- * The try_to_unmap() in page reclaim path might reach here too,
- * this may cause a race condition to corrupt deferred split queue.
- * And, if page reclaim is already handling the same folio, it is
- * unnecessary to handle it again in shrinker.
- *
- * Check the swapcache flag to determine if the folio is being
- * handled by page reclaim since THP swap would add the folio into
- * swap cache before calling try_to_unmap().
+ * Exclude swapcache: originally to avoid a corrupt deferred split
+ * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+ * but if page reclaim is already handling the same folio, it is
+ * unnecessary to handle it again in the shrinker, so excluding
+ * swapcache here may still be a useful optimization.
*/
if (folio_test_swapcache(folio))
return;
--- a/mm/internal.h~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/internal.h
@@ -684,11 +684,11 @@ static inline void folio_set_order(struc
#endif
}
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
{
if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
- return;
+ return false;
/*
* At this point, there is no one trying to add the folio to
@@ -696,9 +696,9 @@ static inline void folio_undo_large_rmap
* to check without acquiring the split_queue_lock.
*/
if (data_race(list_empty(&folio->_deferred_list)))
- return;
+ return false;
- __folio_undo_large_rmappable(folio);
+ return __folio_unqueue_deferred_split(folio);
}
static inline struct folio *page_rmappable_folio(struct page *page)
--- a/mm/memcontrol.c~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/memcontrol.c
@@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio
struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
- !folio_test_hugetlb(folio) &&
- !list_empty(&folio->_deferred_list), folio);
/*
* Nobody should be changing or seriously looking at
@@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio
ug->nr_memory += nr_pages;
ug->pgpgout++;
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = 0;
}
@@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *ol
/* Transfer the charge and the css ref */
commit_charge(new, memcg);
+
+ /* Warning should never happen, so don't worry about refcount non-0 */
+ WARN_ON_ONCE(folio_unqueue_deferred_split(old));
old->memcg_data = 0;
}
@@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *fo
VM_BUG_ON_FOLIO(oldid, folio);
mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
+ folio_unqueue_deferred_split(folio);
folio->memcg_data = 0;
if (!mem_cgroup_is_root(memcg))
--- a/mm/memcontrol-v1.c~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/memcontrol-v1.c
@@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struc
css_get(&to->css);
css_put(&from->css);
+ /* Warning should never happen, so don't worry about refcount non-0 */
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = (unsigned long)to;
__folio_memcg_unlock(from);
@@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_ra
enum mc_target_type target_type;
union mc_target target;
struct folio *folio;
+ bool tried_split_before = false;
+retry_pmd:
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
if (mc.precharge < HPAGE_PMD_NR) {
@@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_ra
target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
if (target_type == MC_TARGET_PAGE) {
folio = target.folio;
+ /*
+ * Deferred split queue locking depends on memcg,
+ * and unqueue is unsafe unless folio refcount is 0:
+ * split or skip if on the queue? first try to split.
+ */
+ if (!list_empty(&folio->_deferred_list)) {
+ spin_unlock(ptl);
+ if (!tried_split_before)
+ split_folio(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ if (tried_split_before)
+ return 0;
+ tried_split_before = true;
+ goto retry_pmd;
+ }
+ /*
+ * So long as that pmd lock is held, the folio cannot
+ * be racily added to the _deferred_list, because
+ * __folio_remove_rmap() will find !partially_mapped.
+ */
if (folio_isolate_lru(folio)) {
if (!mem_cgroup_move_account(folio, true,
mc.from, mc.to)) {
--- a/mm/migrate.c~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/migrate.c
@@ -490,7 +490,7 @@ static int __folio_migrate_mapping(struc
folio_test_large_rmappable(folio)) {
if (!folio_ref_freeze(folio, expected_count))
return -EAGAIN;
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
folio_ref_unfreeze(folio, expected_count);
}
@@ -515,7 +515,7 @@ static int __folio_migrate_mapping(struc
}
/* Take off deferred split queue while frozen and memcg set */
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
/*
* Now we know that no one else is looking at the folio:
--- a/mm/page_alloc.c~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/page_alloc.c
@@ -2686,7 +2686,6 @@ void free_unref_folios(struct folio_batc
unsigned long pfn = folio_pfn(folio);
unsigned int order = folio_order(folio);
- folio_undo_large_rmappable(folio);
if (!free_pages_prepare(&folio->page, order))
continue;
/*
--- a/mm/swap.c~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/swap.c
@@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
}
page_cache_release(folio);
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
free_unref_page(&folio->page, folio_order(folio));
}
@@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch
free_huge_folio(folio);
continue;
}
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
__page_cache_release(folio, &lruvec, &flags);
if (j != i)
--- a/mm/vmscan.c~mm-thp-fix-deferred-split-unqueue-naming-and-locking
+++ a/mm/vmscan.c
@@ -1476,7 +1476,7 @@ free_it:
*/
nr_reclaimed += nr_pages;
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
@@ -1864,7 +1864,7 @@ static unsigned int move_folios_to_lru(s
if (unlikely(folio_put_testzero(folio))) {
__folio_clear_lru_flags(folio);
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-fix-deferred-split-queue-not-partially_mapped.patch
mm-thp-fix-deferred-split-unqueue-naming-and-locking.patch
mm-delete-the-unused-put_pages_list.patch
During the aborting of a command, the software receives a command
completion event for the command ring stopped, with the TRB pointing
to the next TRB after the aborted command.
If the command we abort is located just before the Link TRB in the
command ring, then during the 'command ring stopped' completion event,
the xHC gives the Link TRB in the event's cmd DMA, which causes a
mismatch in handling command completion event.
To address this situation, move the 'command ring stopped' completion
event check slightly earlier, since the specific command it stopped
on isn't of significant concern.
Fixes: 7f84eef0dafb ("USB: xhci: No-op command queueing and irq handler.")
Cc: stable(a)vger.kernel.org
Signed-off-by: Faisal Hassan <quic_faisalh(a)quicinc.com>
---
Changes in v3:
- Skip dma check for the cmd ring stopped event
- v2 link:
https://lore.kernel.org/all/20241021131904.20678-1-quic_faisalh@quicinc.com
Changes in v2:
- Added Fixes tag
- Removed traversing of TRBs with in_range() API.
- Simplified the if condition check.
- v1 link:
https://lore.kernel.org/all/20241018195953.12315-1-quic_faisalh@quicinc.com
drivers/usb/host/xhci-ring.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index b2950c35c740..1ffc69c48eac 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1718,6 +1718,14 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
trace_xhci_handle_command(xhci->cmd_ring, &cmd_trb->generic);
+ cmd_comp_code = GET_COMP_CODE(le32_to_cpu(event->status));
+
+ /* If CMD ring stopped we own the trbs between enqueue and dequeue */
+ if (cmd_comp_code == COMP_COMMAND_RING_STOPPED) {
+ complete_all(&xhci->cmd_ring_stop_completion);
+ return;
+ }
+
cmd_dequeue_dma = xhci_trb_virt_to_dma(xhci->cmd_ring->deq_seg,
cmd_trb);
/*
@@ -1734,14 +1742,6 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
cancel_delayed_work(&xhci->cmd_timer);
- cmd_comp_code = GET_COMP_CODE(le32_to_cpu(event->status));
-
- /* If CMD ring stopped we own the trbs between enqueue and dequeue */
- if (cmd_comp_code == COMP_COMMAND_RING_STOPPED) {
- complete_all(&xhci->cmd_ring_stop_completion);
- return;
- }
-
if (cmd->command_trb != xhci->cmd_ring->dequeue) {
xhci_err(xhci,
"Command completion event does not match command\n");
--
2.17.1
This is a note to let you know that I've just added the patch titled
docs: iio: ad7380: fix supply for ad7380-4
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 795114e849ddfd48150eb0135d04748a8c81cec5 Mon Sep 17 00:00:00 2001
From: Julien Stephan <jstephan(a)baylibre.com>
Date: Tue, 22 Oct 2024 15:22:40 +0200
Subject: docs: iio: ad7380: fix supply for ad7380-4
ad7380-4 is the only device from ad738x family that doesn't have an
internal reference. Moreover it's external reference is called REFIN in
the datasheet while all other use REFIO as an optional external
reference. Update documentation to highlight this.
Fixes: 3e82dfc82f38 ("docs: iio: new docs for ad7380 driver")
Reviewed-by: David Lechner <dlechner(a)baylibre.com>
Signed-off-by: Julien Stephan <jstephan(a)baylibre.com>
Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-5-f0cefe1b7fa6@bay…
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
Documentation/iio/ad7380.rst | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/Documentation/iio/ad7380.rst b/Documentation/iio/ad7380.rst
index 9c784c1e652e..6f70b49b9ef2 100644
--- a/Documentation/iio/ad7380.rst
+++ b/Documentation/iio/ad7380.rst
@@ -41,13 +41,22 @@ supports only 1 SDO line.
Reference voltage
-----------------
-2 possible reference voltage sources are supported:
+ad7380-4
+~~~~~~~~
+
+ad7380-4 supports only an external reference voltage (2.5V to 3.3V). It must be
+declared in the device tree as ``refin-supply``.
+
+All other devices from ad738x family
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All other devices from ad738x support 2 possible reference voltage sources:
- Internal reference (2.5V)
- External reference (2.5V to 3.3V)
The source is determined by the device tree. If ``refio-supply`` is present,
-then the external reference is used, else the internal reference is used.
+then it is used as external reference, else the internal reference is used.
Oversampling and resolution boost
---------------------------------
--
2.47.0
This is a note to let you know that I've just added the patch titled
iio: adc: ad7380: fix supplies for ad7380-4
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 05f9c67179c9a8d66dee175fb4b17f380908a26f Mon Sep 17 00:00:00 2001
From: Julien Stephan <jstephan(a)baylibre.com>
Date: Tue, 22 Oct 2024 15:22:39 +0200
Subject: iio: adc: ad7380: fix supplies for ad7380-4
ad7380-4 is the only device in the family that does not have an internal
reference. It uses "refin" as a required external reference.
All other devices in the family use "refio"" as an optional external
reference.
Fixes: 737413da8704 ("iio: adc: ad7380: add support for ad738x-4 4 channels variants")
Reviewed-by: Nuno Sa <nuno.sa(a)analog.com>
Reviewed-by: David Lechner <dlechner(a)baylibre.com>
Signed-off-by: Julien Stephan <jstephan(a)baylibre.com>
Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-4-f0cefe1b7fa6@bay…
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/adc/ad7380.c | 36 ++++++++++++++++++++++++++----------
1 file changed, 26 insertions(+), 10 deletions(-)
diff --git a/drivers/iio/adc/ad7380.c b/drivers/iio/adc/ad7380.c
index b107d8e97ab3..fb728570debe 100644
--- a/drivers/iio/adc/ad7380.c
+++ b/drivers/iio/adc/ad7380.c
@@ -89,6 +89,7 @@ struct ad7380_chip_info {
bool has_mux;
const char * const *supplies;
unsigned int num_supplies;
+ bool external_ref_only;
const char * const *vcm_supplies;
unsigned int num_vcm_supplies;
const unsigned long *available_scan_masks;
@@ -431,6 +432,7 @@ static const struct ad7380_chip_info ad7380_4_chip_info = {
.num_simult_channels = 4,
.supplies = ad7380_supplies,
.num_supplies = ARRAY_SIZE(ad7380_supplies),
+ .external_ref_only = true,
.available_scan_masks = ad7380_4_channel_scan_masks,
.timing_specs = &ad7380_4_timing,
};
@@ -1047,17 +1049,31 @@ static int ad7380_probe(struct spi_device *spi)
"Failed to enable power supplies\n");
fsleep(T_POWERUP_US);
- /*
- * If there is no REFIO supply, then it means that we are using
- * the internal 2.5V reference, otherwise REFIO is reference voltage.
- */
- ret = devm_regulator_get_enable_read_voltage(&spi->dev, "refio");
- if (ret < 0 && ret != -ENODEV)
- return dev_err_probe(&spi->dev, ret,
- "Failed to get refio regulator\n");
+ if (st->chip_info->external_ref_only) {
+ ret = devm_regulator_get_enable_read_voltage(&spi->dev,
+ "refin");
+ if (ret < 0)
+ return dev_err_probe(&spi->dev, ret,
+ "Failed to get refin regulator\n");
- external_ref_en = ret != -ENODEV;
- st->vref_mv = external_ref_en ? ret / 1000 : AD7380_INTERNAL_REF_MV;
+ st->vref_mv = ret / 1000;
+
+ /* these chips don't have a register bit for this */
+ external_ref_en = false;
+ } else {
+ /*
+ * If there is no REFIO supply, then it means that we are using
+ * the internal reference, otherwise REFIO is reference voltage.
+ */
+ ret = devm_regulator_get_enable_read_voltage(&spi->dev,
+ "refio");
+ if (ret < 0 && ret != -ENODEV)
+ return dev_err_probe(&spi->dev, ret,
+ "Failed to get refio regulator\n");
+
+ external_ref_en = ret != -ENODEV;
+ st->vref_mv = external_ref_en ? ret / 1000 : AD7380_INTERNAL_REF_MV;
+ }
if (st->chip_info->num_vcm_supplies > ARRAY_SIZE(st->vcm_mv))
return dev_err_probe(&spi->dev, -EINVAL,
--
2.47.0
This is a note to let you know that I've just added the patch titled
dt-bindings: iio: adc: ad7380: fix ad7380-4 reference supply
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From fbe5956e8809f04e9121923db0b6d1b94f2b93ba Mon Sep 17 00:00:00 2001
From: Julien Stephan <jstephan(a)baylibre.com>
Date: Tue, 22 Oct 2024 15:22:36 +0200
Subject: dt-bindings: iio: adc: ad7380: fix ad7380-4 reference supply
ad7380-4 is the only device from ad738x family that doesn't have an
internal reference. Moreover its external reference is called REFIN in
the datasheet while all other use REFIO as an optional external
reference. If refio-supply is omitted the internal reference is
used.
Fix the binding by adding refin-supply and makes it required for
ad7380-4 only.
Fixes: 1a291cc8ee17 ("dt-bindings: iio: adc: ad7380: add support for ad738x-4 4 channels variants")
Acked-by: Conor Dooley <conor.dooley(a)microchip.com>
Reviewed-by: David Lechner <dlechner(a)baylibre.com>
Signed-off-by: Julien Stephan <jstephan(a)baylibre.com>
Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-1-f0cefe1b7fa6@bay…
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
.../bindings/iio/adc/adi,ad7380.yaml | 21 +++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml b/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml
index bd19abb867d9..0065d6508824 100644
--- a/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml
+++ b/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml
@@ -67,6 +67,10 @@ properties:
A 2.5V to 3.3V supply for the external reference voltage. When omitted,
the internal 2.5V reference is used.
+ refin-supply:
+ description:
+ A 2.5V to 3.3V supply for external reference voltage, for ad7380-4 only.
+
aina-supply:
description:
The common mode voltage supply for the AINA- pin on pseudo-differential
@@ -135,6 +139,23 @@ allOf:
ainc-supply: false
aind-supply: false
+ # ad7380-4 uses refin-supply as external reference.
+ # All other chips from ad738x family use refio as optional external reference.
+ # When refio-supply is omitted, internal reference is used.
+ - if:
+ properties:
+ compatible:
+ enum:
+ - adi,ad7380-4
+ then:
+ properties:
+ refio-supply: false
+ required:
+ - refin-supply
+ else:
+ properties:
+ refin-supply: false
+
examples:
- |
#include <dt-bindings/interrupt-controller/irq.h>
--
2.47.0
From: "Jason-JH.Lin" <jason-jh.lin(a)mediatek.com>
This reverts commit ac88a1f41f93499df6f50fd18ea835e6ff4f3200.
Reason for revert:
1. The commit [1] does not land on linux-5.15, so this patch does not
fix anything.
2. Since the fw_devlink improvements series [2] does not land on
linux-5.15, using device_set_fwnode() causes the panel to flash during
bootup.
Incorrect link management may lead to incorrect device initialization,
affecting firmware node links and consumer relationships.
The fwnode setting of panel to the DSI device would cause a DSI
initialization error without series[2], so this patch was reverted to
avoid using the incomplete fw_devlink functionality.
[1] commit 3fb16866b51d ("driver core: fw_devlink: Make cycle detection more robust")
[2] Link: https://lore.kernel.org/all/20230207014207.1678715-1-saravanak@google.com
Cc: stable(a)vger.kernel.org # 5.15.169
Cc: stable(a)vger.kernel.org # 5.10.228
Cc: stable(a)vger.kernel.org # 5.4.284
Signed-off-by: Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
---
drivers/gpu/drm/drm_mipi_dsi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_mipi_dsi.c b/drivers/gpu/drm/drm_mipi_dsi.c
index 24606b632009..468a3a7cb6a5 100644
--- a/drivers/gpu/drm/drm_mipi_dsi.c
+++ b/drivers/gpu/drm/drm_mipi_dsi.c
@@ -221,7 +221,7 @@ mipi_dsi_device_register_full(struct mipi_dsi_host *host,
return dsi;
}
- device_set_node(&dsi->dev, of_fwnode_handle(info->node));
+ dsi->dev.of_node = info->node;
dsi->channel = info->channel;
strlcpy(dsi->name, info->type, sizeof(dsi->name));
---
base-commit: 74cdd62cb4706515b454ce5bacb73b566c1d1bcf
change-id: 20241024-fixup-5-15-5fdd68dae707
Best regards,
--
Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
After commit 94d7d9233951 ("mm: abstract the vma_merge()/split_vma()
pattern for mprotect() et al."), if vma_modify_flags() return error, the
vma is set to an error code. This will lead to an invalid prev be
returned.
Generally this shouldn't matter as the caller should treat an error as
indicating state is now invalidated, however unfortunately
apply_mlockall_flags() does not check for errors and assumes that
mlock_fixup() correctly maintains prev even if an error were to occur.
This patch fixes that assumption.
[lorenzo: provide a better fix and rephrase the log]
Fixes: 94d7d9233951 ("mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.")
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
CC: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
CC: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
CC: Vlastimil Babka <vbabka(a)suse.cz>
CC: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
---
v2:
rearrange the fix and change log per Lorenzo's suggestion
add fix tag and cc stable
---
mm/mlock.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index e3e3dc2b2956..cde076fa7d5e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -725,14 +725,17 @@ static int apply_mlockall_flags(int flags)
}
for_each_vma(vmi, vma) {
+ int error;
vm_flags_t newflags;
newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= to_add;
- /* Ignore errors */
- mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
- newflags);
+ error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
+ newflags);
+ /* Ignore errors, but prev needs fixing up. */
+ if (error)
+ prev = vma;
cond_resched();
}
out:
--
2.34.1
From: "Jason-JH.Lin" <jason-jh.lin(a)mediatek.com>
This reverts commit ac88a1f41f93499df6f50fd18ea835e6ff4f3200.
Reason for revert:
1. The commit [1] does not land on linux-5.15, so this patch does not
fix anything.
2. Since the fw_device improvements series [2] does not land on
linux-5.15, using device_set_fwnode() causes the panel to flash during
bootup.
Incorrect link management may lead to incorrect device initialization,
affecting firmware node links and consumer relationships.
The fwnode setting of panel to the DSI device would cause a DSI
initialization error without series[2], so this patch was reverted to
avoid using the incomplete fw_devlink functionality.
[1] commit 3fb16866b51d ("driver core: fw_devlink: Make cycle detection more robust")
[2] Link: https://lore.kernel.org/all/20230207014207.1678715-1-saravanak@google.com
Cc: stable(a)vger.kernel.org # 5.15.169
Signed-off-by: Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
---
drivers/gpu/drm/drm_mipi_dsi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_mipi_dsi.c b/drivers/gpu/drm/drm_mipi_dsi.c
index 24606b632009..468a3a7cb6a5 100644
--- a/drivers/gpu/drm/drm_mipi_dsi.c
+++ b/drivers/gpu/drm/drm_mipi_dsi.c
@@ -221,7 +221,7 @@ mipi_dsi_device_register_full(struct mipi_dsi_host *host,
return dsi;
}
- device_set_node(&dsi->dev, of_fwnode_handle(info->node));
+ dsi->dev.of_node = info->node;
dsi->channel = info->channel;
strlcpy(dsi->name, info->type, sizeof(dsi->name));
---
base-commit: 74cdd62cb4706515b454ce5bacb73b566c1d1bcf
change-id: 20241024-fixup-5-15-5fdd68dae707
Best regards,
--
Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
From: "Jason-JH.Lin" <jason-jh.lin(a)mediatek.com>
This reverts commit ac88a1f41f93499df6f50fd18ea835e6ff4f3200.
Reason for revert:
1. The commit [1] does not land on linux-5.15, so this patch does not
fix anything.
2. Since the fw_device improvements series [2] does not land on
linux-5.15, using device_set_fwnode() causes the panel to flash during
bootup.
Incorrect link management may lead to incorrect device initialization,
affecting firmware node links and consumer relationships.
The fwnode setting of panel to the DSI device would cause a DSI
initialization error without series[2], so this patch was reverted to
avoid using the incomplete fw_devlink functionality.
[1] commit 3fb16866b51d ("driver core: fw_devlink: Make cycle detection more robust")
[2] Link: https://lore.kernel.org/all/20230207014207.1678715-1-saravanak@google.com
Cc: stable(a)vger.kernel.org # 5.15.169
Cc: stable(a)vger.kernel.org # 5.10.228
Cc: stable(a)vger.kernel.org # 5.4.284
Signed-off-by: Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
---
drivers/gpu/drm/drm_mipi_dsi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_mipi_dsi.c b/drivers/gpu/drm/drm_mipi_dsi.c
index 24606b632009..468a3a7cb6a5 100644
--- a/drivers/gpu/drm/drm_mipi_dsi.c
+++ b/drivers/gpu/drm/drm_mipi_dsi.c
@@ -221,7 +221,7 @@ mipi_dsi_device_register_full(struct mipi_dsi_host *host,
return dsi;
}
- device_set_node(&dsi->dev, of_fwnode_handle(info->node));
+ dsi->dev.of_node = info->node;
dsi->channel = info->channel;
strlcpy(dsi->name, info->type, sizeof(dsi->name));
---
base-commit: 74cdd62cb4706515b454ce5bacb73b566c1d1bcf
change-id: 20241024-fixup-5-15-5fdd68dae707
Best regards,
--
Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
Please remove the following patch from all stable queues:
2ff949441802 ("block: fix sanity checks in blk_rq_map_user_bvec")
The above patch should not go into any stable tree unless accompanied by
its (currently inflight) fix:
https://lore.kernel.org/linux-block/20241028090840.446180-1-hch@lst.de/
The patch titled
Subject: vmscan,migrate: fix page count imbalance on node stats when demoting pages
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Gregory Price <gourry(a)gourry.net>
Subject: vmscan,migrate: fix page count imbalance on node stats when demoting pages
Date: Fri, 25 Oct 2024 10:17:24 -0400
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. migrate_pages will decrement the
the node's isolated page count, leading to an imbalanced count when
invoked from (MG)LRU code.
The result is dmesg output like such:
$ cat /proc/sys/vm/stat_refresh
[77383.088417] vmstat_refresh: nr_isolated_anon -103212
[77383.088417] vmstat_refresh: nr_isolated_file -899642
This negative value may impact compaction and reclaim throttling.
The following path produces the decrement:
shrink_folio_list
demote_folio_list
migrate_pages
migrate_pages_batch
migrate_folio_move
migrate_folio_done
mod_node_page_state(-ve) <- decrement
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count when
the migration reason is MR_DEMOTION. As of v6.11, this demotion logic
is the only source of MR_DEMOTION.
Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net
Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim")
Signed-off-by: Gregory Price <gourry(a)gourry.net>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Wei Xu <weixugc(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/migrate.c~vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages
+++ a/mm/migrate.c
@@ -1178,7 +1178,7 @@ static void migrate_folio_done(struct fo
* not accounted to NR_ISOLATED_*. They can be recognized
* as __folio_test_movable
*/
- if (likely(!__folio_test_movable(src)))
+ if (likely(!__folio_test_movable(src)) && reason != MR_DEMOTION)
mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
folio_is_file_lru(src), -folio_nr_pages(src));
_
Patches currently in -mm which might be from gourry(a)gourry.net are
vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages.patch
Two small fixes related to the MPTCP packets scheduler:
- Patch 1: add missing rcu_read_(un)lock(). A fix for >= 6.6.
- Patch 2: remove unneeded lock when listing packets schedulers. A fix
for >= 6.10.
And some modifications in the MPTCP selftests:
- Patch 3: a small addition to the MPTCP selftests to cover more code.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (3):
mptcp: init: protect sched with rcu_read_lock
mptcp: remove unneeded lock when listing scheds
selftests: mptcp: list sysctl data
net/mptcp/protocol.c | 2 ++
net/mptcp/sched.c | 2 --
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 9 +++++++++
3 files changed, 11 insertions(+), 2 deletions(-)
---
base-commit: 3b05b9c36ddd01338e1352588f2ec1ea23f97d43
change-id: 20241021-net-mptcp-sched-lock-10dfc75d1e00
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
This series fixes two similar issues in tegra_210_xusb_padctl_probe().
Two resources (device_node *np and the device within struct
platform_device *pdev) are acquired, but never released after they are
no longer needed. To avoid leaking such resources, calls to
of_node_put() and put_device() must be added.
In this case, the resources are not assigned anywhere in the probe
function, and they must be released in the error and success paths. If
I overlooked any assignment, please report it as an error.
I have tried to affect the existing code and execution paths as less as
possible by releasing the device and device_node as soon as possible,
but if goto jumps to labels for the cleanup are desired, I can go for
that approach instead.
This series has been compiled successfully, but not tested on real
hardware as I don't have access to it. Any validation with the affected
hardware is always welcome.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
Javier Carrasco (2):
phy: tegra: xusb: fix device release in tegra210_xusb_padctl_probe
phy: tegra: xusb: fix device node release in tegra210_xusb_padctl_probe
drivers/phy/tegra/xusb-tegra210.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
---
base-commit: dec9255a128e19c5fcc3bdb18175d78094cc624d
change-id: 20241028-phy-tegra-xusb-tegra210-put_device-ff7ae76403b4
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Flush xe ordered_wq in case of ufence timeout which is observed
on LNL and that points to recent scheduling issue with E-cores.
This is similar to the recent fix:
commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
response timeout") and should be removed once there is a E-core
scheduling fix for LNL.
v2: Add platform check(Himal)
s/__flush_workqueue/flush_workqueue(Jani)
v3: Remove gfx platform check as the issue related to cpu
platform(John)
Cc: Badal Nilawar <badal.nilawar(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: John Harrison <John.C.Harrison(a)Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
Suggested-by: Matthew Brost <matthew.brost(a)intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das(a)intel.com>
Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
---
drivers/gpu/drm/xe/xe_wait_user_fence.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..886c9862d89c 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -155,6 +155,17 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
}
if (!timeout) {
+ /*
+ * This is analogous to e51527233804 ("drm/xe/guc/ct: Flush g2h worker
+ * in case of g2h response timeout")
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is
+ * fixed on LNL Hybrid CPU.
+ */
+ flush_workqueue(xe->ordered_wq);
+ err = do_compare(addr, args->value, args->mask, args->op);
+ if (err <= 0)
+ break;
err = -ETIME;
break;
}
--
2.46.0
Syzbot hit the following WARN_ON() in kvm_timer_update_irq():
WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
kvm_timer_update_irq+0x21c/0x394
Call trace:
kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl fs/ioctl.c:893 [inline]
__arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is
invoked after a failed first KVM_RUN. In a general sense though, since
kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past
initiatializations, it's possible that the VM's state could be left
half-baked. Any upcoming ioctls could behave erroneously because of
this.
Since these late vCPU initializations is past the point of attributing
the failures to any ioctl, instead of tearing down each of the previous
setups, simply mark the VM as dead, gving an opportunity for the
userspace to close and try again.
Cc: <stable(a)vger.kernel.org>
Reported-by: syzbot <syzkaller(a)googlegroups.com>
Suggested-by: Oliver Upton <oliver.upton(a)linux.dev>
Signed-off-by: Raghavendra Rao Ananta <rananta(a)google.com>
---
arch/arm64/kvm/arm.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a0d01c46e4084..ae3551bc98aeb 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -821,12 +821,12 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
*/
ret = kvm_vgic_map_resources(kvm);
if (ret)
- return ret;
+ goto out_err;
}
ret = kvm_finalize_sys_regs(vcpu);
if (ret)
- return ret;
+ goto out_err;
/*
* This needs to happen after any restriction has been applied
@@ -836,16 +836,16 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
ret = kvm_timer_enable(vcpu);
if (ret)
- return ret;
+ goto out_err;
ret = kvm_arm_pmu_v3_enable(vcpu);
if (ret)
- return ret;
+ goto out_err;
if (is_protected_kvm_enabled()) {
ret = pkvm_create_hyp_vm(kvm);
if (ret)
- return ret;
+ goto out_err;
}
if (!irqchip_in_kernel(kvm)) {
@@ -869,6 +869,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
mutex_unlock(&kvm->arch.config_lock);
return ret;
+
+out_err:
+ kvm_vm_dead(kvm);
+ return ret;
}
bool kvm_arch_intc_initialized(struct kvm *kvm)
--
2.47.0.163.g1226f6d8fa-goog
This series adds the missing calls to of_node_put(arm_timer) to release
the resource, and then switches to the more robust approach that makes
use of the automatic cleanup facility (not available for all stable
kernels).
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
Changes in v2:
- Add second patch for automatic cleanup.
- Link to v1: https://lore.kernel.org/r/20241013-timer-ti-dm-systimer-of_node_put-v1-1-0c…
---
Javier Carrasco (2):
clocksource/drivers/timer-ti-dm: fix child node refcount handling
clocksource/drivers/timer-ti-dm: automate device_node cleanup in dmtimer_percpu_quirk_init()
drivers/clocksource/timer-ti-dm-systimer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
---
base-commit: d61a00525464bfc5fe92c6ad713350988e492b88
change-id: 20241013-timer-ti-dm-systimer-of_node_put-d42735687698
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
This short series fixes an old bug that only happens if a memory
allocation fails, which might be the reason why it was never found. When
at it an unnecessary jump to a label has been removed to simplify the
error path and avoid jumps out of the loop, which might have hidden the
bug while reading the code.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
Javier Carrasco (2):
virt: fsl: fix missing of_node_put() on early exit from for_each_compatible_node()
virt: fsl: refactor out_of_memory label
drivers/virt/fsl_hypervisor.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
---
base-commit: dec9255a128e19c5fcc3bdb18175d78094cc624d
change-id: 20241028-fsl_hypervisor-of_node_put-5051e664f038
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Do not continue tpm2_sessions_init() further if the null key pair creation
fails.
Cc: stable(a)vger.kernel.org # v6.10+
Fixes: d2add27cf2b8 ("tpm: Add NULL primary creation")
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
---
v8:
- Refine commit message.
v7:
- Add the error message back but fix it up a bit:
1. Remove 'TPM:' given dev_err().
2. s/NULL/null/ as this has nothing to do with the macro in libc.
3. Fix the reasoning: null key creation failed
v6:
- Address:
https://lore.kernel.org/linux-integrity/69c893e7-6b87-4daa-80db-44d1120e80f…
as TPM RC is taken care of at the call site. Add also the missing
documentation for the return values.
v5:
- Do not print klog messages on error, as tpm2_save_context() already
takes care of this.
v4:
- Fixed up stable version.
v3:
- Handle TPM and POSIX error separately and return -ENODEV always back
to the caller.
v2:
- Refined the commit message.
---
drivers/char/tpm/tpm2-sessions.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c
index d3521aadd43e..a0306126e86c 100644
--- a/drivers/char/tpm/tpm2-sessions.c
+++ b/drivers/char/tpm/tpm2-sessions.c
@@ -1347,14 +1347,21 @@ static int tpm2_create_null_primary(struct tpm_chip *chip)
*
* Derive and context save the null primary and allocate memory in the
* struct tpm_chip for the authorizations.
+ *
+ * Return:
+ * * 0 - OK
+ * * -errno - A system error
+ * * TPM_RC - A TPM error
*/
int tpm2_sessions_init(struct tpm_chip *chip)
{
int rc;
rc = tpm2_create_null_primary(chip);
- if (rc)
- dev_err(&chip->dev, "TPM: security failed (NULL seed derivation): %d\n", rc);
+ if (rc) {
+ dev_err(&chip->dev, "null key creation failed with %d\n", rc);
+ return rc;
+ }
chip->auth = kmalloc(sizeof(*chip->auth), GFP_KERNEL);
if (!chip->auth)
--
2.47.0
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address
hint and of at least PMD_SIZE will be aligned to PMD so that it can
benefit from a THP backing page.
However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have
merged to a large THP-backed area before commit efa7df3e3bb5 and now
they are fragmented to multiple areas each aligned to PMD boundary with
gaps between. The regression then seems to be caused mainly due to the
benchmark's memory access pattern suffering from TLB or cache aliasing
due to the aligned boundaries of the individual areas.
Another known regression bisected to commit efa7df3e3bb5 is darktable
[2] [3] and early testing suggests this patch fixes the regression there
as well.
To fix the regression but still try to benefit from THP-friendly
anonymous mapping alignment, add a condition that the size of the
mapping must be a multiple of PMD size instead of at least PMD size. In
case of many odd-sized mapping like the cactusBSSN creates, those will
stop being aligned and with gaps between, and instead naturally merge
again.
Reported-by: Michael Matz <matz(a)suse.de>
Debugged-by: Gabriel Krisman Bertazi <gabriel(a)krisman.be>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
Reported-by: Matthias Bodenbinder <matthias(a)bodenbinder.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.i… [3]
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Cc: <stable(a)vger.kernel.org>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Yang Shi <yang(a)os.amperecomputing.com>
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
---
mm/mmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 9c0fb43064b5..a5297cfb1dfc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
if (get_area) {
addr = get_area(file, addr, len, pgoff, flags);
- } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+ && IS_ALIGNED(len, PMD_SIZE)) {
/* Ensures that larger anonymous mappings are THP aligned. */
addr = thp_get_unmapped_area_vmflags(file, addr, len,
pgoff, flags, vm_flags);
--
2.47.0
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 9c70b2a33cd2aa6a5a59c5523ef053bd42265209
Gitweb: https://git.kernel.org/tip/9c70b2a33cd2aa6a5a59c5523ef053bd42265209
Author: Shawn Wang <shawnwang(a)linux.alibaba.com>
AuthorDate: Fri, 25 Oct 2024 10:22:08 +08:00
Committer: Peter Zijlstra <peterz(a)infradead.org>
CommitterDate: Sat, 26 Oct 2024 09:28:37 +02:00
sched/numa: Fix the potential null pointer dereference in task_numa_work()
When running stress-ng-vm-segv test, we found a null pointer dereference
error in task_numa_work(). Here is the backtrace:
[323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
......
[323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se
......
[323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[323676.067115] pc : vma_migratable+0x1c/0xd0
[323676.067122] lr : task_numa_work+0x1ec/0x4e0
[323676.067127] sp : ffff8000ada73d20
[323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010
[323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000
[323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000
[323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8
[323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035
[323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8
[323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4
[323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001
[323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000
[323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000
[323676.067152] Call trace:
[323676.067153] vma_migratable+0x1c/0xd0
[323676.067155] task_numa_work+0x1ec/0x4e0
[323676.067157] task_work_run+0x78/0xd8
[323676.067161] do_notify_resume+0x1ec/0x290
[323676.067163] el0_svc+0x150/0x160
[323676.067167] el0t_64_sync_handler+0xf8/0x128
[323676.067170] el0t_64_sync+0x17c/0x180
[323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000)
[323676.067177] SMP: stopping secondary CPUs
[323676.070184] Starting crashdump kernel...
stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error
handling function of the system, which tries to cause a SIGSEGV error on
return from unmapping the whole address space of the child process.
Normally this program will not cause kernel crashes. But before the
munmap system call returns to user mode, a potential task_numa_work()
for numa balancing could be added and executed. In this scenario, since the
child process has no vma after munmap, the vma_next() in task_numa_work()
will return a null pointer even if the vma iterator restarts from 0.
Recheck the vma pointer before dereferencing it in task_numa_work().
Fixes: 214dbc428137 ("sched: convert to vma iterator")
Signed-off-by: Shawn Wang <shawnwang(a)linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org # v6.2+
Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.c…
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8796146..2d16c85 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3369,7 +3369,7 @@ retry_pids:
vma = vma_next(&vmi);
}
- do {
+ for (; vma; vma = vma_next(&vmi)) {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
@@ -3491,7 +3491,7 @@ retry_pids:
*/
if (vma_pids_forced)
break;
- } for_each_vma(vmi, vma);
+ }
/*
* If no VMAs are remaining and VMAs were skipped due to the PID
The gpiolib debugfs interface exports a list of all gpio chips in a
system and the state of their pins.
The gpio chip sections are supposed to be separated by a newline
character, but a long-standing bug prevents the separator from
being included when output is generated in multiple sessions, making the
output inconsistent and hard to read.
Make sure to only suppress the newline separator at the beginning of the
file as intended.
Fixes: f9c4a31f6150 ("gpiolib: Use seq_file's iterator interface")
Cc: stable(a)vger.kernel.org # 3.7
Cc: Thierry Reding <treding(a)nvidia.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/gpio/gpiolib.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index d5952ab7752c..e27488a90bc9 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -4926,6 +4926,8 @@ static void *gpiolib_seq_start(struct seq_file *s, loff_t *pos)
return NULL;
s->private = priv;
+ if (*pos > 0)
+ priv->newline = true;
priv->idx = srcu_read_lock(&gpio_devices_srcu);
list_for_each_entry_srcu(gdev, &gpio_devices, list,
--
2.45.2
I'm the creator and the maintainer of the mold linker
(https://github.com/rui314/mold). Recently, we discovered that mold
started causing process crashes in certain situations due to a change
in the Linux kernel. Here are the details:
- In general, overwriting an existing file is much faster than
creating an empty file and writing to it on Linux, so mold attempts to
reuse an existing executable file if it exists.
- If a program is running, opening the executable file for writing
previously failed with ETXTBSY. If that happens, mold falls back to
creating a new file.
- However, the Linux kernel recently changed the behavior so that
writing to an executable file is now always permitted
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…).
That caused mold to write to an executable file even if there's a
process running that file. Since changes to mmap'ed files are
immediately visible to other processes, any processes running that
file would almost certainly crash in a very mysterious way.
Identifying the cause of these random crashes took us a few days.
Rejecting writes to an executable file that is currently running is a
well-known behavior, and Linux had operated that way for a very long
time. So, I don’t believe relying on this behavior was our mistake;
rather, I see this as a regression in the Linux kernel.
Here is a bug report to the mold linker:
https://github.com/rui314/mold/issues/1361
#regzbot introduced: 2a010c41285345da60cece35575b4e0af7e7bf44
From: Shengjiu Wang <shengjiu.wang(a)nxp.com>
[ Upstream commit 54c805c1eb264c839fa3027d0073bb7f323b0722 ]
Irq handler need to be executed as fast as possible, so
the log in irq handler is better to use dev_dbg which needs
to be enabled when debugging.
Signed-off-by: Shengjiu Wang <shengjiu.wang(a)nxp.com>
Reviewed-by: Iuliana Prodan <iuliana.prodan(a)nxp.com>
Link: https://patch.msgid.link/1728622433-2873-1-git-send-email-shengjiu.wang@nxp…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/fsl/fsl_esai.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
index baa76337c33f3..e3c324467a9b4 100644
--- a/sound/soc/fsl/fsl_esai.c
+++ b/sound/soc/fsl/fsl_esai.c
@@ -77,10 +77,10 @@ static irqreturn_t esai_isr(int irq, void *devid)
dev_dbg(&pdev->dev, "isr: Transmission Initialized\n");
if (esr & ESAI_ESR_RFF_MASK)
- dev_warn(&pdev->dev, "isr: Receiving overrun\n");
+ dev_dbg(&pdev->dev, "isr: Receiving overrun\n");
if (esr & ESAI_ESR_TFE_MASK)
- dev_warn(&pdev->dev, "isr: Transmission underrun\n");
+ dev_dbg(&pdev->dev, "isr: Transmission underrun\n");
if (esr & ESAI_ESR_TLS_MASK)
dev_dbg(&pdev->dev, "isr: Just transmitted the last slot\n");
--
2.43.0
From: Shengjiu Wang <shengjiu.wang(a)nxp.com>
[ Upstream commit 54c805c1eb264c839fa3027d0073bb7f323b0722 ]
Irq handler need to be executed as fast as possible, so
the log in irq handler is better to use dev_dbg which needs
to be enabled when debugging.
Signed-off-by: Shengjiu Wang <shengjiu.wang(a)nxp.com>
Reviewed-by: Iuliana Prodan <iuliana.prodan(a)nxp.com>
Link: https://patch.msgid.link/1728622433-2873-1-git-send-email-shengjiu.wang@nxp…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/fsl/fsl_esai.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
index 33ade79fa032e..4904f48de612d 100644
--- a/sound/soc/fsl/fsl_esai.c
+++ b/sound/soc/fsl/fsl_esai.c
@@ -98,10 +98,10 @@ static irqreturn_t esai_isr(int irq, void *devid)
dev_dbg(&pdev->dev, "isr: Transmission Initialized\n");
if (esr & ESAI_ESR_RFF_MASK)
- dev_warn(&pdev->dev, "isr: Receiving overrun\n");
+ dev_dbg(&pdev->dev, "isr: Receiving overrun\n");
if (esr & ESAI_ESR_TFE_MASK)
- dev_warn(&pdev->dev, "isr: Transmission underrun\n");
+ dev_dbg(&pdev->dev, "isr: Transmission underrun\n");
if (esr & ESAI_ESR_TLS_MASK)
dev_dbg(&pdev->dev, "isr: Just transmitted the last slot\n");
--
2.43.0
Flush xe ordered_wq in case of ufence timeout which is observed
on LNL and that points to the recent scheduling issue with E-cores.
This is similar to the recent fix:
commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
response timeout") and should be removed once there is E core
scheduling fix.
v2: Add platform check(Himal)
s/__flush_workqueue/flush_workqueue(Jani)
Cc: Badal Nilawar <badal.nilawar(a)intel.com>
Cc: Jani Nikula <jani.nikula(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: John Harrison <John.C.Harrison(a)Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
Suggested-by: Matthew Brost <matthew.brost(a)intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das(a)intel.com>
Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
---
drivers/gpu/drm/xe/xe_wait_user_fence.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
#include "xe_device.h"
#include "xe_gt.h"
#include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
#include "xe_exec_queue.h"
static int do_compare(u64 addr, u64 value, u64 mask, u16 op)
@@ -155,6 +156,19 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
}
if (!timeout) {
+ if (IS_LUNARLAKE(xe)) {
+ /*
+ * This is analogous to e51527233804 ("drm/xe/guc/ct: Flush g2h
+ * worker in case of g2h response timeout")
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is
+ * fixed on LNL Hybrid CPU.
+ */
+ flush_workqueue(xe->ordered_wq);
+ err = do_compare(addr, args->value, args->mask, args->op);
+ if (err <= 0)
+ break;
+ }
err = -ETIME;
break;
}
--
2.46.0
In psnet_open_pf_bar() and snet_open_vf_bar() a string later passed to
pcim_iomap_regions() is placed on the stack. Neither
pcim_iomap_regions() nor the functions it calls copy that string.
Should the string later ever be used, this, consequently, causes
undefined behavior since the stack frame will by then have disappeared.
Fix the bug by allocating the strings on the heap through
devm_kasprintf().
Cc: stable(a)vger.kernel.org # v6.3
Fixes: 51a8f9d7f587 ("virtio: vdpa: new SolidNET DPU driver.")
Reported-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Closes: https://lore.kernel.org/all/74e9109a-ac59-49e2-9b1d-d825c9c9f891@wanadoo.fr/
Suggested-by: Andy Shevchenko <andy(a)kernel.org>
Signed-off-by: Philipp Stanner <pstanner(a)redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare(a)redhat.com>
---
Changes in v2:
- Add Stefano's RB
---
drivers/vdpa/solidrun/snet_main.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/vdpa/solidrun/snet_main.c b/drivers/vdpa/solidrun/snet_main.c
index 99428a04068d..c8b74980dbd1 100644
--- a/drivers/vdpa/solidrun/snet_main.c
+++ b/drivers/vdpa/solidrun/snet_main.c
@@ -555,7 +555,7 @@ static const struct vdpa_config_ops snet_config_ops = {
static int psnet_open_pf_bar(struct pci_dev *pdev, struct psnet *psnet)
{
- char name[50];
+ char *name;
int ret, i, mask = 0;
/* We don't know which BAR will be used to communicate..
* We will map every bar with len > 0.
@@ -573,7 +573,10 @@ static int psnet_open_pf_bar(struct pci_dev *pdev, struct psnet *psnet)
return -ENODEV;
}
- snprintf(name, sizeof(name), "psnet[%s]-bars", pci_name(pdev));
+ name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "psnet[%s]-bars", pci_name(pdev));
+ if (!name)
+ return -ENOMEM;
+
ret = pcim_iomap_regions(pdev, mask, name);
if (ret) {
SNET_ERR(pdev, "Failed to request and map PCI BARs\n");
@@ -590,10 +593,13 @@ static int psnet_open_pf_bar(struct pci_dev *pdev, struct psnet *psnet)
static int snet_open_vf_bar(struct pci_dev *pdev, struct snet *snet)
{
- char name[50];
+ char *name;
int ret;
- snprintf(name, sizeof(name), "snet[%s]-bar", pci_name(pdev));
+ name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "snet[%s]-bars", pci_name(pdev));
+ if (!name)
+ return -ENOMEM;
+
/* Request and map BAR */
ret = pcim_iomap_regions(pdev, BIT(snet->psnet->cfg.vf_bar), name);
if (ret) {
--
2.47.0
From: "Jason-JH.Lin" <jason-jh.lin(a)mediatek.com>
This reverts commit ac88a1f41f93499df6f50fd18ea835e6ff4f3200.
Reason for revert:
1. The commit [1] does not land on linux-5.15, so this patch does not
fix anything.
2. Since the fw_device improvements series [2] does not land on
linux-5.15, using device_set_fwnode() causes the panel to flash during
bootup.
Incorrect link management may lead to incorrect device initialization,
affecting firmware node links and consumer relationships.
The fwnode setting of panel to the DSI device would cause a DSI
initialization error without series[2], so this patch was reverted to
avoid using the incomplete fw_devlink functionality.
[1] commit 3fb16866b51d ("driver core: fw_devlink: Make cycle detection more robust")
[2] Link: https://lore.kernel.org/all/20230207014207.1678715-1-saravanak@google.com
Signed-off-by: Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
---
drivers/gpu/drm/drm_mipi_dsi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_mipi_dsi.c b/drivers/gpu/drm/drm_mipi_dsi.c
index 24606b632009..468a3a7cb6a5 100644
--- a/drivers/gpu/drm/drm_mipi_dsi.c
+++ b/drivers/gpu/drm/drm_mipi_dsi.c
@@ -221,7 +221,7 @@ mipi_dsi_device_register_full(struct mipi_dsi_host *host,
return dsi;
}
- device_set_node(&dsi->dev, of_fwnode_handle(info->node));
+ dsi->dev.of_node = info->node;
dsi->channel = info->channel;
strlcpy(dsi->name, info->type, sizeof(dsi->name));
---
base-commit: 74cdd62cb4706515b454ce5bacb73b566c1d1bcf
change-id: 20241024-fixup-5-15-5fdd68dae707
Best regards,
--
Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
On 22. 10. 24, 19:45, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> xhci: dbgtty: use kfifo from tty_port struct
This is a cleanup, not needed in stable.
--
js
suse labs
Hello,
The patch "[PATCH v3] perf/x86/rapl: Fix the energy-pkg event for AMD CPUs" fixes the RAPL energy-pkg event on AMD CPUs. It was broken by "commit 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf")", which got merged in v6.10-rc1.
I missed the "Fixes" tag while posting the patch on LKML.
Please backport the fix to 6.10 (I see this is EOL so probably this wont be possible) and 6.11 stable kernels. Mainline commit ID for the fix is "8d72eba1cf8c".
Thanks,
Dhananjay
Hi,
The commits below are required to enable amd_atl driver on AMD processors.
0f70fdd42559 x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h-70
f8bc84b6096f x86/amd_nb: Add new PCI ID for AMD family 1Ah model 20h
Please add those 2 commits to stable kernel 6.11.y. Thanks!
Regards,
Richard
On 24.10.24 13:17, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> btrfs: also add stripe entries for NOCOW writes
>
> to the 6.11-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> btrfs-also-add-stripe-entries-for-nocow-writes.patch
> and it can be found in the queue-6.11 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
Hey Sasha,
this patch is for the RAID stripe-tree feature marked as experimental,
so I don't think it's needed to be backported, as noone should use it
(apart from testing).
>
>
>
> commit ec508a5930024b40064d70cb3dbdf856760cdf5d
> Author: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
> Date: Thu Sep 19 12:16:38 2024 +0200
>
> btrfs: also add stripe entries for NOCOW writes
>
> [ Upstream commit 97f9782276fc9cb0de37a5eecb82204e48a5a612 ]
>
> NOCOW writes do not generate stripe_extent entries in the RAID stripe
> tree, as the RAID stripe-tree feature initially was designed with a
> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
> writes. But the RAID stripe-tree feature is independent from the zoned
> feature, so we must also do NOCOW writes for RAID stripe-tree filesystems.
>
> Reviewed-by: Naohiro Aota <naohiro.aota(a)wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
> Signed-off-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b1b6564ab68f0..48149c2e68954 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3087,6 +3087,11 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> ret = btrfs_update_inode_fallback(trans, inode);
> if (ret) /* -ENOMEM or corruption */
> btrfs_abort_transaction(trans, ret);
> +
> + ret = btrfs_insert_raid_extent(trans, ordered_extent);
> + if (ret)
> + btrfs_abort_transaction(trans, ret);
> +
> goto out;
> }
>
>
From: "Jason-JH.Lin" <jason-jh.lin(a)mediatek.com>
This reverts commit ac88a1f41f93499df6f50fd18ea835e6ff4f3200.
Reason for revert:
1. The commit [1] does not land on linux-5.15, so this patch does not
fix anything.
2. Since the fw_device improvements series [2] does not land on
linux-5.15, using device_set_fwnode() causes the panel to flash during
bootup.
Incorrect link management may lead to incorrect device initialization,
affecting firmware node links and consumer relationships.
The fwnode setting of panel to the DSI device would cause a DSI
initialization error without series[2], so this patch was reverted to
avoid using the incomplete fw_devlink functionality.
[1] commit 3fb16866b51d ("driver core: fw_devlink: Make cycle detection more robust")
[2] Link: https://lore.kernel.org/all/20230207014207.1678715-1-saravanak@google.com
Cc: stable(a)vger.kernel.org # 5.15.169
Signed-off-by: Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
---
drivers/gpu/drm/drm_mipi_dsi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_mipi_dsi.c b/drivers/gpu/drm/drm_mipi_dsi.c
index 24606b632009..468a3a7cb6a5 100644
--- a/drivers/gpu/drm/drm_mipi_dsi.c
+++ b/drivers/gpu/drm/drm_mipi_dsi.c
@@ -221,7 +221,7 @@ mipi_dsi_device_register_full(struct mipi_dsi_host *host,
return dsi;
}
- device_set_node(&dsi->dev, of_fwnode_handle(info->node));
+ dsi->dev.of_node = info->node;
dsi->channel = info->channel;
strlcpy(dsi->name, info->type, sizeof(dsi->name));
---
base-commit: 74cdd62cb4706515b454ce5bacb73b566c1d1bcf
change-id: 20241024-fixup-5-15-5fdd68dae707
Best regards,
--
Jason-JH.Lin <jason-jh.lin(a)mediatek.com>
Performing a stability stress test on a USB3.0 2.5G ethernet adapter
results in errors like this:
[ 91.441469] r8152 2-3:1.0 eth3: get_registers -71
[ 91.458659] r8152 2-3:1.0 eth3: get_registers -71
[ 91.475911] r8152 2-3:1.0 eth3: get_registers -71
[ 91.493203] r8152 2-3:1.0 eth3: get_registers -71
[ 91.510421] r8152 2-3:1.0 eth3: get_registers -71
The r8152 driver will periodically issue lots of control-IN requests
to access the status of ethernet adapter hardware registers during
the test.
This happens when the xHCI driver enqueue a control TD (which cross
over the Link TRB between two ring segments, as shown) in the endpoint
zero's transfer ring. Seems the Etron xHCI host can not perform this
TD correctly, causing the USB transfer error occurred, maybe the upper
driver retry that control-IN request can solve problem, but not all
drivers do this.
| |
-------
| TRB | Setup Stage
-------
| TRB | Link
-------
-------
| TRB | Data Stage
-------
| TRB | Status Stage
-------
| |
To work around this, the xHCI driver should enqueue a No Op TRB if
next available TRB is the Link TRB in the ring segment, this can
prevent the Setup and Data Stage TRB to be breaked by the Link TRB.
Check if the XHCI_ETRON_HOST quirk flag is set before invoking the
workaround in xhci_queue_ctrl_tx().
Fixes: d0e96f5a71a0 ("USB: xhci: Control transfer support.")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Kuangyi Chiang <ki.chiang65(a)gmail.com>
---
drivers/usb/host/xhci-ring.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index b6eb928e260f..9e132b08bfde 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -3727,6 +3727,20 @@ int xhci_queue_ctrl_tx(struct xhci_hcd *xhci, gfp_t mem_flags,
if (!urb->setup_packet)
return -EINVAL;
+ if ((xhci->quirks & XHCI_ETRON_HOST) &&
+ urb->dev->speed >= USB_SPEED_SUPER) {
+ /*
+ * If next available TRB is the Link TRB in the ring segment then
+ * enqueue a No Op TRB, this can prevent the Setup and Data Stage
+ * TRB to be breaked by the Link TRB.
+ */
+ if (trb_is_link(ep_ring->enqueue + 1)) {
+ field = TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state;
+ queue_trb(xhci, ep_ring, false, 0, 0,
+ TRB_INTR_TARGET(0), field);
+ }
+ }
+
/* 1 TRB for setup, 1 for status */
num_trbs = 2;
/*
--
2.25.1
Hello Dear,
I am giving away my late husband's Yamaha Piano to any instrument lover. Kindly let me know if you are interested or have someone who will be interested in the instrument.
Thank you,
K.Hall
The patch below does not apply to the 6.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.11.y
git checkout FETCH_HEAD
git cherry-pick -x 23d16ede33a4db4973468bf6652a09da5efd1468
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102838-opal-rule-c671@gregkh' --subject-prefix 'PATCH 6.11.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 23d16ede33a4db4973468bf6652a09da5efd1468 Mon Sep 17 00:00:00 2001
From: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Date: Tue, 1 Oct 2024 18:03:02 -0400
Subject: [PATCH] drm/amd/display: temp w/a for dGPU to enter idle
optimizations
[Why&How]
vblank immediate disable currently does not work for all asics. On
DCN401, the vblank interrupts never stop coming, and hence we never
get a chance to trigger idle optimizations.
Add a workaround to enable immediate disable only on APUs for now. This
adds a 2-frame delay for triggering idle optimization, which is a
negligible overhead.
Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
Fixes: e45b6716de4b ("drm/amd/display: use a more lax vblank enable policy for DCN35+")
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Reviewed-by: Harry Wentland <harry.wentland(a)amd.com>
Reviewed-by: Rodrigo Siqueira <rodrigo.siqueira(a)amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Signed-off-by: Wayne Lin <wayne.lin(a)amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 9b47278cec98e9894adf39229e91aaf4ab9140c5)
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 6b5e2206e687..13421a58210d 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8374,7 +8374,8 @@ static void manage_dm_interrupts(struct amdgpu_device *adev,
if (amdgpu_ip_version(adev, DCE_HWIP, 0) <
IP_VERSION(3, 5, 0) ||
acrtc_state->stream->link->psr_settings.psr_version <
- DC_PSR_VERSION_UNSUPPORTED) {
+ DC_PSR_VERSION_UNSUPPORTED ||
+ !(adev->flags & AMD_IS_APU)) {
timing = &acrtc_state->stream->timing;
/* at least 2 frames */
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102844-gibberish-surplus-b6d4@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a Mon Sep 17 00:00:00 2001
From: Olga Kornievskaia <okorniev(a)redhat.com>
Date: Fri, 18 Oct 2024 15:24:58 -0400
Subject: [PATCH] nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.
Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).
First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.
Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable(a)vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev(a)redhat.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 56b261608af4..d1a2c677be7e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1359,21 +1359,47 @@ static void destroy_delegation(struct nfs4_delegation *dp)
destroy_unhashed_deleg(dp);
}
+/**
+ * revoke_delegation - perform nfs4 delegation structure cleanup
+ * @dp: pointer to the delegation
+ *
+ * This function assumes that it's called either from the administrative
+ * interface (nfsd4_revoke_states()) that's revoking a specific delegation
+ * stateid or it's called from a laundromat thread (nfsd4_landromat()) that
+ * determined that this specific state has expired and needs to be revoked
+ * (both mark state with the appropriate stid sc_status mode). It is also
+ * assumed that a reference was taken on the @dp state.
+ *
+ * If this function finds that the @dp state is SC_STATUS_FREED it means
+ * that a FREE_STATEID operation for this stateid has been processed and
+ * we can proceed to removing it from recalled list. However, if @dp state
+ * isn't marked SC_STATUS_FREED, it means we need place it on the cl_revoked
+ * list and wait for the FREE_STATEID to arrive from the client. At the same
+ * time, we need to mark it as SC_STATUS_FREEABLE to indicate to the
+ * nfsd4_free_stateid() function that this stateid has already been added
+ * to the cl_revoked list and that nfsd4_free_stateid() is now responsible
+ * for removing it from the list. Inspection of where the delegation state
+ * in the revocation process is protected by the clp->cl_lock.
+ */
static void revoke_delegation(struct nfs4_delegation *dp)
{
struct nfs4_client *clp = dp->dl_stid.sc_client;
WARN_ON(!list_empty(&dp->dl_recall_lru));
+ WARN_ON_ONCE(!(dp->dl_stid.sc_status &
+ (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)));
trace_nfsd_stid_revoke(&dp->dl_stid);
- if (dp->dl_stid.sc_status &
- (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)) {
- spin_lock(&clp->cl_lock);
- refcount_inc(&dp->dl_stid.sc_count);
- list_add(&dp->dl_recall_lru, &clp->cl_revoked);
- spin_unlock(&clp->cl_lock);
+ spin_lock(&clp->cl_lock);
+ if (dp->dl_stid.sc_status & SC_STATUS_FREED) {
+ list_del_init(&dp->dl_recall_lru);
+ goto out;
}
+ list_add(&dp->dl_recall_lru, &clp->cl_revoked);
+ dp->dl_stid.sc_status |= SC_STATUS_FREEABLE;
+out:
+ spin_unlock(&clp->cl_lock);
destroy_unhashed_deleg(dp);
}
@@ -1780,6 +1806,7 @@ void nfsd4_revoke_states(struct net *net, struct super_block *sb)
mutex_unlock(&stp->st_mutex);
break;
case SC_TYPE_DELEG:
+ refcount_inc(&stid->sc_count);
dp = delegstateid(stid);
spin_lock(&state_lock);
if (!unhash_delegation_locked(
@@ -6545,6 +6572,7 @@ nfs4_laundromat(struct nfsd_net *nn)
dp = list_entry (pos, struct nfs4_delegation, dl_recall_lru);
if (!state_expired(<, dp->dl_time))
break;
+ refcount_inc(&dp->dl_stid.sc_count);
unhash_delegation_locked(dp, SC_STATUS_REVOKED);
list_add(&dp->dl_recall_lru, &reaplist);
}
@@ -7157,7 +7185,9 @@ nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
s->sc_status |= SC_STATUS_CLOSED;
spin_unlock(&s->sc_lock);
dp = delegstateid(s);
- list_del_init(&dp->dl_recall_lru);
+ if (s->sc_status & SC_STATUS_FREEABLE)
+ list_del_init(&dp->dl_recall_lru);
+ s->sc_status |= SC_STATUS_FREED;
spin_unlock(&cl->cl_lock);
nfs4_put_stid(s);
ret = nfs_ok;
@@ -7487,7 +7517,9 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if ((status = fh_verify(rqstp, &cstate->current_fh, S_IFREG, 0)))
return status;
- status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG, 0, &s, nn);
+ status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG,
+ SC_STATUS_REVOKED | SC_STATUS_FREEABLE,
+ &s, nn);
if (status)
goto out;
dp = delegstateid(s);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 79c743c01a47..35b3564c065f 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -114,6 +114,8 @@ struct nfs4_stid {
/* For a deleg stateid kept around only to process free_stateid's: */
#define SC_STATUS_REVOKED BIT(1)
#define SC_STATUS_ADMIN_REVOKED BIT(2)
+#define SC_STATUS_FREEABLE BIT(3)
+#define SC_STATUS_FREED BIT(4)
unsigned short sc_status;
struct list_head sc_cp_list;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102842-isolation-backspace-e3c1@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a Mon Sep 17 00:00:00 2001
From: Olga Kornievskaia <okorniev(a)redhat.com>
Date: Fri, 18 Oct 2024 15:24:58 -0400
Subject: [PATCH] nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.
Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).
First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.
Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable(a)vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev(a)redhat.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 56b261608af4..d1a2c677be7e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1359,21 +1359,47 @@ static void destroy_delegation(struct nfs4_delegation *dp)
destroy_unhashed_deleg(dp);
}
+/**
+ * revoke_delegation - perform nfs4 delegation structure cleanup
+ * @dp: pointer to the delegation
+ *
+ * This function assumes that it's called either from the administrative
+ * interface (nfsd4_revoke_states()) that's revoking a specific delegation
+ * stateid or it's called from a laundromat thread (nfsd4_landromat()) that
+ * determined that this specific state has expired and needs to be revoked
+ * (both mark state with the appropriate stid sc_status mode). It is also
+ * assumed that a reference was taken on the @dp state.
+ *
+ * If this function finds that the @dp state is SC_STATUS_FREED it means
+ * that a FREE_STATEID operation for this stateid has been processed and
+ * we can proceed to removing it from recalled list. However, if @dp state
+ * isn't marked SC_STATUS_FREED, it means we need place it on the cl_revoked
+ * list and wait for the FREE_STATEID to arrive from the client. At the same
+ * time, we need to mark it as SC_STATUS_FREEABLE to indicate to the
+ * nfsd4_free_stateid() function that this stateid has already been added
+ * to the cl_revoked list and that nfsd4_free_stateid() is now responsible
+ * for removing it from the list. Inspection of where the delegation state
+ * in the revocation process is protected by the clp->cl_lock.
+ */
static void revoke_delegation(struct nfs4_delegation *dp)
{
struct nfs4_client *clp = dp->dl_stid.sc_client;
WARN_ON(!list_empty(&dp->dl_recall_lru));
+ WARN_ON_ONCE(!(dp->dl_stid.sc_status &
+ (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)));
trace_nfsd_stid_revoke(&dp->dl_stid);
- if (dp->dl_stid.sc_status &
- (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)) {
- spin_lock(&clp->cl_lock);
- refcount_inc(&dp->dl_stid.sc_count);
- list_add(&dp->dl_recall_lru, &clp->cl_revoked);
- spin_unlock(&clp->cl_lock);
+ spin_lock(&clp->cl_lock);
+ if (dp->dl_stid.sc_status & SC_STATUS_FREED) {
+ list_del_init(&dp->dl_recall_lru);
+ goto out;
}
+ list_add(&dp->dl_recall_lru, &clp->cl_revoked);
+ dp->dl_stid.sc_status |= SC_STATUS_FREEABLE;
+out:
+ spin_unlock(&clp->cl_lock);
destroy_unhashed_deleg(dp);
}
@@ -1780,6 +1806,7 @@ void nfsd4_revoke_states(struct net *net, struct super_block *sb)
mutex_unlock(&stp->st_mutex);
break;
case SC_TYPE_DELEG:
+ refcount_inc(&stid->sc_count);
dp = delegstateid(stid);
spin_lock(&state_lock);
if (!unhash_delegation_locked(
@@ -6545,6 +6572,7 @@ nfs4_laundromat(struct nfsd_net *nn)
dp = list_entry (pos, struct nfs4_delegation, dl_recall_lru);
if (!state_expired(<, dp->dl_time))
break;
+ refcount_inc(&dp->dl_stid.sc_count);
unhash_delegation_locked(dp, SC_STATUS_REVOKED);
list_add(&dp->dl_recall_lru, &reaplist);
}
@@ -7157,7 +7185,9 @@ nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
s->sc_status |= SC_STATUS_CLOSED;
spin_unlock(&s->sc_lock);
dp = delegstateid(s);
- list_del_init(&dp->dl_recall_lru);
+ if (s->sc_status & SC_STATUS_FREEABLE)
+ list_del_init(&dp->dl_recall_lru);
+ s->sc_status |= SC_STATUS_FREED;
spin_unlock(&cl->cl_lock);
nfs4_put_stid(s);
ret = nfs_ok;
@@ -7487,7 +7517,9 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if ((status = fh_verify(rqstp, &cstate->current_fh, S_IFREG, 0)))
return status;
- status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG, 0, &s, nn);
+ status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG,
+ SC_STATUS_REVOKED | SC_STATUS_FREEABLE,
+ &s, nn);
if (status)
goto out;
dp = delegstateid(s);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 79c743c01a47..35b3564c065f 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -114,6 +114,8 @@ struct nfs4_stid {
/* For a deleg stateid kept around only to process free_stateid's: */
#define SC_STATUS_REVOKED BIT(1)
#define SC_STATUS_ADMIN_REVOKED BIT(2)
+#define SC_STATUS_FREEABLE BIT(3)
+#define SC_STATUS_FREED BIT(4)
unsigned short sc_status;
struct list_head sc_cp_list;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102839-bullhorn-canister-fae3@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a Mon Sep 17 00:00:00 2001
From: Olga Kornievskaia <okorniev(a)redhat.com>
Date: Fri, 18 Oct 2024 15:24:58 -0400
Subject: [PATCH] nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.
Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).
First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.
Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable(a)vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev(a)redhat.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 56b261608af4..d1a2c677be7e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1359,21 +1359,47 @@ static void destroy_delegation(struct nfs4_delegation *dp)
destroy_unhashed_deleg(dp);
}
+/**
+ * revoke_delegation - perform nfs4 delegation structure cleanup
+ * @dp: pointer to the delegation
+ *
+ * This function assumes that it's called either from the administrative
+ * interface (nfsd4_revoke_states()) that's revoking a specific delegation
+ * stateid or it's called from a laundromat thread (nfsd4_landromat()) that
+ * determined that this specific state has expired and needs to be revoked
+ * (both mark state with the appropriate stid sc_status mode). It is also
+ * assumed that a reference was taken on the @dp state.
+ *
+ * If this function finds that the @dp state is SC_STATUS_FREED it means
+ * that a FREE_STATEID operation for this stateid has been processed and
+ * we can proceed to removing it from recalled list. However, if @dp state
+ * isn't marked SC_STATUS_FREED, it means we need place it on the cl_revoked
+ * list and wait for the FREE_STATEID to arrive from the client. At the same
+ * time, we need to mark it as SC_STATUS_FREEABLE to indicate to the
+ * nfsd4_free_stateid() function that this stateid has already been added
+ * to the cl_revoked list and that nfsd4_free_stateid() is now responsible
+ * for removing it from the list. Inspection of where the delegation state
+ * in the revocation process is protected by the clp->cl_lock.
+ */
static void revoke_delegation(struct nfs4_delegation *dp)
{
struct nfs4_client *clp = dp->dl_stid.sc_client;
WARN_ON(!list_empty(&dp->dl_recall_lru));
+ WARN_ON_ONCE(!(dp->dl_stid.sc_status &
+ (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)));
trace_nfsd_stid_revoke(&dp->dl_stid);
- if (dp->dl_stid.sc_status &
- (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)) {
- spin_lock(&clp->cl_lock);
- refcount_inc(&dp->dl_stid.sc_count);
- list_add(&dp->dl_recall_lru, &clp->cl_revoked);
- spin_unlock(&clp->cl_lock);
+ spin_lock(&clp->cl_lock);
+ if (dp->dl_stid.sc_status & SC_STATUS_FREED) {
+ list_del_init(&dp->dl_recall_lru);
+ goto out;
}
+ list_add(&dp->dl_recall_lru, &clp->cl_revoked);
+ dp->dl_stid.sc_status |= SC_STATUS_FREEABLE;
+out:
+ spin_unlock(&clp->cl_lock);
destroy_unhashed_deleg(dp);
}
@@ -1780,6 +1806,7 @@ void nfsd4_revoke_states(struct net *net, struct super_block *sb)
mutex_unlock(&stp->st_mutex);
break;
case SC_TYPE_DELEG:
+ refcount_inc(&stid->sc_count);
dp = delegstateid(stid);
spin_lock(&state_lock);
if (!unhash_delegation_locked(
@@ -6545,6 +6572,7 @@ nfs4_laundromat(struct nfsd_net *nn)
dp = list_entry (pos, struct nfs4_delegation, dl_recall_lru);
if (!state_expired(<, dp->dl_time))
break;
+ refcount_inc(&dp->dl_stid.sc_count);
unhash_delegation_locked(dp, SC_STATUS_REVOKED);
list_add(&dp->dl_recall_lru, &reaplist);
}
@@ -7157,7 +7185,9 @@ nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
s->sc_status |= SC_STATUS_CLOSED;
spin_unlock(&s->sc_lock);
dp = delegstateid(s);
- list_del_init(&dp->dl_recall_lru);
+ if (s->sc_status & SC_STATUS_FREEABLE)
+ list_del_init(&dp->dl_recall_lru);
+ s->sc_status |= SC_STATUS_FREED;
spin_unlock(&cl->cl_lock);
nfs4_put_stid(s);
ret = nfs_ok;
@@ -7487,7 +7517,9 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if ((status = fh_verify(rqstp, &cstate->current_fh, S_IFREG, 0)))
return status;
- status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG, 0, &s, nn);
+ status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG,
+ SC_STATUS_REVOKED | SC_STATUS_FREEABLE,
+ &s, nn);
if (status)
goto out;
dp = delegstateid(s);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 79c743c01a47..35b3564c065f 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -114,6 +114,8 @@ struct nfs4_stid {
/* For a deleg stateid kept around only to process free_stateid's: */
#define SC_STATUS_REVOKED BIT(1)
#define SC_STATUS_ADMIN_REVOKED BIT(2)
+#define SC_STATUS_FREEABLE BIT(3)
+#define SC_STATUS_FREED BIT(4)
unsigned short sc_status;
struct list_head sc_cp_list;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102837-elusive-service-68d6@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a Mon Sep 17 00:00:00 2001
From: Olga Kornievskaia <okorniev(a)redhat.com>
Date: Fri, 18 Oct 2024 15:24:58 -0400
Subject: [PATCH] nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.
Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).
First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.
Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable(a)vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev(a)redhat.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 56b261608af4..d1a2c677be7e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1359,21 +1359,47 @@ static void destroy_delegation(struct nfs4_delegation *dp)
destroy_unhashed_deleg(dp);
}
+/**
+ * revoke_delegation - perform nfs4 delegation structure cleanup
+ * @dp: pointer to the delegation
+ *
+ * This function assumes that it's called either from the administrative
+ * interface (nfsd4_revoke_states()) that's revoking a specific delegation
+ * stateid or it's called from a laundromat thread (nfsd4_landromat()) that
+ * determined that this specific state has expired and needs to be revoked
+ * (both mark state with the appropriate stid sc_status mode). It is also
+ * assumed that a reference was taken on the @dp state.
+ *
+ * If this function finds that the @dp state is SC_STATUS_FREED it means
+ * that a FREE_STATEID operation for this stateid has been processed and
+ * we can proceed to removing it from recalled list. However, if @dp state
+ * isn't marked SC_STATUS_FREED, it means we need place it on the cl_revoked
+ * list and wait for the FREE_STATEID to arrive from the client. At the same
+ * time, we need to mark it as SC_STATUS_FREEABLE to indicate to the
+ * nfsd4_free_stateid() function that this stateid has already been added
+ * to the cl_revoked list and that nfsd4_free_stateid() is now responsible
+ * for removing it from the list. Inspection of where the delegation state
+ * in the revocation process is protected by the clp->cl_lock.
+ */
static void revoke_delegation(struct nfs4_delegation *dp)
{
struct nfs4_client *clp = dp->dl_stid.sc_client;
WARN_ON(!list_empty(&dp->dl_recall_lru));
+ WARN_ON_ONCE(!(dp->dl_stid.sc_status &
+ (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)));
trace_nfsd_stid_revoke(&dp->dl_stid);
- if (dp->dl_stid.sc_status &
- (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)) {
- spin_lock(&clp->cl_lock);
- refcount_inc(&dp->dl_stid.sc_count);
- list_add(&dp->dl_recall_lru, &clp->cl_revoked);
- spin_unlock(&clp->cl_lock);
+ spin_lock(&clp->cl_lock);
+ if (dp->dl_stid.sc_status & SC_STATUS_FREED) {
+ list_del_init(&dp->dl_recall_lru);
+ goto out;
}
+ list_add(&dp->dl_recall_lru, &clp->cl_revoked);
+ dp->dl_stid.sc_status |= SC_STATUS_FREEABLE;
+out:
+ spin_unlock(&clp->cl_lock);
destroy_unhashed_deleg(dp);
}
@@ -1780,6 +1806,7 @@ void nfsd4_revoke_states(struct net *net, struct super_block *sb)
mutex_unlock(&stp->st_mutex);
break;
case SC_TYPE_DELEG:
+ refcount_inc(&stid->sc_count);
dp = delegstateid(stid);
spin_lock(&state_lock);
if (!unhash_delegation_locked(
@@ -6545,6 +6572,7 @@ nfs4_laundromat(struct nfsd_net *nn)
dp = list_entry (pos, struct nfs4_delegation, dl_recall_lru);
if (!state_expired(<, dp->dl_time))
break;
+ refcount_inc(&dp->dl_stid.sc_count);
unhash_delegation_locked(dp, SC_STATUS_REVOKED);
list_add(&dp->dl_recall_lru, &reaplist);
}
@@ -7157,7 +7185,9 @@ nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
s->sc_status |= SC_STATUS_CLOSED;
spin_unlock(&s->sc_lock);
dp = delegstateid(s);
- list_del_init(&dp->dl_recall_lru);
+ if (s->sc_status & SC_STATUS_FREEABLE)
+ list_del_init(&dp->dl_recall_lru);
+ s->sc_status |= SC_STATUS_FREED;
spin_unlock(&cl->cl_lock);
nfs4_put_stid(s);
ret = nfs_ok;
@@ -7487,7 +7517,9 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if ((status = fh_verify(rqstp, &cstate->current_fh, S_IFREG, 0)))
return status;
- status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG, 0, &s, nn);
+ status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG,
+ SC_STATUS_REVOKED | SC_STATUS_FREEABLE,
+ &s, nn);
if (status)
goto out;
dp = delegstateid(s);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 79c743c01a47..35b3564c065f 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -114,6 +114,8 @@ struct nfs4_stid {
/* For a deleg stateid kept around only to process free_stateid's: */
#define SC_STATUS_REVOKED BIT(1)
#define SC_STATUS_ADMIN_REVOKED BIT(2)
+#define SC_STATUS_FREEABLE BIT(3)
+#define SC_STATUS_FREED BIT(4)
unsigned short sc_status;
struct list_head sc_cp_list;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102835-pungent-sadly-717e@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a Mon Sep 17 00:00:00 2001
From: Olga Kornievskaia <okorniev(a)redhat.com>
Date: Fri, 18 Oct 2024 15:24:58 -0400
Subject: [PATCH] nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.
Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).
First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.
Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable(a)vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev(a)redhat.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 56b261608af4..d1a2c677be7e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1359,21 +1359,47 @@ static void destroy_delegation(struct nfs4_delegation *dp)
destroy_unhashed_deleg(dp);
}
+/**
+ * revoke_delegation - perform nfs4 delegation structure cleanup
+ * @dp: pointer to the delegation
+ *
+ * This function assumes that it's called either from the administrative
+ * interface (nfsd4_revoke_states()) that's revoking a specific delegation
+ * stateid or it's called from a laundromat thread (nfsd4_landromat()) that
+ * determined that this specific state has expired and needs to be revoked
+ * (both mark state with the appropriate stid sc_status mode). It is also
+ * assumed that a reference was taken on the @dp state.
+ *
+ * If this function finds that the @dp state is SC_STATUS_FREED it means
+ * that a FREE_STATEID operation for this stateid has been processed and
+ * we can proceed to removing it from recalled list. However, if @dp state
+ * isn't marked SC_STATUS_FREED, it means we need place it on the cl_revoked
+ * list and wait for the FREE_STATEID to arrive from the client. At the same
+ * time, we need to mark it as SC_STATUS_FREEABLE to indicate to the
+ * nfsd4_free_stateid() function that this stateid has already been added
+ * to the cl_revoked list and that nfsd4_free_stateid() is now responsible
+ * for removing it from the list. Inspection of where the delegation state
+ * in the revocation process is protected by the clp->cl_lock.
+ */
static void revoke_delegation(struct nfs4_delegation *dp)
{
struct nfs4_client *clp = dp->dl_stid.sc_client;
WARN_ON(!list_empty(&dp->dl_recall_lru));
+ WARN_ON_ONCE(!(dp->dl_stid.sc_status &
+ (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)));
trace_nfsd_stid_revoke(&dp->dl_stid);
- if (dp->dl_stid.sc_status &
- (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)) {
- spin_lock(&clp->cl_lock);
- refcount_inc(&dp->dl_stid.sc_count);
- list_add(&dp->dl_recall_lru, &clp->cl_revoked);
- spin_unlock(&clp->cl_lock);
+ spin_lock(&clp->cl_lock);
+ if (dp->dl_stid.sc_status & SC_STATUS_FREED) {
+ list_del_init(&dp->dl_recall_lru);
+ goto out;
}
+ list_add(&dp->dl_recall_lru, &clp->cl_revoked);
+ dp->dl_stid.sc_status |= SC_STATUS_FREEABLE;
+out:
+ spin_unlock(&clp->cl_lock);
destroy_unhashed_deleg(dp);
}
@@ -1780,6 +1806,7 @@ void nfsd4_revoke_states(struct net *net, struct super_block *sb)
mutex_unlock(&stp->st_mutex);
break;
case SC_TYPE_DELEG:
+ refcount_inc(&stid->sc_count);
dp = delegstateid(stid);
spin_lock(&state_lock);
if (!unhash_delegation_locked(
@@ -6545,6 +6572,7 @@ nfs4_laundromat(struct nfsd_net *nn)
dp = list_entry (pos, struct nfs4_delegation, dl_recall_lru);
if (!state_expired(<, dp->dl_time))
break;
+ refcount_inc(&dp->dl_stid.sc_count);
unhash_delegation_locked(dp, SC_STATUS_REVOKED);
list_add(&dp->dl_recall_lru, &reaplist);
}
@@ -7157,7 +7185,9 @@ nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
s->sc_status |= SC_STATUS_CLOSED;
spin_unlock(&s->sc_lock);
dp = delegstateid(s);
- list_del_init(&dp->dl_recall_lru);
+ if (s->sc_status & SC_STATUS_FREEABLE)
+ list_del_init(&dp->dl_recall_lru);
+ s->sc_status |= SC_STATUS_FREED;
spin_unlock(&cl->cl_lock);
nfs4_put_stid(s);
ret = nfs_ok;
@@ -7487,7 +7517,9 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if ((status = fh_verify(rqstp, &cstate->current_fh, S_IFREG, 0)))
return status;
- status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG, 0, &s, nn);
+ status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG,
+ SC_STATUS_REVOKED | SC_STATUS_FREEABLE,
+ &s, nn);
if (status)
goto out;
dp = delegstateid(s);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 79c743c01a47..35b3564c065f 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -114,6 +114,8 @@ struct nfs4_stid {
/* For a deleg stateid kept around only to process free_stateid's: */
#define SC_STATUS_REVOKED BIT(1)
#define SC_STATUS_ADMIN_REVOKED BIT(2)
+#define SC_STATUS_FREEABLE BIT(3)
+#define SC_STATUS_FREED BIT(4)
unsigned short sc_status;
struct list_head sc_cp_list;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102832-murky-pasty-feca@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a Mon Sep 17 00:00:00 2001
From: Olga Kornievskaia <okorniev(a)redhat.com>
Date: Fri, 18 Oct 2024 15:24:58 -0400
Subject: [PATCH] nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.
Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).
First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.
Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable(a)vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev(a)redhat.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 56b261608af4..d1a2c677be7e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1359,21 +1359,47 @@ static void destroy_delegation(struct nfs4_delegation *dp)
destroy_unhashed_deleg(dp);
}
+/**
+ * revoke_delegation - perform nfs4 delegation structure cleanup
+ * @dp: pointer to the delegation
+ *
+ * This function assumes that it's called either from the administrative
+ * interface (nfsd4_revoke_states()) that's revoking a specific delegation
+ * stateid or it's called from a laundromat thread (nfsd4_landromat()) that
+ * determined that this specific state has expired and needs to be revoked
+ * (both mark state with the appropriate stid sc_status mode). It is also
+ * assumed that a reference was taken on the @dp state.
+ *
+ * If this function finds that the @dp state is SC_STATUS_FREED it means
+ * that a FREE_STATEID operation for this stateid has been processed and
+ * we can proceed to removing it from recalled list. However, if @dp state
+ * isn't marked SC_STATUS_FREED, it means we need place it on the cl_revoked
+ * list and wait for the FREE_STATEID to arrive from the client. At the same
+ * time, we need to mark it as SC_STATUS_FREEABLE to indicate to the
+ * nfsd4_free_stateid() function that this stateid has already been added
+ * to the cl_revoked list and that nfsd4_free_stateid() is now responsible
+ * for removing it from the list. Inspection of where the delegation state
+ * in the revocation process is protected by the clp->cl_lock.
+ */
static void revoke_delegation(struct nfs4_delegation *dp)
{
struct nfs4_client *clp = dp->dl_stid.sc_client;
WARN_ON(!list_empty(&dp->dl_recall_lru));
+ WARN_ON_ONCE(!(dp->dl_stid.sc_status &
+ (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)));
trace_nfsd_stid_revoke(&dp->dl_stid);
- if (dp->dl_stid.sc_status &
- (SC_STATUS_REVOKED | SC_STATUS_ADMIN_REVOKED)) {
- spin_lock(&clp->cl_lock);
- refcount_inc(&dp->dl_stid.sc_count);
- list_add(&dp->dl_recall_lru, &clp->cl_revoked);
- spin_unlock(&clp->cl_lock);
+ spin_lock(&clp->cl_lock);
+ if (dp->dl_stid.sc_status & SC_STATUS_FREED) {
+ list_del_init(&dp->dl_recall_lru);
+ goto out;
}
+ list_add(&dp->dl_recall_lru, &clp->cl_revoked);
+ dp->dl_stid.sc_status |= SC_STATUS_FREEABLE;
+out:
+ spin_unlock(&clp->cl_lock);
destroy_unhashed_deleg(dp);
}
@@ -1780,6 +1806,7 @@ void nfsd4_revoke_states(struct net *net, struct super_block *sb)
mutex_unlock(&stp->st_mutex);
break;
case SC_TYPE_DELEG:
+ refcount_inc(&stid->sc_count);
dp = delegstateid(stid);
spin_lock(&state_lock);
if (!unhash_delegation_locked(
@@ -6545,6 +6572,7 @@ nfs4_laundromat(struct nfsd_net *nn)
dp = list_entry (pos, struct nfs4_delegation, dl_recall_lru);
if (!state_expired(<, dp->dl_time))
break;
+ refcount_inc(&dp->dl_stid.sc_count);
unhash_delegation_locked(dp, SC_STATUS_REVOKED);
list_add(&dp->dl_recall_lru, &reaplist);
}
@@ -7157,7 +7185,9 @@ nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
s->sc_status |= SC_STATUS_CLOSED;
spin_unlock(&s->sc_lock);
dp = delegstateid(s);
- list_del_init(&dp->dl_recall_lru);
+ if (s->sc_status & SC_STATUS_FREEABLE)
+ list_del_init(&dp->dl_recall_lru);
+ s->sc_status |= SC_STATUS_FREED;
spin_unlock(&cl->cl_lock);
nfs4_put_stid(s);
ret = nfs_ok;
@@ -7487,7 +7517,9 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if ((status = fh_verify(rqstp, &cstate->current_fh, S_IFREG, 0)))
return status;
- status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG, 0, &s, nn);
+ status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG,
+ SC_STATUS_REVOKED | SC_STATUS_FREEABLE,
+ &s, nn);
if (status)
goto out;
dp = delegstateid(s);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 79c743c01a47..35b3564c065f 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -114,6 +114,8 @@ struct nfs4_stid {
/* For a deleg stateid kept around only to process free_stateid's: */
#define SC_STATUS_REVOKED BIT(1)
#define SC_STATUS_ADMIN_REVOKED BIT(2)
+#define SC_STATUS_FREEABLE BIT(3)
+#define SC_STATUS_FREED BIT(4)
unsigned short sc_status;
struct list_head sc_cp_list;
The patch below does not apply to the 6.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.11.y
git checkout FETCH_HEAD
git cherry-pick -x 108bc59fe817686a59d2008f217bad38a5cf4427
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102849-flannels-ashy-e1bf@gregkh' --subject-prefix 'PATCH 6.11.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 108bc59fe817686a59d2008f217bad38a5cf4427 Mon Sep 17 00:00:00 2001
From: Frank Min <Frank.Min(a)amd.com>
Date: Thu, 10 Oct 2024 16:41:32 +0800
Subject: [PATCH] drm/amdgpu: fix random data corruption for sdma 7
There is random data corruption caused by const fill, this is caused by
write compression mode not correctly configured.
So correct compression mode for const fill.
Signed-off-by: Frank Min <Frank.Min(a)amd.com>
Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 75400f8d6e36afc88d59db8a1f3e4b7d90d836ad)
Cc: stable(a)vger.kernel.org # 6.11.x
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index a8763496aed3..9288f37a3cc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -51,6 +51,12 @@ MODULE_FIRMWARE("amdgpu/sdma_7_0_1.bin");
#define SDMA0_HYP_DEC_REG_END 0x589a
#define SDMA1_HYP_DEC_REG_OFFSET 0x20
+/*define for compression field for sdma7*/
+#define SDMA_PKT_CONSTANT_FILL_HEADER_compress_offset 0
+#define SDMA_PKT_CONSTANT_FILL_HEADER_compress_mask 0x00000001
+#define SDMA_PKT_CONSTANT_FILL_HEADER_compress_shift 16
+#define SDMA_PKT_CONSTANT_FILL_HEADER_COMPRESS(x) (((x) & SDMA_PKT_CONSTANT_FILL_HEADER_compress_mask) << SDMA_PKT_CONSTANT_FILL_HEADER_compress_shift)
+
static const struct amdgpu_hwip_reg_entry sdma_reg_list_7_0[] = {
SOC15_REG_ENTRY_STR(GC, 0, regSDMA0_STATUS_REG),
SOC15_REG_ENTRY_STR(GC, 0, regSDMA0_STATUS1_REG),
@@ -1724,7 +1730,8 @@ static void sdma_v7_0_emit_fill_buffer(struct amdgpu_ib *ib,
uint64_t dst_offset,
uint32_t byte_count)
{
- ib->ptr[ib->length_dw++] = SDMA_PKT_COPY_LINEAR_HEADER_OP(SDMA_OP_CONST_FILL);
+ ib->ptr[ib->length_dw++] = SDMA_PKT_CONSTANT_FILL_HEADER_OP(SDMA_OP_CONST_FILL) |
+ SDMA_PKT_CONSTANT_FILL_HEADER_COMPRESS(1);
ib->ptr[ib->length_dw++] = lower_32_bits(dst_offset);
ib->ptr[ib->length_dw++] = upper_32_bits(dst_offset);
ib->ptr[ib->length_dw++] = src_data;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x f559b2e9c5c5308850544ab59396b7d53cfc67bd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102849-giggle-five-0241@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
f559b2e9c5c5 ("KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory")
2732be902353 ("KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs")
883b0a91f41a ("KVM: SVM: Move Nested SVM Implementation to nested.c")
46a010dd6896 ("kVM SVM: Move SVM related files to own sub-directory")
d55c9d4009c7 ("KVM: nSVM: check for EFER.SVME=1 before entering guest")
0b66465344a7 ("KVM: nSVM: Remove an obsolete comment.")
78f2145c4d93 ("KVM: nSVM: avoid loss of pending IRQ/NMI before entering L2")
b518ba9fa691 ("KVM: nSVM: implement check_nested_events for interrupts")
64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
b5ec2e020b70 ("KVM: nSVM: do not change host intercepts while nested VM is running")
689f3bf21628 ("KVM: x86: unify callbacks to load paging root")
257038745cae ("KVM: x86: Move nSVM CPUID 0x8000000A handling into common x86 code")
a50718cc3f43 ("KVM: nSVM: Expose SVM features to L1 iff nested is enabled")
703c335d0693 ("KVM: x86/mmu: Configure max page level during hardware setup")
bde772355958 ("KVM: x86/mmu: Merge kvm_{enable,disable}_tdp() into a common function")
213e0e1f500b ("KVM: SVM: Refactor logging of NPT enabled/disabled")
a1bead2abaa1 ("KVM: VMX: Directly query Intel PT mode when refreshing PMUs")
139085101f85 ("KVM: x86: Use KVM cpu caps to detect MSR_TSC_AUX virt support")
93c380e7b528 ("KVM: x86: Set emulated/transmuted feature bits via kvm_cpu_caps")
bd7919999047 ("KVM: x86: Override host CPUID results with kvm_cpu_caps")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f559b2e9c5c5308850544ab59396b7d53cfc67bd Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Wed, 9 Oct 2024 07:08:38 -0700
Subject: [PATCH] KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits
4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't
enforce 32-byte alignment of nCR3.
In the absolute worst case scenario, failure to ignore bits 4:0 can result
in an out-of-bounds read, e.g. if the target page is at the end of a
memslot, and the VMM isn't using guard pages.
Per the APM:
The CR3 register points to the base address of the page-directory-pointer
table. The page-directory-pointer table is aligned on a 32-byte boundary,
with the low 5 address bits 4:0 assumed to be 0.
And the SDM's much more explicit:
4:0 Ignored
Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow
that is broken.
Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
Reported-by: Kirk Swidowski <swidowski(a)google.com>
Cc: Andy Nguyen <theflow(a)google.com>
Cc: 3pvd <3pvd(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-ID: <20241009140838.1036226-1-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d5314cb7dff4..cf84103ce38b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -63,8 +63,12 @@ static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
u64 pdpte;
int ret;
+ /*
+ * Note, nCR3 is "assumed" to be 32-byte aligned, i.e. the CPU ignores
+ * nCR3[4:0] when loading PDPTEs from memory.
+ */
ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
- offset_in_page(cr3) + index * 8, 8);
+ (cr3 & GENMASK(11, 5)) + index * 8, 8);
if (ret)
return 0;
return pdpte;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x f559b2e9c5c5308850544ab59396b7d53cfc67bd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102847-level-stoic-fc77@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
f559b2e9c5c5 ("KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory")
2732be902353 ("KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs")
883b0a91f41a ("KVM: SVM: Move Nested SVM Implementation to nested.c")
46a010dd6896 ("kVM SVM: Move SVM related files to own sub-directory")
d55c9d4009c7 ("KVM: nSVM: check for EFER.SVME=1 before entering guest")
0b66465344a7 ("KVM: nSVM: Remove an obsolete comment.")
78f2145c4d93 ("KVM: nSVM: avoid loss of pending IRQ/NMI before entering L2")
b518ba9fa691 ("KVM: nSVM: implement check_nested_events for interrupts")
64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
b5ec2e020b70 ("KVM: nSVM: do not change host intercepts while nested VM is running")
689f3bf21628 ("KVM: x86: unify callbacks to load paging root")
257038745cae ("KVM: x86: Move nSVM CPUID 0x8000000A handling into common x86 code")
a50718cc3f43 ("KVM: nSVM: Expose SVM features to L1 iff nested is enabled")
703c335d0693 ("KVM: x86/mmu: Configure max page level during hardware setup")
bde772355958 ("KVM: x86/mmu: Merge kvm_{enable,disable}_tdp() into a common function")
213e0e1f500b ("KVM: SVM: Refactor logging of NPT enabled/disabled")
a1bead2abaa1 ("KVM: VMX: Directly query Intel PT mode when refreshing PMUs")
139085101f85 ("KVM: x86: Use KVM cpu caps to detect MSR_TSC_AUX virt support")
93c380e7b528 ("KVM: x86: Set emulated/transmuted feature bits via kvm_cpu_caps")
bd7919999047 ("KVM: x86: Override host CPUID results with kvm_cpu_caps")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f559b2e9c5c5308850544ab59396b7d53cfc67bd Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Wed, 9 Oct 2024 07:08:38 -0700
Subject: [PATCH] KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits
4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't
enforce 32-byte alignment of nCR3.
In the absolute worst case scenario, failure to ignore bits 4:0 can result
in an out-of-bounds read, e.g. if the target page is at the end of a
memslot, and the VMM isn't using guard pages.
Per the APM:
The CR3 register points to the base address of the page-directory-pointer
table. The page-directory-pointer table is aligned on a 32-byte boundary,
with the low 5 address bits 4:0 assumed to be 0.
And the SDM's much more explicit:
4:0 Ignored
Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow
that is broken.
Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
Reported-by: Kirk Swidowski <swidowski(a)google.com>
Cc: Andy Nguyen <theflow(a)google.com>
Cc: 3pvd <3pvd(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-ID: <20241009140838.1036226-1-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d5314cb7dff4..cf84103ce38b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -63,8 +63,12 @@ static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
u64 pdpte;
int ret;
+ /*
+ * Note, nCR3 is "assumed" to be 32-byte aligned, i.e. the CPU ignores
+ * nCR3[4:0] when loading PDPTEs from memory.
+ */
ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
- offset_in_page(cr3) + index * 8, 8);
+ (cr3 & GENMASK(11, 5)) + index * 8, 8);
if (ret)
return 0;
return pdpte;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 088984c8d54c0053fc4ae606981291d741c5924b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102850-postal-pogo-94ed@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 088984c8d54c0053fc4ae606981291d741c5924b Mon Sep 17 00:00:00 2001
From: Koba Ko <kobak(a)nvidia.com>
Date: Sun, 13 Oct 2024 04:50:10 +0800
Subject: [PATCH] ACPI: PRM: Find EFI_MEMORY_RUNTIME block for PRM handler and
context
PRMT needs to find the correct type of block to translate the PA-VA
mapping for EFI runtime services.
The issue arises because the PRMT is finding a block of type
EFI_CONVENTIONAL_MEMORY, which is not appropriate for runtime services
as described in Section 2.2.2 (Runtime Services) of the UEFI
Specification [1]. Since the PRM handler is a type of runtime service,
this causes an exception when the PRM handler is called.
[Firmware Bug]: Unable to handle paging request in EFI runtime service
WARNING: CPU: 22 PID: 4330 at drivers/firmware/efi/runtime-wrappers.c:341
__efi_queue_work+0x11c/0x170
Call trace:
Let PRMT find a block with EFI_MEMORY_RUNTIME for PRM handler and PRM
context.
If no suitable block is found, a warning message will be printed, but
the procedure continues to manage the next PRM handler.
However, if the PRM handler is actually called without proper allocation,
it would result in a failure during error handling.
By using the correct memory types for runtime services, ensure that the
PRM handler and the context are properly mapped in the virtual address
space during runtime, preventing the paging request error.
The issue is really that only memory that has been remapped for runtime
by the firmware can be used by the PRM handler, and so the region needs
to have the EFI_MEMORY_RUNTIME attribute.
Link: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_10_Aug29.pdf # [1]
Fixes: cefc7ca46235 ("ACPI: PRM: implement OperationRegion handler for the PlatformRtMechanism subtype")
Cc: All applicable <stable(a)vger.kernel.org>
Signed-off-by: Koba Ko <kobak(a)nvidia.com>
Reviewed-by: Matthew R. Ochs <mochs(a)nvidia.com>
Reviewed-by: Zhang Rui <rui.zhang(a)intel.com>
Reviewed-by: Ard Biesheuvel <ardb(a)kernel.org>
Link: https://patch.msgid.link/20241012205010.4165798-1-kobak@nvidia.com
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
diff --git a/drivers/acpi/prmt.c b/drivers/acpi/prmt.c
index 1cfaa5957ac4..d59307a76ca3 100644
--- a/drivers/acpi/prmt.c
+++ b/drivers/acpi/prmt.c
@@ -72,17 +72,21 @@ struct prm_module_info {
struct prm_handler_info handlers[] __counted_by(handler_count);
};
-static u64 efi_pa_va_lookup(u64 pa)
+static u64 efi_pa_va_lookup(efi_guid_t *guid, u64 pa)
{
efi_memory_desc_t *md;
u64 pa_offset = pa & ~PAGE_MASK;
u64 page = pa & PAGE_MASK;
for_each_efi_memory_desc(md) {
- if (md->phys_addr < pa && pa < md->phys_addr + PAGE_SIZE * md->num_pages)
+ if ((md->attribute & EFI_MEMORY_RUNTIME) &&
+ (md->phys_addr < pa && pa < md->phys_addr + PAGE_SIZE * md->num_pages)) {
return pa_offset + md->virt_addr + page - md->phys_addr;
+ }
}
+ pr_warn("Failed to find VA for GUID: %pUL, PA: 0x%llx", guid, pa);
+
return 0;
}
@@ -148,9 +152,15 @@ acpi_parse_prmt(union acpi_subtable_headers *header, const unsigned long end)
th = &tm->handlers[cur_handler];
guid_copy(&th->guid, (guid_t *)handler_info->handler_guid);
- th->handler_addr = (void *)efi_pa_va_lookup(handler_info->handler_address);
- th->static_data_buffer_addr = efi_pa_va_lookup(handler_info->static_data_buffer_address);
- th->acpi_param_buffer_addr = efi_pa_va_lookup(handler_info->acpi_param_buffer_address);
+ th->handler_addr =
+ (void *)efi_pa_va_lookup(&th->guid, handler_info->handler_address);
+
+ th->static_data_buffer_addr =
+ efi_pa_va_lookup(&th->guid, handler_info->static_data_buffer_address);
+
+ th->acpi_param_buffer_addr =
+ efi_pa_va_lookup(&th->guid, handler_info->acpi_param_buffer_address);
+
} while (++cur_handler < tm->handler_count && (handler_info = get_next_handler(handler_info)));
return 0;
@@ -277,6 +287,13 @@ static acpi_status acpi_platformrt_space_handler(u32 function,
if (!handler || !module)
goto invalid_guid;
+ if (!handler->handler_addr ||
+ !handler->static_data_buffer_addr ||
+ !handler->acpi_param_buffer_addr) {
+ buffer->prm_status = PRM_HANDLER_ERROR;
+ return AE_OK;
+ }
+
ACPI_COPY_NAMESEG(context.signature, "PRMC");
context.revision = 0x0;
context.reserved = 0x0;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x f10f59f91a6278e9637327d1206140d28e2d5004
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102859-around-strangely-5c99@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f10f59f91a6278e9637327d1206140d28e2d5004 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 9 Oct 2024 09:37:03 +1030
Subject: [PATCH] btrfs: fix the delalloc range locking if sector size < page
size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can over write other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle range larger than the locked
range, and only unlock what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable(a)vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 309a8ae48434..872cca54cc6c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -262,22 +262,23 @@ static noinline int lock_delalloc_folios(struct inode *inode,
for (i = 0; i < found_folios; i++) {
struct folio *folio = fbatch.folios[i];
- u32 len = end + 1 - start;
+ u64 range_start;
+ u32 range_len;
if (folio == locked_folio)
continue;
- if (btrfs_folio_start_writer_lock(fs_info, folio, start,
- len))
- goto out;
-
+ folio_lock(folio);
if (!folio_test_dirty(folio) || folio->mapping != mapping) {
- btrfs_folio_end_writer_lock(fs_info, folio, start,
- len);
+ folio_unlock(folio);
goto out;
}
+ range_start = max_t(u64, folio_pos(folio), start);
+ range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
+ end + 1) - range_start;
+ btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len);
- processed_end = folio_pos(folio) + folio_size(folio) - 1;
+ processed_end = range_start + range_len - 1;
}
folio_batch_release(&fbatch);
cond_resched();
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x f10f59f91a6278e9637327d1206140d28e2d5004
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102857-unvaried-grip-2f5d@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f10f59f91a6278e9637327d1206140d28e2d5004 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 9 Oct 2024 09:37:03 +1030
Subject: [PATCH] btrfs: fix the delalloc range locking if sector size < page
size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can over write other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle range larger than the locked
range, and only unlock what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable(a)vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 309a8ae48434..872cca54cc6c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -262,22 +262,23 @@ static noinline int lock_delalloc_folios(struct inode *inode,
for (i = 0; i < found_folios; i++) {
struct folio *folio = fbatch.folios[i];
- u32 len = end + 1 - start;
+ u64 range_start;
+ u32 range_len;
if (folio == locked_folio)
continue;
- if (btrfs_folio_start_writer_lock(fs_info, folio, start,
- len))
- goto out;
-
+ folio_lock(folio);
if (!folio_test_dirty(folio) || folio->mapping != mapping) {
- btrfs_folio_end_writer_lock(fs_info, folio, start,
- len);
+ folio_unlock(folio);
goto out;
}
+ range_start = max_t(u64, folio_pos(folio), start);
+ range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
+ end + 1) - range_start;
+ btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len);
- processed_end = folio_pos(folio) + folio_size(folio) - 1;
+ processed_end = range_start + range_len - 1;
}
folio_batch_release(&fbatch);
cond_resched();
The patch below does not apply to the 6.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.11.y
git checkout FETCH_HEAD
git cherry-pick -x f10f59f91a6278e9637327d1206140d28e2d5004
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102855-swab-anytime-3da1@gregkh' --subject-prefix 'PATCH 6.11.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f10f59f91a6278e9637327d1206140d28e2d5004 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 9 Oct 2024 09:37:03 +1030
Subject: [PATCH] btrfs: fix the delalloc range locking if sector size < page
size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can over write other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle range larger than the locked
range, and only unlock what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable(a)vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 309a8ae48434..872cca54cc6c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -262,22 +262,23 @@ static noinline int lock_delalloc_folios(struct inode *inode,
for (i = 0; i < found_folios; i++) {
struct folio *folio = fbatch.folios[i];
- u32 len = end + 1 - start;
+ u64 range_start;
+ u32 range_len;
if (folio == locked_folio)
continue;
- if (btrfs_folio_start_writer_lock(fs_info, folio, start,
- len))
- goto out;
-
+ folio_lock(folio);
if (!folio_test_dirty(folio) || folio->mapping != mapping) {
- btrfs_folio_end_writer_lock(fs_info, folio, start,
- len);
+ folio_unlock(folio);
goto out;
}
+ range_start = max_t(u64, folio_pos(folio), start);
+ range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
+ end + 1) - range_start;
+ btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len);
- processed_end = folio_pos(folio) + folio_size(folio) - 1;
+ processed_end = range_start + range_len - 1;
}
folio_batch_release(&fbatch);
cond_resched();
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 5f9062a48db260fd6b53d86ecfb4d5dc59266316
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102858-yield-gestate-e6bc@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5f9062a48db260fd6b53d86ecfb4d5dc59266316 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Tue, 10 Sep 2024 15:21:04 +0930
Subject: [PATCH] btrfs: qgroup: set a more sane default value for subtree drop
threshold
Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.
It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.
I'd say it's not a good idea to rely on user space tool to set this
default value, especially when some operations (snapshot dropping) can
be triggered immediately after mount, leaving a very small window to
that that sysfs interface.
So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.
CC: stable(a)vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1238a38c59b2..5afb68c0304b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1959,7 +1959,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
fs_info->qgroup_seq = 1;
fs_info->qgroup_ulist = NULL;
fs_info->qgroup_rescan_running = false;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
mutex_init(&fs_info->qgroup_rescan_lock);
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1332ec59c539..a0e8deca87a7 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1407,7 +1407,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
fs_info->quota_root = NULL;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
spin_unlock(&fs_info->qgroup_lock);
btrfs_free_qgroup_config(fs_info);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 98adf4ec7b01..c229256d6fd5 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -121,6 +121,8 @@ struct btrfs_inode;
#define BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN (1ULL << 63)
#define BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING (1ULL << 62)
+#define BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT (3)
+
/*
* Record a dirty extent, and info qgroup to update quota on it
*/
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 5f9062a48db260fd6b53d86ecfb4d5dc59266316
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024102856-chance-seventh-bedd@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5f9062a48db260fd6b53d86ecfb4d5dc59266316 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Tue, 10 Sep 2024 15:21:04 +0930
Subject: [PATCH] btrfs: qgroup: set a more sane default value for subtree drop
threshold
Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.
It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.
I'd say it's not a good idea to rely on user space tool to set this
default value, especially when some operations (snapshot dropping) can
be triggered immediately after mount, leaving a very small window to
that that sysfs interface.
So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.
CC: stable(a)vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1238a38c59b2..5afb68c0304b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1959,7 +1959,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
fs_info->qgroup_seq = 1;
fs_info->qgroup_ulist = NULL;
fs_info->qgroup_rescan_running = false;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
mutex_init(&fs_info->qgroup_rescan_lock);
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1332ec59c539..a0e8deca87a7 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1407,7 +1407,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
fs_info->quota_root = NULL;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
spin_unlock(&fs_info->qgroup_lock);
btrfs_free_qgroup_config(fs_info);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 98adf4ec7b01..c229256d6fd5 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -121,6 +121,8 @@ struct btrfs_inode;
#define BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN (1ULL << 63)
#define BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING (1ULL << 62)
+#define BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT (3)
+
/*
* Record a dirty extent, and info qgroup to update quota on it
*/
The Synopsys DesignWare mmc controller on the JH7110 SoC
(dw_mmc-starfive.c driver) is using a 32-bit IDMAC address bus width,
and thus requires the use of SWIOTLB.
The commit 8396c793ffdf ("mmc: dw_mmc: Fix IDMAC operation with pages
bigger than 4K") increased the max_seq_size, even for 4K pages, causing
"swiotlb buffer is full" to happen because swiotlb can only handle a
memory size up to 256kB only.
Fix the issue, by making sure the dw_mmc driver doesn't use segments
bigger than what SWIOTLB can handle.
Reported-by: Ron Economos <re(a)w6rz.net>
Reported-by: Jing Luo <jing(a)jing.rocks>
Fixes: 8396c793ffdf ("mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K")
Cc: stable(a)vger.kernel.org
Signed-off-by: Aurelien Jarno <aurelien(a)aurel32.net>
---
drivers/mmc/host/dw_mmc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 41e451235f637..dc0d6201f7b73 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2958,7 +2958,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
mmc->max_segs = host->ring_size;
mmc->max_blk_size = 65535;
mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
- mmc->max_seg_size = mmc->max_req_size;
+ mmc->max_seg_size =
+ min_t(size_t, mmc->max_req_size, dma_max_mapping_size(host->dev));
mmc->max_blk_count = mmc->max_req_size / 512;
} else if (host->use_dma == TRANS_MODE_EDMAC) {
mmc->max_segs = 64;
--
2.45.2
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: b5413156bad91dc2995a5c4eab1b05e56914638a
Gitweb: https://git.kernel.org/tip/b5413156bad91dc2995a5c4eab1b05e56914638a
Author: Benjamin Segall <bsegall(a)google.com>
AuthorDate: Fri, 25 Oct 2024 18:35:35 -07:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Sun, 27 Oct 2024 10:36:04 +01:00
posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone
When cloning a new thread, its posix_cputimers are not inherited, and
are cleared by posix_cputimers_init(). However, this does not clear the
tick dependency it creates in tsk->tick_dep_mask, and the handler does
not reach the code to clear the dependency if there were no timers to
begin with.
Thus if a thread has a cputimer running before clone/fork, all
descendants will prevent nohz_full unless they create a cputimer of
their own.
Fix this by entirely clearing the tick_dep_mask in copy_process().
(There is currently no inherited state that needs a tick dependency)
Process-wide timers do not have this problem because fork does not copy
signal_struct as a baseline, it creates one from scratch.
Fixes: b78783000d5c ("posix-cpu-timers: Migrate to use new tick dependency mask model")
Signed-off-by: Ben Segall <bsegall(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/xm26o737bq8o.fsf@google.com
---
include/linux/tick.h | 8 ++++++++
kernel/fork.c | 2 ++
2 files changed, 10 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7274463..99c9c5a 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -251,12 +251,19 @@ static inline void tick_dep_set_task(struct task_struct *tsk,
if (tick_nohz_full_enabled())
tick_nohz_dep_set_task(tsk, bit);
}
+
static inline void tick_dep_clear_task(struct task_struct *tsk,
enum tick_dep_bits bit)
{
if (tick_nohz_full_enabled())
tick_nohz_dep_clear_task(tsk, bit);
}
+
+static inline void tick_dep_init_task(struct task_struct *tsk)
+{
+ atomic_set(&tsk->tick_dep_mask, 0);
+}
+
static inline void tick_dep_set_signal(struct task_struct *tsk,
enum tick_dep_bits bit)
{
@@ -290,6 +297,7 @@ static inline void tick_dep_set_task(struct task_struct *tsk,
enum tick_dep_bits bit) { }
static inline void tick_dep_clear_task(struct task_struct *tsk,
enum tick_dep_bits bit) { }
+static inline void tick_dep_init_task(struct task_struct *tsk) { }
static inline void tick_dep_set_signal(struct task_struct *tsk,
enum tick_dep_bits bit) { }
static inline void tick_dep_clear_signal(struct signal_struct *signal,
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a..6fa9fe6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <linux/rseq.h>
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
+#include <linux/tick.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -2292,6 +2293,7 @@ __latent_entropy struct task_struct *copy_process(
acct_clear_integrals(p);
posix_cputimers_init(&p->posix_cputimers);
+ tick_dep_init_task(p);
p->io_context = NULL;
audit_set_context(p, NULL);
Dear Linux Community Members,
Over the years, various informal groups have formed within our community,
serving purposes such as maintaining connections with companies and external
bodies, handling sensitive information, making challenging decisions, and,
at times, representing the community as a whole. These groups contribute significantly
to our community's development and deserve our recognition and appreciation.
I'll name a few below that I identified from `Documentation/`:
- Code of Conduct Committee <conduct(a)kernel.org>
- Linux kernel security team <security(a)kernel.org>
- Linux kernel hardware security team <hardware-security(a)kernel.org>
- Kernel CVE assignment team <cve(a)kernel.org>
- Stable Team for unpublished vulnerabilities <stable(a)kernel.org>
(I suspect it's just an alias to regular stable team, but I found no evidence).
Over recent events, I've taken a closer look at how our community's governance
operates, only to find that there's remarkably little public information available
about those informal groups. With the exception of the Linux kernel hardware security
team, it seems none of these groups maintain a public list of members that I can
easily find.
Upon digging into the details, I’d like to raise a few concerns and offer some thoughts
for further discussion:
- Absence of a Membership Register
Our community is built on mutual trust. Without knowing who comprises these groups,
it's understandably difficult for people to have full confidence in their work.
A publicly available membership list would not only foster trust but also allow us to
address our recognition and appreciation.
- Lack of Guidelines for Actions
Many of these groups appear to operate without documented guidelines. While I trust each
respectful individual's integrity, documented guidelines would enable the wider community
to better understand and appreciate the roles and responsibilities involved.
- Insufficient Transparency in Decision-Making
I fully respect the need for confidentiality in handling security matters, yet some
degree of openness around decision-making processes is essential in my opinion.
Releasing communications post-embargo, for instance, could promote understanding and
prevent potential abuse of confidential procedures.
- No Conflict of Interest Policy
Particularly in the case of the Code of Conduct Committee, there may arise situations
where individuals face challenging decisions involving personal connections. A conflict
of interest policy would provide valuable guidance in such circumstances.
Thank you for reading. I know none of us enjoy being pulled away by these non-technical
concerns, we love coding after all. However, I feel these concerns are vital for the
community's continued health. It might be a candidate of Linux TAB discussion.
I'm looking forward to everyone's input.
Thanks
- Jiaxun
Since commit 50ea5449c563 ("can: mcp251xfd: fix ring configuration
when switching from CAN-CC to CAN-FD mode"), the current ring and
coalescing configuration is passed to can_ram_get_layout(). That fixed
the issue when switching between CAN-CC and CAN-FD mode with
configured ring (rx, tx) and/or coalescing parameters (rx-frames-irq,
tx-frames-irq).
However 50ea5449c563 ("can: mcp251xfd: fix ring configuration when
switching from CAN-CC to CAN-FD mode"), introduced a regression when
switching CAN modes with disabled coalescing configuration: Even if
the previous CAN mode has no coalescing configured, the new mode is
configured with active coalescing. This leads to delayed receiving of
CAN-FD frames.
This comes from the fact, that ethtool uses usecs = 0 and max_frames =
1 to disable coalescing, however the driver uses internally
priv->{rx,tx}_obj_num_coalesce_irq = 0 to indicate disabled
coalescing.
Fix the regression by assigning struct ethtool_coalesce
ec->{rx,tx}_max_coalesced_frames_irq = 1 if coalescing is disabled in
the driver as can_ram_get_layout() expects this.
Reported-by: https://github.com/vdh-robothania
Closes: https://github.com/raspberrypi/linux/issues/6407
Fixes: 50ea5449c563 ("can: mcp251xfd: fix ring configuration when switching from CAN-CC to CAN-FD mode")
Cc: stable(a)vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
---
drivers/net/can/spi/mcp251xfd/mcp251xfd-ring.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-ring.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-ring.c
index e684991fa3917d4f6b6ebda8329f72971237574e..7209a831f0f2089e409c6be635f0e5dc7b2271da 100644
--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-ring.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-ring.c
@@ -2,7 +2,7 @@
//
// mcp251xfd - Microchip MCP251xFD Family CAN controller driver
//
-// Copyright (c) 2019, 2020, 2021 Pengutronix,
+// Copyright (c) 2019, 2020, 2021, 2024 Pengutronix,
// Marc Kleine-Budde <kernel(a)pengutronix.de>
//
// Based on:
@@ -483,9 +483,11 @@ int mcp251xfd_ring_alloc(struct mcp251xfd_priv *priv)
};
const struct ethtool_coalesce ec = {
.rx_coalesce_usecs_irq = priv->rx_coalesce_usecs_irq,
- .rx_max_coalesced_frames_irq = priv->rx_obj_num_coalesce_irq,
+ .rx_max_coalesced_frames_irq = priv->rx_obj_num_coalesce_irq == 0 ?
+ 1 : priv->rx_obj_num_coalesce_irq,
.tx_coalesce_usecs_irq = priv->tx_coalesce_usecs_irq,
- .tx_max_coalesced_frames_irq = priv->tx_obj_num_coalesce_irq,
+ .tx_max_coalesced_frames_irq = priv->tx_obj_num_coalesce_irq == 0 ?
+ 1 : priv->tx_obj_num_coalesce_irq,
};
struct can_ram_layout layout;
---
base-commit: 9efc44fb2dba6138b0575826319200049078679a
change-id: 20241010-mcp251xfd-fix-coalesing-f373066dd42e
Best regards,
--
Marc Kleine-Budde <mkl(a)pengutronix.de>
The quilt patch titled
Subject: mm: avoid unconditional one-tick sleep when swapcache_prepare fails
has been removed from the -mm tree. Its filename was
mm-avoid-unconditional-one-tick-sleep-when-swapcache_prepare-fails.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Barry Song <v-songbaohua(a)oppo.com>
Subject: mm: avoid unconditional one-tick sleep when swapcache_prepare fails
Date: Fri, 27 Sep 2024 09:19:36 +1200
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
introduced an unconditional one-tick sleep when `swapcache_prepare()`
fails, which has led to reports of UI stuttering on latency-sensitive
Android devices. To address this, we can use a waitqueue to wake up tasks
that fail `swapcache_prepare()` sooner, instead of always sleeping for a
full tick. While tasks may occasionally be woken by an unrelated
`do_swap_page()`, this method is preferable to two scenarios: rapid
re-entry into page faults, which can cause livelocks, and multiple
millisecond sleeps, which visibly degrade user experience.
Oven's testing shows that a single waitqueue resolves the UI stuttering
issue. If a 'thundering herd' problem becomes apparent later, a waitqueue
hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit
locks can be introduced.
[v-songbaohua(a)oppo.com: wake_up only when swapcache_wq waitqueue is active]
Link: https://lkml.kernel.org/r/20241008130807.40833-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240926211936.75373-1-21cnbao@gmail.com
Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
Reported-by: Oven Liyang <liyangouwen1(a)oppo.com>
Tested-by: Oven Liyang <liyangouwen1(a)oppo.com>
Cc: Kairui Song <kasong(a)tencent.com>
Cc: "Huang, Ying" <ying.huang(a)intel.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Yosry Ahmed <yosryahmed(a)google.com>
Cc: SeongJae Park <sj(a)kernel.org>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
--- a/mm/memory.c~mm-avoid-unconditional-one-tick-sleep-when-swapcache_prepare-fails
+++ a/mm/memory.c
@@ -4187,6 +4187,8 @@ static struct folio *alloc_swap_folio(st
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -4199,6 +4201,7 @@ vm_fault_t do_swap_page(struct vm_fault
{
struct vm_area_struct *vma = vmf->vma;
struct folio *swapcache, *folio = NULL;
+ DECLARE_WAITQUEUE(wait, current);
struct page *page;
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
@@ -4297,7 +4300,9 @@ vm_fault_t do_swap_page(struct vm_fault
* Relax a bit to prevent rapid
* repeated page faults.
*/
+ add_wait_queue(&swapcache_wq, &wait);
schedule_timeout_uninterruptible(1);
+ remove_wait_queue(&swapcache_wq, &wait);
goto out_page;
}
need_clear_cache = true;
@@ -4604,8 +4609,11 @@ unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
/* Clear the swap cache pin for direct swapin after PTL unlock */
- if (need_clear_cache)
+ if (need_clear_cache) {
swapcache_clear(si, entry, nr_pages);
+ if (waitqueue_active(&swapcache_wq))
+ wake_up(&swapcache_wq);
+ }
if (si)
put_swap_device(si);
return ret;
@@ -4620,8 +4628,11 @@ out_release:
folio_unlock(swapcache);
folio_put(swapcache);
}
- if (need_clear_cache)
+ if (need_clear_cache) {
swapcache_clear(si, entry, nr_pages);
+ if (waitqueue_active(&swapcache_wq))
+ wake_up(&swapcache_wq);
+ }
if (si)
put_swap_device(si);
return ret;
_
Patches currently in -mm which might be from v-songbaohua(a)oppo.com are
mm-fix-pswpin-counter-for-large-folios-swap-in.patch
The patch titled
Subject: mm, mmap: limit THP aligment of anonymous mappings to PMD-aligned sizes
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mmap-limit-thp-aligment-of-anonymous-mappings-to-pmd-aligned-sizes.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: mm, mmap: limit THP aligment of anonymous mappings to PMD-aligned sizes
Date: Thu, 24 Oct 2024 17:12:29 +0200
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address hint
and of at least PMD_SIZE will be aligned to PMD so that it can benefit
from a THP backing page.
However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have merged
to a large THP-backed area before commit efa7df3e3bb5 and now they are
fragmented to multiple areas each aligned to PMD boundary with gaps
between. The regression then seems to be caused mainly due to the
benchmark's memory access pattern suffering from TLB or cache aliasing due
to the aligned boundaries of the individual areas.
Another known regression bisected to commit efa7df3e3bb5 is darktable [2]
[3] and early testing suggests this patch fixes the regression there as
well.
To fix the regression but still try to benefit from THP-friendly anonymous
mapping alignment, add a condition that the size of the mapping must be a
multiple of PMD size instead of at least PMD size. In case of many
odd-sized mapping like the cactusBSSN creates, those will stop being
aligned and with gaps between, and instead naturally merge again.
Link: https://lkml.kernel.org/r/20241024151228.101841-2-vbabka@suse.cz
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Reported-by: Michael Matz <matz(a)suse.de>
Debugged-by: Gabriel Krisman Bertazi <gabriel(a)krisman.be>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
Reported-by: Matthias Bodenbinder <matthias(a)bodenbinder.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.i… [3]
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Petr Tesarik <ptesarik(a)suse.com>
Cc: Thorsten Leemhuis <regressions(a)leemhuis.info>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/mmap.c~mm-mmap-limit-thp-aligment-of-anonymous-mappings-to-pmd-aligned-sizes
+++ a/mm/mmap.c
@@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, u
if (get_area) {
addr = get_area(file, addr, len, pgoff, flags);
- } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+ && IS_ALIGNED(len, PMD_SIZE)) {
/* Ensures that larger anonymous mappings are THP aligned. */
addr = thp_get_unmapped_area_vmflags(file, addr, len,
pgoff, flags, vm_flags);
_
Patches currently in -mm which might be from vbabka(a)suse.cz are
mm-mmap-limit-thp-aligment-of-anonymous-mappings-to-pmd-aligned-sizes.patch
The patch titled
Subject: mm: shrinker: avoid memleak in alloc_shrinker_info
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-shrinker-avoid-memleak-in-alloc_shrinker_info.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Chen Ridong <chenridong(a)huawei.com>
Subject: mm: shrinker: avoid memleak in alloc_shrinker_info
Date: Fri, 25 Oct 2024 06:09:42 +0000
A memleak was found as below:
unreferenced object 0xffff8881010d2a80 (size 32):
comm "mkdir", pid 1559, jiffies 4294932666
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 @...............
backtrace (crc 2e7ef6fa):
[<ffffffff81372754>] __kmalloc_node_noprof+0x394/0x470
[<ffffffff813024ab>] alloc_shrinker_info+0x7b/0x1a0
[<ffffffff813b526a>] mem_cgroup_css_online+0x11a/0x3b0
[<ffffffff81198dd9>] online_css+0x29/0xa0
[<ffffffff811a243d>] cgroup_apply_control_enable+0x20d/0x360
[<ffffffff811a5728>] cgroup_mkdir+0x168/0x5f0
[<ffffffff8148543e>] kernfs_iop_mkdir+0x5e/0x90
[<ffffffff813dbb24>] vfs_mkdir+0x144/0x220
[<ffffffff813e1c97>] do_mkdirat+0x87/0x130
[<ffffffff813e1de9>] __x64_sys_mkdir+0x49/0x70
[<ffffffff81f8c928>] do_syscall_64+0x68/0x140
[<ffffffff8200012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
alloc_shrinker_info(), when shrinker_unit_alloc() returns an errer, the
info won't be freed. Just fix it.
Link: https://lkml.kernel.org/r/20241025060942.1049263-1-chenridong@huaweicloud.c…
Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}")
Signed-off-by: Chen Ridong <chenridong(a)huawei.com>
Acked-by: Qi Zheng <zhengqi.arch(a)bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Wang Weiyang <wangweiyang2(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shrinker.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
--- a/mm/shrinker.c~mm-shrinker-avoid-memleak-in-alloc_shrinker_info
+++ a/mm/shrinker.c
@@ -76,19 +76,21 @@ void free_shrinker_info(struct mem_cgrou
int alloc_shrinker_info(struct mem_cgroup *memcg)
{
- struct shrinker_info *info;
int nid, ret = 0;
int array_size = 0;
mutex_lock(&shrinker_mutex);
array_size = shrinker_unit_size(shrinker_nr_max);
for_each_node(nid) {
- info = kvzalloc_node(sizeof(*info) + array_size, GFP_KERNEL, nid);
+ struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
+ GFP_KERNEL, nid);
if (!info)
goto err;
info->map_nr_max = shrinker_nr_max;
- if (shrinker_unit_alloc(info, NULL, nid))
+ if (shrinker_unit_alloc(info, NULL, nid)) {
+ kvfree(info);
goto err;
+ }
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
mutex_unlock(&shrinker_mutex);
_
Patches currently in -mm which might be from chenridong(a)huawei.com are
mm-shrinker-avoid-memleak-in-alloc_shrinker_info.patch
We notice some platforms set "snps,dis_u3_susphy_quirk" and
"snps,dis_u2_susphy_quirk" when they should not need to. Just make sure that
the GUSB3PIPECTL.SUSPENDENABLE and GUSB2PHYCFG.SUSPHY are clear during
initialization. The host initialization involved xhci. So the dwc3 needs to
implement the xhci_plat_priv->plat_start() for xhci to re-enable the suspend
bits.
Since there's a prerequisite patch to drivers/usb/host/xhci-plat.h that's not a
fix patch, this series should go on Greg's usb-testing branch instead of
usb-linus.
Thinh Nguyen (2):
usb: xhci-plat: Don't include xhci.h
usb: dwc3: core: Prevent phy suspend during init
drivers/usb/dwc3/core.c | 90 +++++++++++++++---------------------
drivers/usb/dwc3/core.h | 1 +
drivers/usb/dwc3/gadget.c | 2 +
drivers/usb/dwc3/host.c | 27 +++++++++++
drivers/usb/host/xhci-plat.h | 4 +-
5 files changed, 71 insertions(+), 53 deletions(-)
base-commit: 3d122e6d27e417a9fa91181922743df26b2cd679
--
2.28.0
The patch titled
Subject: vmscan,migrate: fix double-decrement on node stats when demoting pages
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Gregory Price <gourry(a)gourry.net>
Subject: vmscan,migrate: fix double-decrement on node stats when demoting pages
Date: Fri, 25 Oct 2024 10:17:24 -0400
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. Successful demotions will cause node
vmstat numbers to double-decrement, leading to an imbalanced page count.
The result is dmesg output like such:
$ cat /proc/sys/vm/stat_refresh
[77383.088417] vmstat_refresh: nr_isolated_anon -103212
[77383.088417] vmstat_refresh: nr_isolated_file -899642
This negative value may impact compaction and reclaim throttling.
The double-decrement occurs in the migrate_pages path:
caller to shrink_folio_list decrements the count
shrink_folio_list
demote_folio_list
migrate_pages
migrate_pages_batch
migrate_folio_move
migrate_folio_done
mod_node_page_state(-ve) <- second decrement
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count when
the migration reason is MR_DEMOTION. As of v6.11, this demotion logic is
the only source of MR_DEMOTION.
Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net
Fixes: 26aa2d199d6f2 ("mm/migrate: demote pages during reclaim")
Signed-off-by: Gregory Price <gourry(a)gourry.net>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/migrate.c~vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages
+++ a/mm/migrate.c
@@ -1178,7 +1178,7 @@ static void migrate_folio_done(struct fo
* not accounted to NR_ISOLATED_*. They can be recognized
* as __folio_test_movable
*/
- if (likely(!__folio_test_movable(src)))
+ if (likely(!__folio_test_movable(src)) && reason != MR_DEMOTION)
mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
folio_is_file_lru(src), -folio_nr_pages(src));
_
Patches currently in -mm which might be from gourry(a)gourry.net are
vmscanmigrate-fix-double-decrement-on-node-stats-when-demoting-pages.patch
Returning an abort to the guest for an unsupported MMIO access is a
documented feature of the KVM UAPI. Nevertheless, it's clear that this
plumbing has seen limited testing, since userspace can trivially cause a
WARN in the MMIO return:
WARNING: CPU: 0 PID: 30558 at arch/arm64/include/asm/kvm_emulate.h:536 kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
Call trace:
kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
kvm_arch_vcpu_ioctl_run+0x98/0x15b4 arch/arm64/kvm/arm.c:1133
kvm_vcpu_ioctl+0x75c/0xa78 virt/kvm/kvm_main.c:4487
__do_sys_ioctl fs/ioctl.c:51 [inline]
__se_sys_ioctl fs/ioctl.c:893 [inline]
__arm64_sys_ioctl+0x14c/0x1c8 fs/ioctl.c:893
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x1e0/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x38/0x68 arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
The splat is complaining that KVM is advancing PC while an exception is
pending, i.e. that KVM is retiring the MMIO instruction despite a
pending synchronous external abort. Womp womp.
Fix the glaring UAPI bug by skipping over all the MMIO emulation in
case there is a pending synchronous exception. Note that while userspace
is capable of pending an asynchronous exception (SError, IRQ, or FIQ),
it is still safe to retire the MMIO instruction in this case as (1) they
are by definition asynchronous, and (2) KVM relies on hardware support
for pending/delivering these exceptions instead of the software state
machine for advancing PC.
Cc: stable(a)vger.kernel.org
Fixes: da345174ceca ("KVM: arm/arm64: Allow user injection of external data aborts")
Reported-by: Alexander Potapenko <glider(a)google.com>
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
---
arch/arm64/kvm/mmio.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
index cd6b7b83e2c3..ab365e839874 100644
--- a/arch/arm64/kvm/mmio.c
+++ b/arch/arm64/kvm/mmio.c
@@ -72,6 +72,31 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len)
return data;
}
+static bool kvm_pending_sync_exception(struct kvm_vcpu *vcpu)
+{
+ if (!vcpu_get_flag(vcpu, PENDING_EXCEPTION))
+ return false;
+
+ if (vcpu_el1_is_32bit(vcpu)) {
+ switch (vcpu_get_flag(vcpu, EXCEPT_MASK)) {
+ case unpack_vcpu_flag(EXCEPT_AA32_UND):
+ case unpack_vcpu_flag(EXCEPT_AA32_IABT):
+ case unpack_vcpu_flag(EXCEPT_AA32_DABT):
+ return true;
+ default:
+ return false;
+ }
+ } else {
+ switch (vcpu_get_flag(vcpu, EXCEPT_MASK)) {
+ case unpack_vcpu_flag(EXCEPT_AA64_EL1_SYNC):
+ case unpack_vcpu_flag(EXCEPT_AA64_EL2_SYNC):
+ return true;
+ default:
+ return false;
+ }
+ }
+}
+
/**
* kvm_handle_mmio_return -- Handle MMIO loads after user space emulation
* or in-kernel IO emulation
@@ -84,8 +109,11 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
unsigned int len;
int mask;
- /* Detect an already handled MMIO return */
- if (unlikely(!vcpu->mmio_needed))
+ /*
+ * Detect if the MMIO return was already handled or if userspace aborted
+ * the MMIO access.
+ */
+ if (unlikely(!vcpu->mmio_needed || kvm_pending_sync_exception(vcpu)))
return 1;
vcpu->mmio_needed = 0;
--
2.47.0.163.g1226f6d8fa-goog
The patch titled
Subject: sched/numa: fix the potential null pointer dereference in task_numa_work()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Shawn Wang <shawnwang(a)linux.alibaba.com>
Subject: sched/numa: fix the potential null pointer dereference in task_numa_work()
Date: Fri, 25 Oct 2024 10:22:08 +0800
When running stress-ng-vm-segv test, we found a null pointer dereference
error in task_numa_work(). Here is the backtrace:
[323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
......
[323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se
......
[323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[323676.067115] pc : vma_migratable+0x1c/0xd0
[323676.067122] lr : task_numa_work+0x1ec/0x4e0
[323676.067127] sp : ffff8000ada73d20
[323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010
[323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000
[323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000
[323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8
[323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035
[323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8
[323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4
[323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001
[323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000
[323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000
[323676.067152] Call trace:
[323676.067153] vma_migratable+0x1c/0xd0
[323676.067155] task_numa_work+0x1ec/0x4e0
[323676.067157] task_work_run+0x78/0xd8
[323676.067161] do_notify_resume+0x1ec/0x290
[323676.067163] el0_svc+0x150/0x160
[323676.067167] el0t_64_sync_handler+0xf8/0x128
[323676.067170] el0t_64_sync+0x17c/0x180
[323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000)
[323676.067177] SMP: stopping secondary CPUs
[323676.070184] Starting crashdump kernel...
stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error
handling function of the system, which tries to cause a SIGSEGV error on
return from unmapping the whole address space of the child process.
Normally this program will not cause kernel crashes. But before the
munmap system call returns to user mode, a potential task_numa_work() for
numa balancing could be added and executed. In this scenario, since the
child process has no vma after munmap, the vma_next() in task_numa_work()
will return a null pointer even if the vma iterator restarts from 0.
Recheck the vma pointer before dereferencing it in task_numa_work().
Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.c…
Fixes: 214dbc428137 ("sched: convert to vma iterator")
Signed-off-by: Shawn Wang <shawnwang(a)linux.alibaba.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Cc: Valentin Schneider <vschneid(a)redhat.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: <stable(a)vger.kernel.org> [6.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c~sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work
+++ a/kernel/sched/fair.c
@@ -3369,7 +3369,7 @@ retry_pids:
vma = vma_next(&vmi);
}
- do {
+ for (; vma; vma = vma_next(&vmi)) {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
@@ -3491,7 +3491,7 @@ retry_pids:
*/
if (vma_pids_forced)
break;
- } for_each_vma(vmi, vma);
+ }
/*
* If no VMAs are remaining and VMAs were skipped due to the PID
_
Patches currently in -mm which might be from shawnwang(a)linux.alibaba.com are
sched-numa-fix-the-potential-null-pointer-dereference-in-task_numa_work.patch
Commit 723e8462a4fe ("pinctrl: qcom: spmi-gpio: Fix the GPIO strength
mapping") fixed a long-standing issue in the Qualcomm SPMI PMIC gpio
driver which had the 'low' and 'high' drive strength settings switched
but failed to update the debugfs interface which still gets this wrong.
Fix the debugfs code so that the exported values match the hardware
settings.
Note that this probably means that most devicetrees that try to describe
the firmware settings got this wrong if the settings were derived from
debugfs. Before the above mentioned commit the settings would have
actually matched the firmware settings even if they were described
incorrectly, but now they are inverted.
Fixes: 723e8462a4fe ("pinctrl: qcom: spmi-gpio: Fix the GPIO strength mapping")
Fixes: eadff3024472 ("pinctrl: Qualcomm SPMI PMIC GPIO pin controller driver")
Cc: Anjelique Melendez <quic_amelende(a)quicinc.com>
Cc: stable(a)vger.kernel.org # 3.19
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/pinctrl/qcom/pinctrl-spmi-gpio.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c b/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c
index 3d03293f6320..3a12304e2b7d 100644
--- a/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c
+++ b/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c
@@ -667,7 +667,7 @@ static void pmic_gpio_config_dbg_show(struct pinctrl_dev *pctldev,
"push-pull", "open-drain", "open-source"
};
static const char *const strengths[] = {
- "no", "high", "medium", "low"
+ "no", "low", "medium", "high"
};
pad = pctldev->desc->pins[pin].drv_data;
--
2.45.2
The quilt patch titled
Subject: mm/page_alloc: fix NUMA stats update for cpu-less nodes
has been removed from the -mm tree. Its filename was
mm-page_alloc-fix-numa-stats-update-for-cpu-less-nodes.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Dongjoo Seo <dongjoo.linux.dev(a)gmail.com>
Subject: mm/page_alloc: fix NUMA stats update for cpu-less nodes
Date: Wed, 23 Oct 2024 10:50:37 -0700
In the case of memoryless node, when a process prefers a node with no
memory(e.g., because it is running on a CPU local to that node), the
kernel treats a nearby node with memory as the preferred node. As a
result, such allocations do not increment the numa_foreign counter on the
memoryless node, leading to skewed NUMA_HIT, NUMA_MISS, and NUMA_FOREIGN
stats for the nearest node.
This patch corrects this issue by:
1. Checking if the zone or preferred zone is CPU-less before updating
the NUMA stats.
2. Ensuring NUMA_HIT is only updated if the zone is not CPU-less.
3. Ensuring NUMA_FOREIGN is only updated if the preferred zone is not
CPU-less.
Example Before and After Patch:
- Before Patch:
node0 node1 node2
numa_hit 86333181 114338269 5108
numa_miss 5199455 0 56844591
numa_foreign 32281033 29763013 0
interleave_hit 91 91 0
local_node 86326417 114288458 0
other_node 5206219 49768 56849702
- After Patch:
node0 node1 node2
numa_hit 2523058 9225528 0
numa_miss 150213 10226 21495942
numa_foreign 17144215 4501270 0
interleave_hit 91 94 0
local_node 2493918 9208226 0
other_node 179351 27528 21495942
Similarly, in the context of cpuless nodes, this patch ensures that NUMA
statistics are accurately updated by adding checks to prevent the
miscounting of memory allocations when the involved nodes have no CPUs.
This ensures more precise tracking of memory access patterns across all
nodes, regardless of whether they have CPUs or not, improving the overall
reliability of NUMA stat. The reason is that page allocation from
dev_dax, cpuset, memcg .. comes with preferred allocating zone in cpuless
node and it's hard to track the zone info for miss information.
Link: https://lkml.kernel.org/r/20241023175037.9125-1-dongjoo.linux.dev@gmail.com
Signed-off-by: Dongjoo Seo <dongjoo.linux.dev(a)gmail.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Fan Ni <nifan(a)outlook.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Adam Manzanares <a.manzanares(a)samsung.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-fix-numa-stats-update-for-cpu-less-nodes
+++ a/mm/page_alloc.c
@@ -2858,19 +2858,21 @@ static inline void zone_statistics(struc
{
#ifdef CONFIG_NUMA
enum numa_stat_item local_stat = NUMA_LOCAL;
+ bool z_is_cpuless = !node_state(zone_to_nid(z), N_CPU);
+ bool pref_is_cpuless = !node_state(zone_to_nid(preferred_zone), N_CPU);
- /* skip numa counters update if numa stats is disabled */
if (!static_branch_likely(&vm_numa_stat_key))
return;
- if (zone_to_nid(z) != numa_node_id())
+ if (zone_to_nid(z) != numa_node_id() || z_is_cpuless)
local_stat = NUMA_OTHER;
- if (zone_to_nid(z) == zone_to_nid(preferred_zone))
+ if (zone_to_nid(z) == zone_to_nid(preferred_zone) && !z_is_cpuless)
__count_numa_events(z, NUMA_HIT, nr_account);
else {
__count_numa_events(z, NUMA_MISS, nr_account);
- __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
+ if (!pref_is_cpuless)
+ __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
}
__count_numa_events(z, local_stat, nr_account);
#endif
_
Patches currently in -mm which might be from dongjoo.linux.dev(a)gmail.com are
Changes since v1 [1]:
- Fix some misspellings missed by checkpatch in changelogs (Jonathan)
- Add comments explaining the order of objects in drivers/cxl/Makefile
(Jonathan)
- Rename attach_device => cxl_rescan_attach (Jonathan)
- Fixup Zijun's email (Zijun)
[1]: http://lore.kernel.org/172862483180.2150669.5564474284074502692.stgit@dwill…
---
Original cover:
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to
delayed arrival of the CXL "root" infrastructure [1] prompted questions
of how the existing mechanism for retrying cxl_mem_probe() could be
failing.
The critical missing piece in the debug was that Gregory's setup had
almost all CXL modules built-in to the kernel.
On the way to that discovery several other bugs and init-order corner
cases were discovered.
The main fix is to make sure the drivers/cxl/Makefile object order
supports root CXL ports being fully initialized upon cxl_acpi_probe()
exit. The modular case has some similar potential holes that are fixed
with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update
cxl_test to reproduce the original report resulted in the discovery of a
separate long standing use after free bug in cxl_region_detach().
[2]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
---
Dan Williams (6):
cxl/port: Fix CXL port initialization order when the subsystem is built-in
cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
cxl/port: Prevent out-of-order decoder allocation
cxl/test: Improve init-order fidelity relative to real-world systems
drivers/base/core.c | 35 +++++++
drivers/cxl/Kconfig | 1
drivers/cxl/Makefile | 20 +++-
drivers/cxl/acpi.c | 7 +
drivers/cxl/core/hdm.c | 50 +++++++++--
drivers/cxl/core/port.c | 13 ++-
drivers/cxl/core/region.c | 91 ++++++++++---------
drivers/cxl/cxl.h | 3 -
include/linux/device.h | 3 +
tools/testing/cxl/test/cxl.c | 200 +++++++++++++++++++++++-------------------
tools/testing/cxl/test/mem.c | 1
11 files changed, 269 insertions(+), 155 deletions(-)
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
KASAN reports that the GPU metrics table allocated in
vangogh_tables_init() is not large enough for the memset done in
smu_cmn_init_soft_gpu_metrics(). Condensed report follows:
[ 33.861314] BUG: KASAN: slab-out-of-bounds in smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu]
[ 33.861799] Write of size 168 at addr ffff888129f59500 by task mangoapp/1067
...
[ 33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: G W 6.12.0-rc4 #356 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544
[ 33.861816] Tainted: [W]=WARN
[ 33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 12/01/2023
[ 33.861822] Call Trace:
[ 33.861826] <TASK>
[ 33.861829] dump_stack_lvl+0x66/0x90
[ 33.861838] print_report+0xce/0x620
[ 33.861853] kasan_report+0xda/0x110
[ 33.862794] kasan_check_range+0xfd/0x1a0
[ 33.862799] __asan_memset+0x23/0x40
[ 33.862803] smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.863306] vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.864257] vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.865682] amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.866160] amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.867135] dev_attr_show+0x43/0xc0
[ 33.867147] sysfs_kf_seq_show+0x1f1/0x3b0
[ 33.867155] seq_read_iter+0x3f8/0x1140
[ 33.867173] vfs_read+0x76c/0xc50
[ 33.867198] ksys_read+0xfb/0x1d0
[ 33.867214] do_syscall_64+0x90/0x160
...
[ 33.867353] Allocated by task 378 on cpu 7 at 22.794876s:
[ 33.867358] kasan_save_stack+0x33/0x50
[ 33.867364] kasan_save_track+0x17/0x60
[ 33.867367] __kasan_kmalloc+0x87/0x90
[ 33.867371] vangogh_init_smc_tables+0x3f9/0x840 [amdgpu]
[ 33.867835] smu_sw_init+0xa32/0x1850 [amdgpu]
[ 33.868299] amdgpu_device_init+0x467b/0x8d90 [amdgpu]
[ 33.868733] amdgpu_driver_load_kms+0x19/0xf0 [amdgpu]
[ 33.869167] amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu]
[ 33.869608] local_pci_probe+0xda/0x180
[ 33.869614] pci_device_probe+0x43f/0x6b0
Empirically we can confirm that the former allocates 152 bytes for the
table, while the latter memsets the 168 large block.
This is somewhat alleviated by the fact that allocation goes into a 192
SLAB bucket, but then for v3_0 metrics the table grows to 264 bytes which
would definitely be a problem.
Root cause appears that when GPU metrics tables for v2_4 parts were added
it was not considered to enlarge the table to fit.
The fix in this patch is rather "brute force" and perhaps later should be
done in a smarter way, by extracting and consolidating the part version to
size logic to a common helper, instead of brute forcing the largest
possible allocation. Nevertheless, for now this works and fixes the out of
bounds write.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics")
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Evan Quan <evan.quan(a)amd.com>
Cc: Wenyou Yang <WenYou.Yang(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: <stable(a)vger.kernel.org> # v6.6+
---
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
index 22737b11b1bf..36f4a4651918 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
@@ -242,7 +242,10 @@ static int vangogh_tables_init(struct smu_context *smu)
goto err0_out;
smu_table->metrics_time = 0;
- smu_table->gpu_metrics_table_size = max(sizeof(struct gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2));
+ smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v2_2);
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3));
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4));
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v3_0));
smu_table->gpu_metrics_table = kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
if (!smu_table->gpu_metrics_table)
goto err1_out;
--
2.46.0
The previous implementation limited the tracing capabilities when perf
was run in the init PID namespace, making it impossible to trace
applications in non-init PID namespaces.
This update improves the tracing process by verifying the event owner.
This allows us to determine whether the user has the necessary
permissions to trace the application.
Cc: stable(a)vger.kernel.org
Fixes: aab473867fed ("coresight: etm4x: Don't trace PID for non-root PID namespace")
Signed-off-by: Julien Meunier <julien.meunier(a)nokia.com>
---
drivers/hwtracing/coresight/coresight-etm4x-core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index bf01f01964cf..8365307b1aec 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -695,7 +695,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
/* Only trace contextID when runs in root PID namespace */
if ((attr->config & BIT(ETM_OPT_CTXTID)) &&
- task_is_in_init_pid_ns(current))
+ task_is_in_init_pid_ns(event->owner))
/* bit[6], Context ID tracing bit */
config->cfg |= TRCCONFIGR_CID;
@@ -710,7 +710,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
goto out;
}
/* Only trace virtual contextID when runs in root PID namespace */
- if (task_is_in_init_pid_ns(current))
+ if (task_is_in_init_pid_ns(event->owner))
config->cfg |= TRCCONFIGR_VMID | TRCCONFIGR_VMIDOPT;
}
--
2.34.1
On Fri, Oct 25, 2024 at 06:06:47PM +0200, Nirmoy Das wrote:
> On 10/24/2024 7:22 PM, Matthew Brost wrote:
>
> On Thu, Oct 24, 2024 at 10:14:21AM -0700, John Harrison wrote:
>
> On 10/24/2024 08:18, Nirmoy Das wrote:
>
> Flush xe ordered_wq in case of ufence timeout which is observed
> on LNL and that points to the recent scheduling issue with E-cores.
>
> This is similar to the recent fix:
> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
> response timeout") and should be removed once there is E core
> scheduling fix.
>
> v2: Add platform check(Himal)
> s/__flush_workqueue/flush_workqueue(Jani)
>
> Cc: Badal Nilawar [1]<badal.nilawar(a)intel.com>
> Cc: Jani Nikula [2]<jani.nikula(a)intel.com>
> Cc: Matthew Auld [3]<matthew.auld(a)intel.com>
> Cc: John Harrison [4]<John.C.Harrison(a)Intel.com>
> Cc: Himal Prasad Ghimiray [5]<himal.prasad.ghimiray(a)intel.com>
> Cc: Lucas De Marchi [6]<lucas.demarchi(a)intel.com>
> Cc: [7]<stable(a)vger.kernel.org> # v6.11+
> Link: [8]https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
> Suggested-by: Matthew Brost [9]<matthew.brost(a)intel.com>
> Signed-off-by: Nirmoy Das [10]<nirmoy.das(a)intel.com>
> Reviewed-by: Matthew Brost [11]<matthew.brost(a)intel.com>
> ---
> drivers/gpu/drm/xe/xe_wait_user_fence.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wai
> t_user_fence.c
> index f5deb81eba01..78a0ad3c78fe 100644
> --- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
> +++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
> @@ -13,6 +13,7 @@
> #include "xe_device.h"
> #include "xe_gt.h"
> #include "xe_macros.h"
> +#include "compat-i915-headers/i915_drv.h"
> #include "xe_exec_queue.h"
> static int do_compare(u64 addr, u64 value, u64 mask, u16 op)
> @@ -155,6 +156,19 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *
> data,
> }
> if (!timeout) {
> + if (IS_LUNARLAKE(xe)) {
> + /*
> + * This is analogous to e51527233804 ("drm/xe/gu
> c/ct: Flush g2h
> + * worker in case of g2h response timeout")
> + *
> + * TODO: Drop this change once workqueue schedul
> ing delay issue is
> + * fixed on LNL Hybrid CPU.
> + */
> + flush_workqueue(xe->ordered_wq);
>
> If we are having multiple instances of this workaround, can we wrap them up
> in as 'LNL_FLUSH_WORKQUEUE(q)' or some such? Put the IS_LNL check inside the
> macro and make it pretty obvious exactly where all the instances are by
> having a single macro name to search for.
>
>
> +1, I think Lucas is suggesting something similar to this on the chat to
> make sure we don't lose track of removing these W/A when this gets
> fixed.
>
> Matt
>
> Sounds good. I will add LNL_FLUSH_WORKQUEUE() and use that for all the
> places we need this WA.
>
You will need 2 macros...
- LNL_FLUSH_WORKQUEUE() which accepts xe_device, workqueue_struct
- LNL_FLUSH_WORK() which accepts xe_device, work_struct
Matt
> Regards,
>
> Nirmoy
>
>
>
> John.
>
>
> + err = do_compare(addr, args->value, args->mask,
> args->op);
> + if (err <= 0)
> + break;
> + }
> err = -ETIME;
> break;
> }
>
> References
>
> 1. mailto:badal.nilawar@intel.com
> 2. mailto:jani.nikula@intel.com
> 3. mailto:matthew.auld@intel.com
> 4. mailto:John.C.Harrison@Intel.com
> 5. mailto:himal.prasad.ghimiray@intel.com
> 6. mailto:lucas.demarchi@intel.com
> 7. mailto:stable@vger.kernel.org
> 8. https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
> 9. mailto:matthew.brost@intel.com
> 10. mailto:nirmoy.das@intel.com
> 11. mailto:matthew.brost@intel.com
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
KASAN reports that the GPU metrics table allocated in
vangogh_tables_init() is not large enough for the memset done in
smu_cmn_init_soft_gpu_metrics(). Condensed report follows:
[ 33.861314] BUG: KASAN: slab-out-of-bounds in smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu]
[ 33.861799] Write of size 168 at addr ffff888129f59500 by task mangoapp/1067
...
[ 33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: G W 6.12.0-rc4 #356 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544
[ 33.861816] Tainted: [W]=WARN
[ 33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 12/01/2023
[ 33.861822] Call Trace:
[ 33.861826] <TASK>
[ 33.861829] dump_stack_lvl+0x66/0x90
[ 33.861838] print_report+0xce/0x620
[ 33.861853] kasan_report+0xda/0x110
[ 33.862794] kasan_check_range+0xfd/0x1a0
[ 33.862799] __asan_memset+0x23/0x40
[ 33.862803] smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.863306] vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.864257] vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.865682] amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.866160] amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[ 33.867135] dev_attr_show+0x43/0xc0
[ 33.867147] sysfs_kf_seq_show+0x1f1/0x3b0
[ 33.867155] seq_read_iter+0x3f8/0x1140
[ 33.867173] vfs_read+0x76c/0xc50
[ 33.867198] ksys_read+0xfb/0x1d0
[ 33.867214] do_syscall_64+0x90/0x160
...
[ 33.867353] Allocated by task 378 on cpu 7 at 22.794876s:
[ 33.867358] kasan_save_stack+0x33/0x50
[ 33.867364] kasan_save_track+0x17/0x60
[ 33.867367] __kasan_kmalloc+0x87/0x90
[ 33.867371] vangogh_init_smc_tables+0x3f9/0x840 [amdgpu]
[ 33.867835] smu_sw_init+0xa32/0x1850 [amdgpu]
[ 33.868299] amdgpu_device_init+0x467b/0x8d90 [amdgpu]
[ 33.868733] amdgpu_driver_load_kms+0x19/0xf0 [amdgpu]
[ 33.869167] amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu]
[ 33.869608] local_pci_probe+0xda/0x180
[ 33.869614] pci_device_probe+0x43f/0x6b0
Empirically we can confirm that the former allocates 152 bytes for the
table, while the latter memsets the 168 large block.
Root cause appears that when GPU metrics tables for v2_4 parts were added
it was not considered to enlarge the table to fit.
The fix in this patch is rather "brute force" and perhaps later should be
done in a smarter way, by extracting and consolidating the part version to
size logic to a common helper, instead of brute forcing the largest
possible allocation. Nevertheless, for now this works and fixes the out of
bounds write.
v2:
* Drop impossible v3_0 case. (Mario)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics")
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Evan Quan <evan.quan(a)amd.com>
Cc: Wenyou Yang <WenYou.Yang(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: <stable(a)vger.kernel.org> # v6.6+
---
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
index 22737b11b1bf..1fe020f1f4db 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
@@ -242,7 +242,9 @@ static int vangogh_tables_init(struct smu_context *smu)
goto err0_out;
smu_table->metrics_time = 0;
- smu_table->gpu_metrics_table_size = max(sizeof(struct gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2));
+ smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v2_2);
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3));
+ smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4));
smu_table->gpu_metrics_table = kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
if (!smu_table->gpu_metrics_table)
goto err1_out;
--
2.46.0
In an event where the platform running the device control plane
is rebooted, reset is detected on the driver. It releases
all the resources and waits for the reset to complete. Once the
reset is done, it tries to build the resources back. At this
time if the device control plane is not yet started, then
the driver timeouts on the virtchnl message and retries to
establish the mailbox again.
In the retry flow, mailbox is deinitialized but the mailbox
workqueue is still alive and polling for the mailbox message.
This results in accessing the released control queue leading to
null-ptr-deref. Fix it by unrolling the work queue cancellation
and mailbox deinitialization in the order which they got
initialized.
Also remove the redundant scheduling of the mailbox task in
idpf_vc_core_init.
Fixes: 4930fbf419a7 ("idpf: add core init and interrupt request")
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager")
Cc: stable(a)vger.kernel.org # 6.9+
Reviewed-by: Tarun K Singh <tarun.k.singh(a)intel.com>
Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga(a)intel.com>
---
drivers/net/ethernet/intel/idpf/idpf_lib.c | 1 +
drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 7 -------
2 files changed, 1 insertion(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
index c3848e10e7db..b4fbb99bfad2 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
@@ -1786,6 +1786,7 @@ static int idpf_init_hard_reset(struct idpf_adapter *adapter)
*/
err = idpf_vc_core_init(adapter);
if (err) {
+ cancel_delayed_work_sync(&adapter->mbx_task);
idpf_deinit_dflt_mbx(adapter);
goto unlock_mutex;
}
diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
index 3be883726b87..d77d6c3805e2 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
@@ -3017,11 +3017,6 @@ int idpf_vc_core_init(struct idpf_adapter *adapter)
goto err_netdev_alloc;
}
- /* Start the mailbox task before requesting vectors. This will ensure
- * vector information response from mailbox is handled
- */
- queue_delayed_work(adapter->mbx_wq, &adapter->mbx_task, 0);
-
queue_delayed_work(adapter->serv_wq, &adapter->serv_task,
msecs_to_jiffies(5 * (adapter->pdev->devfn & 0x07)));
@@ -3046,7 +3041,6 @@ int idpf_vc_core_init(struct idpf_adapter *adapter)
err_intr_req:
cancel_delayed_work_sync(&adapter->serv_task);
- cancel_delayed_work_sync(&adapter->mbx_task);
idpf_vport_params_buf_rel(adapter);
err_netdev_alloc:
kfree(adapter->vports);
@@ -3070,7 +3064,6 @@ int idpf_vc_core_init(struct idpf_adapter *adapter)
adapter->state = __IDPF_VER_CHECK;
if (adapter->vcxn_mngr)
idpf_vc_xn_shutdown(adapter->vcxn_mngr);
- idpf_deinit_dflt_mbx(adapter);
set_bit(IDPF_HR_DRV_LOAD, adapter->flags);
queue_delayed_work(adapter->vc_event_wq, &adapter->vc_event_task,
msecs_to_jiffies(task_delay));
--
2.43.0
PD3.1 spec ("8.3.3.3.3 PE_SNK_Wait_for_Capabilities State") mandates
that the policy engine perform a hard reset when SinkWaitCapTimer
expires. Instead the code explicitly does a GET_SOURCE_CAP when the
timer expires as part of SNK_WAIT_CAPABILITIES_TIMEOUT. Due to this the
following compliance test failures are reported by the compliance tester
(added excerpts from the PD Test Spec):
* COMMON.PROC.PD.2#1:
The Tester receives a Get_Source_Cap Message from the UUT. This
message is valid except the following conditions: [COMMON.PROC.PD.2#1]
a. The check fails if the UUT sends this message before the Tester
has established an Explicit Contract
...
* TEST.PD.PROT.SNK.4:
...
4. The check fails if the UUT does not send a Hard Reset between
tTypeCSinkWaitCap min and max. [TEST.PD.PROT.SNK.4#1] The delay is
between the VBUS present vSafe5V min and the time of the first bit
of Preamble of the Hard Reset sent by the UUT.
For the purpose of interoperability, restrict the quirk introduced in
https://lore.kernel.org/all/20240523171806.223727-1-sebastian.reichel@colla…
to only non self-powered devices as battery powered devices will not
have the issue mentioned in that commit.
Cc: stable(a)vger.kernel.org
Fixes: 122968f8dda8 ("usb: typec: tcpm: avoid resets for missing source capability messages")
Reported-by: Badhri Jagan Sridharan <badhri(a)google.com>
Closes: https://lore.kernel.org/all/CAPTae5LAwsVugb0dxuKLHFqncjeZeJ785nkY4Jfd+M-tCj…
Signed-off-by: Amit Sunil Dhamne <amitsd(a)google.com>
Reviewed-by: Badhri Jagan Sridharan <badhri(a)google.com>
---
drivers/usb/typec/tcpm/tcpm.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c
index d6f2412381cf..c8f467d24fbb 100644
--- a/drivers/usb/typec/tcpm/tcpm.c
+++ b/drivers/usb/typec/tcpm/tcpm.c
@@ -4515,7 +4515,8 @@ static inline enum tcpm_state hard_reset_state(struct tcpm_port *port)
return ERROR_RECOVERY;
if (port->pwr_role == TYPEC_SOURCE)
return SRC_UNATTACHED;
- if (port->state == SNK_WAIT_CAPABILITIES_TIMEOUT)
+ if (port->state == SNK_WAIT_CAPABILITIES ||
+ port->state == SNK_WAIT_CAPABILITIES_TIMEOUT)
return SNK_READY;
return SNK_UNATTACHED;
}
@@ -5043,8 +5044,11 @@ static void run_state_machine(struct tcpm_port *port)
tcpm_set_state(port, SNK_SOFT_RESET,
PD_T_SINK_WAIT_CAP);
} else {
- tcpm_set_state(port, SNK_WAIT_CAPABILITIES_TIMEOUT,
- PD_T_SINK_WAIT_CAP);
+ if (!port->self_powered)
+ upcoming_state = SNK_WAIT_CAPABILITIES_TIMEOUT;
+ else
+ upcoming_state = hard_reset_state(port);
+ tcpm_set_state(port, upcoming_state, PD_T_SINK_WAIT_CAP);
}
break;
case SNK_WAIT_CAPABILITIES_TIMEOUT:
base-commit: c6d9e43954bfa7415a1e9efdb2806ec1d8a8afc8
--
2.47.0.105.g07ac214952-goog
ctrl->dh_key might be used across multiple calls to nvmet_setup_dhgroup()
for the same controller. So it's better to nullify it after release on
error path in order to avoid double free later in nvmet_destroy_auth().
Found by Linux Verification Center (linuxtesting.org) with Svace.
Fixes: 7a277c37d352 ("nvmet-auth: Diffie-Hellman key exchange support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Vitaliy Shevtsov <v.shevtsov(a)maxima.ru>
---
drivers/nvme/target/auth.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/target/auth.c b/drivers/nvme/target/auth.c
index e900525b7866..7bca64de4a2f 100644
--- a/drivers/nvme/target/auth.c
+++ b/drivers/nvme/target/auth.c
@@ -101,6 +101,7 @@ int nvmet_setup_dhgroup(struct nvmet_ctrl *ctrl, u8 dhgroup_id)
pr_debug("%s: ctrl %d failed to generate private key, err %d\n",
__func__, ctrl->cntlid, ret);
kfree_sensitive(ctrl->dh_key);
+ ctrl->dh_key = NULL;
return ret;
}
ctrl->dh_keysize = crypto_kpp_maxsize(ctrl->dh_tfm);
--
2.46.1
As there is very little ordering in the KVM API, userspace can
instanciate a half-baked GIC (missing its memory map, for example)
at almost any time.
This means that, with the right timing, a thread running vcpu-0
can enter the kernel without a GIC configured and get a GIC created
behind its back by another thread. Amusingly, it will pick up
that GIC and start messing with the data structures without the
GIC having been fully initialised.
Similarly, a thread running vcpu-1 can enter the kernel, and try
to init the GIC that was previously created. Since this GIC isn't
properly configured (no memory map), it fails to correctly initialise.
And that's the point where we decide to teardown the GIC, freeing all
its resources. Behind vcpu-0's back. Things stop pretty abruptly,
with a variety of symptoms. Clearly, this isn't good, we should be
a bit more careful about this.
It is obvious that this guest is not viable, as it is missing some
important part of its configuration. So instead of trying to tear
bits of it down, let's just mark it as *dead*. It means that any
further interaction from userspace will result in -EIO. The memory
will be released on the "normal" path, when userspace gives up.
Cc: stable(a)vger.kernel.org
Reported-by: Alexander Potapenko <glider(a)google.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
---
arch/arm64/kvm/arm.c | 3 +++
arch/arm64/kvm/vgic/vgic-init.c | 6 +++---
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a0d01c46e4084..b97ada19f06a7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -997,6 +997,9 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
static int check_vcpu_requests(struct kvm_vcpu *vcpu)
{
if (kvm_request_pending(vcpu)) {
+ if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+ return -EIO;
+
if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
kvm_vcpu_sleep(vcpu);
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index e7c53e8af3d16..c4cbf798e71a4 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -536,10 +536,10 @@ int kvm_vgic_map_resources(struct kvm *kvm)
out:
mutex_unlock(&kvm->arch.config_lock);
out_slots:
- mutex_unlock(&kvm->slots_lock);
-
if (ret)
- kvm_vgic_destroy(kvm);
+ kvm_vm_dead(kvm);
+
+ mutex_unlock(&kvm->slots_lock);
return ret;
}
--
2.39.2
İşlemlerinize ilişkin e-dekontlarınız ekte gönderilmiştir. İyi
günler dileriz. Bu mail otomatik olarak bilgilendirme amacıyla
gönderilmiştir. Lütfen cevaplamayınız.
Türkiye Cumhuriyeti Ziraat Bankası Anonim Şirketi, Finanskent
Mahallesi, Finans Caddesi No:44A Ümraniye/İstanbul
Ticaret Sicil No:475225-5 Mersis No: 0998006967505633 |
www.ziraatbank.com.tr
From: "Jiri Slaby (SUSE)" <jirislaby(a)kernel.org>
[ Upstream commit 602babaa84d627923713acaf5f7e9a4369e77473 ]
Commit af224ca2df29 (serial: core: Prevent unsafe uart port access, part
3) added few uport == NULL checks. It added one to uart_shutdown(), so
the commit assumes, uport can be NULL in there. But right after that
protection, there is an unprotected "uart_port_dtr_rts(uport, false);"
call. That is invoked only if HUPCL is set, so I assume that is the
reason why we do not see lots of these reports.
Or it cannot be NULL at this point at all for some reason :P.
Until the above is investigated, stay on the safe side and move this
dereference to the if too.
I got this inconsistency from Coverity under CID 1585130. Thanks.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby(a)kernel.org>
Cc: Peter Hurley <peter(a)hurleysoftware.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Link: https://lore.kernel.org/r/20240805102046.307511-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
[Adapted over commit 5701cb8bf50e ("tty: Call ->dtr_rts() parameter
active consistently") not in the tree]
Signed-off-by: Tomas Krcka <krckatom(a)amazon.de>
---
drivers/tty/serial/serial_core.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index 58e857fb8dee..fd9f635dc247 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -339,14 +339,16 @@ static void uart_shutdown(struct tty_struct *tty, struct uart_state *state)
/*
* Turn off DTR and RTS early.
*/
- if (uport && uart_console(uport) && tty) {
- uport->cons->cflag = tty->termios.c_cflag;
- uport->cons->ispeed = tty->termios.c_ispeed;
- uport->cons->ospeed = tty->termios.c_ospeed;
- }
+ if (uport) {
+ if (uart_console(uport) && tty) {
+ uport->cons->cflag = tty->termios.c_cflag;
+ uport->cons->ispeed = tty->termios.c_ispeed;
+ uport->cons->ospeed = tty->termios.c_ospeed;
+ }
- if (!tty || C_HUPCL(tty))
- uart_port_dtr_rts(uport, 0);
+ if (!tty || C_HUPCL(tty))
+ uart_port_dtr_rts(uport, 0);
+ }
uart_port_shutdown(port);
}
--
2.40.1
From: Si-Wei Liu <si-wei.liu(a)oracle.com>
When calculating the physical address range based on the iotlb and mr
[start,end) ranges, the offset of mr->start relative to map->start
is not taken into account. This leads to some incorrect and duplicate
mappings.
For the case when mr->start < map->start the code is already correct:
the range in [mr->start, map->start) was handled by a different
iteration.
Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code")
Cc: stable(a)vger.kernel.org
Signed-off-by: Si-Wei Liu <si-wei.liu(a)oracle.com>
Signed-off-by: Dragos Tatulea <dtatulea(a)nvidia.com>
---
drivers/vdpa/mlx5/core/mr.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 2dd21e0b399e..7d0c83b5b071 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -373,7 +373,7 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr
struct page *pg;
unsigned int nsg;
int sglen;
- u64 pa;
+ u64 pa, offset;
u64 paend;
struct scatterlist *sg;
struct device *dma = mvdev->vdev.dma_dev;
@@ -396,8 +396,10 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr
sg = mr->sg_head.sgl;
for (map = vhost_iotlb_itree_first(iotlb, mr->start, mr->end - 1);
map; map = vhost_iotlb_itree_next(map, mr->start, mr->end - 1)) {
- paend = map->addr + maplen(map, mr);
- for (pa = map->addr; pa < paend; pa += sglen) {
+ offset = mr->start > map->start ? mr->start - map->start : 0;
+ pa = map->addr + offset;
+ paend = map->addr + offset + maplen(map, mr);
+ for (; pa < paend; pa += sglen) {
pg = pfn_to_page(__phys_to_pfn(pa));
if (!sg) {
mlx5_vdpa_warn(mvdev, "sg null. start 0x%llx, end 0x%llx\n",
--
2.46.1
On a x86 system under test with 1780 CPUs, topology_span_sane() takes
around 8 seconds cumulatively for all the iterations. It is an expensive
operation which does the sanity of non-NUMA topology masks.
CPU topology is not something which changes very frequently hence make
this check optional for the systems where the topology is trusted and
need faster bootup.
Restrict this to SCHED_DEBUG builds so that this penalty can be avoided
for the systems who wants to avoid it.
Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
Signed-off-by: Saurabh Sengar <ssengar(a)linux.microsoft.com>
---
kernel/sched/topology.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..dacc8c6f978b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2354,6 +2354,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
return sd;
}
+#ifdef CONFIG_SCHED_DEBUG
/*
* Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
* any two given CPUs at this (non-NUMA) topology level.
@@ -2387,6 +2388,7 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
return true;
}
+#endif
/*
* Build sched domains for a given set of CPUs and attach the sched domains
@@ -2417,8 +2419,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
sd = NULL;
for_each_sd_topology(tl) {
+#ifdef CONFIG_SCHED_DEBUG
if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
goto error;
+#endif
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
--
2.43.0
Hello,
I'd like to report a regression to menuconfig.
In "Device Drivers" ---> "Input device support"
There used to be submenus for keyboards, mice etc.
Now, only the entry "Hardware I/O ports" remains.
They can still be accessed (and configured) by searching for them,
then pressing the corresponding number:
/Keyboard
/KEYBOARD_ATKBD
I also determined the commit:
f79dc03fe68c79d388908182e68d702f7f1786bc
kconfig: refactor choice value calculation
#regzbot introduced: f79dc03fe68c
I've only encountered this here but it is not impossible other entries
elsewhere in menuconfig might be missing aswell.
The issue is present in stable 6.11.5 and mainline.
Kind regards,
Edmund Raile.
Hello,
I'd like to report a regression in firewire-ohci that results
in the kernel hardlocking when re-discovering a FireWire device.
TI XIO2213B
RME FireFace 800
It will occur under three conditions:
* power-cycling the FireWire device
* un- and re-plugging the FireWire device
* suspending and then waking the PC
Often it would also occur directly on boot in QEMU but I have not
yet observed this specific behavior on bare metal.
Here is an excerpt from the stack trace (don't know whether it is
acceptable to send in full):
kernel: ------------[ cut here ]------------
kernel: refcount_t: addition on 0; use-after-free.
kernel: WARNING: CPU: 3 PID: 116 at lib/refcount.c:25
refcount_warn_saturate (/build/linux/lib/refcount.c:25 (discriminator
1))
kernel: Workqueue: firewire_ohci bus_reset_work
kernel: RIP: 0010:refcount_warn_saturate
(/build/linux/lib/refcount.c:25 (discriminator 1))
kernel: Call Trace:
kernel: <TASK>
kernel: ? refcount_warn_saturate (/build/linux/lib/refcount.c:25
(discriminator 1))
kernel: ? __warn.cold (/build/linux/kernel/panic.c:693)
kernel: ? refcount_warn_saturate (/build/linux/lib/refcount.c:25
(discriminator 1))
kernel: ? report_bug (/build/linux/lib/bug.c:180
/build/linux/lib/bug.c:219)
kernel: ? handle_bug (/build/linux/arch/x86/kernel/traps.c:218)
kernel: ? exc_invalid_op (/build/linux/arch/x86/kernel/traps.c:260
(discriminator 1))
kernel: ? asm_exc_invalid_op
(/build/linux/./arch/x86/include/asm/idtentry.h:621)
kernel: ? refcount_warn_saturate (/build/linux/lib/refcount.c:25
(discriminator 1))
kernel: for_each_fw_node (/build/linux/./include/linux/refcount.h:190
/build/linux/./include/linux/refcount.h:241
/build/linux/./include/linux/refcount.h:258
/build/linux/drivers/firewire/core.h:199
/build/linux/drivers/firewire/core-topology.c:275)
kernel: ? __pfx_report_found_node (/build/linux/drivers/firewire/core-
topology.c:312)
kernel: fw_core_handle_bus_reset (/build/linux/drivers/firewire/core-
topology.c:399 (discriminator 1) /build/linux/drivers/firewire/core-
topology.c:504 (discriminator 1))
kernel: bus_reset_work (/build/linux/drivers/firewire/ohci.c:2121)
kernel: process_one_work
(/build/linux/./arch/x86/include/asm/jump_label.h:27
/build/linux/./include/linux/jump_label.h:207
/build/linux/./include/trace/events/workqueue.h:110
/build/linux/kernel/workqueue.c:3236)
kernel: worker_thread (/build/linux/kernel/workqueue.c:3306
(discriminator 2) /build/linux/kernel/workqueue.c:3393 (discriminator
2))
kernel: ? __pfx_worker_thread (/build/linux/kernel/workqueue.c:3339)
kernel: kthread (/build/linux/kernel/kthread.c:389)
kernel: ? __pfx_kthread (/build/linux/kernel/kthread.c:342)
kernel: ret_from_fork (/build/linux/arch/x86/kernel/process.c:153)
kernel: ? __pfx_kthread (/build/linux/kernel/kthread.c:342)
kernel: ret_from_fork_asm (/build/linux/arch/x86/entry/entry_64.S:254)
kernel: </TASK>
I have identified the commit via bisection:
24b7f8e5cd656196a13077e160aec45ad89b58d9
firewire: core: use helper functions for self ID sequence
It was part of the following patch series:
firewire: add tracepoints events for self ID sequence
https://lore.kernel.org/all/20240605235155.116468-6-o-takashi@sakamocchi.jp/
#regzbot introduced: 24b7f8e5cd65
Since this was before v6.10-rc5 and stable 6.10.14 is EOL,
stable v6.11.5 and mainline are affected.
Reversion appears to be non-trivial as it is part of a patch
series, other files have been altered as well and other commits
build on top of it.
Call chain:
core-topology.c fw_core_handle_bus_reset()
-> core-topology.c for_each_fw_node(card, local_node,
report_found_node)
-> core.h fw_node_get(root)
-> refcount.h __refcount_inc(&node)
-> refcount.h __refcount_add(1, r, oldp);
-> refcount.h refcount_warn_saturate(r, REFCOUNT_ADD_UAF);
-> refcount.h REFCOUNT_WARN("addition on 0; use-after-free")
Since local_node of fw_core_handle_bus_reset() is retrieved by
local_node = build_tree(card, self_ids, self_id_count);
build_tree() needs to be looked at, it was indeed altered by
24b7f8e5cd65.
After a hard 3 hour look traversing all used functions and comparing
against the original function (as of e404cacfc5ed), this caught my eye:
for (port_index = 0; port_index < total_port_count;
++port_index) {
switch (port_status) {
case PHY_PACKET_SELF_ID_PORT_STATUS_PARENT:
node->color = i;
In both for loops, "port_index" was replaced by "i"
"i" remains in use above:
for (i = 0, h = &stack; i < child_port_count; i++)
h = h->prev;
While the original also used the less descriptive i in the loop
for (i = 0; i < port_count; i++) {
switch (get_port_type(sid, i)) {
case SELFID_PORT_PARENT:
node->color = i;
but reset it to 0 at the beginning of the loop.
So the stray "i" in the for loop should be replaced with the loop
iterator "port_index" as it is meant to be synchronous with the
loop iterator (i.e. the port_index), no?
diff --git a/drivers/firewire/core-topology.c b/drivers/firewire/core-
topology.c
index 8c10f47cc8fc..7fd91ba9c9c4 100644
--- a/drivers/firewire/core-topology.c
+++ b/drivers/firewire/core-topology.c
@@ -207,7 +207,7 @@ static struct fw_node *build_tree(struct fw_card
*card, const u32 *sid, int self
// the node->ports array where the
parent node should be. Later,
// when we handle the parent node, we
fix up the reference.
++parent_count;
- node->color = i;
+ node->color = port_index;
break;
What threw me off was discaridng node->color as it would be replaced
later anyways (can't be important!), or so I thought.
Please tell me, is this line of reasoning correct or am I missing
something?
Compiling 24b7f8e5cd65 and later mainline with the patch above
resulted in a kernel that didn't crash!
In case my solution should turn out to be correct, I will gladly
submit the patch.
Kind regards,
Edmund Raile.
With the transition of pd-mapper into the kernel, the timing was altered
such that on some targets the initial rpmsg_send() requests from
pmic_glink clients would be attempted before the firmware had announced
intents, and the firmware reject intent requests.
Fix this
Signed-off-by: Bjorn Andersson <bjorn.andersson(a)oss.qualcomm.com>
---
Changes in v2:
- Introduced "intents" and fixed a few spelling mistakes in the commit
message of patch 1
- Cleaned up log snippet in commit message of patch 2, added battery
manager log
- Changed the arbitrary 10 second timeout to 5... Ought to be enough for
anybody.
- Added a small sleep in the send-loop in patch 2, and by that
refactored the loop completely.
- Link to v1: https://lore.kernel.org/r/20241022-pmic-glink-ecancelled-v1-0-9e26fc74e0a3@…
---
Bjorn Andersson (2):
rpmsg: glink: Handle rejected intent request better
soc: qcom: pmic_glink: Handle GLINK intent allocation rejections
drivers/rpmsg/qcom_glink_native.c | 10 +++++++---
drivers/soc/qcom/pmic_glink.c | 25 ++++++++++++++++++++++---
2 files changed, 29 insertions(+), 6 deletions(-)
---
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
change-id: 20241022-pmic-glink-ecancelled-d899a9ca0358
Best regards,
--
Bjorn Andersson <bjorn.andersson(a)oss.qualcomm.com>
Adds support for detecting and reporting the speed of unaligned vector
accesses on RISC-V CPUs. Adds vec_misaligned_speed key to the hwprobe
adds Zicclsm to cpufeature and fixes the check for scalar unaligned
emulated all CPUs. The vec_misaligned_speed key keeps the same format
as the scalar unaligned access speed key.
This set does not emulate unaligned vector accesses on CPUs that do not
support them. Only reports if userspace can run them and speed of
unaligned vector accesses if supported.
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
Signed-off-by: Jesse Taube <jesse(a)rivosinc.com>
---
Changes in V6:
Added ("RISC-V: Scalar unaligned access emulated on hotplug CPUs")
Changes in V8:
Dropped Zicclsm
s/RISCV_HWPROBE_VECTOR_MISALIGNED/RISCV_HWPROBE_MISALIGNED_VECTOR/g
to match RISCV_HWPROBE_MISALIGNED_SCALAR_*
Rebased onto palmer/fixes (32d5f7add080a936e28ab4142bfeea6b06999789)
Changes in V9:
Missed a RISCV_HWPROBE_VECTOR_MISALIGNED...
Changes in V10:
- I sent on behalf of Jesse
- Remove v0 from clobber args in inline asm and leave comment
---
Jesse Taube (6):
RISC-V: Check scalar unaligned access on all CPUs
RISC-V: Scalar unaligned access emulated on hotplug CPUs
RISC-V: Replace RISCV_MISALIGNED with RISCV_SCALAR_MISALIGNED
RISC-V: Detect unaligned vector accesses supported
RISC-V: Report vector unaligned access speed hwprobe
RISC-V: hwprobe: Document unaligned vector perf key
Documentation/arch/riscv/hwprobe.rst | 16 +++
arch/riscv/Kconfig | 58 ++++++++++-
arch/riscv/include/asm/cpufeature.h | 10 +-
arch/riscv/include/asm/entry-common.h | 11 --
arch/riscv/include/asm/hwprobe.h | 2 +-
arch/riscv/include/asm/vector.h | 2 +
arch/riscv/include/uapi/asm/hwprobe.h | 5 +
arch/riscv/kernel/Makefile | 3 +-
arch/riscv/kernel/copy-unaligned.h | 5 +
arch/riscv/kernel/fpu.S | 4 +-
arch/riscv/kernel/sys_hwprobe.c | 41 ++++++++
arch/riscv/kernel/traps_misaligned.c | 139 +++++++++++++++++++++++--
arch/riscv/kernel/unaligned_access_speed.c | 156 +++++++++++++++++++++++++++--
arch/riscv/kernel/vec-copy-unaligned.S | 58 +++++++++++
arch/riscv/kernel/vector.c | 2 +-
15 files changed, 474 insertions(+), 38 deletions(-)
---
base-commit: 98f7e32f20d28ec452afb208f9cffc08448a2652
change-id: 20240920-jesse_unaligned_vector-7083fd28659c
--
- Charlie
The raw value conversion to obtain a measurement in lux as
INT_PLUS_MICRO does not calculate the decimal part properly to display
it as micro (in this case microlux). It only calculates the module to
obtain the decimal part from a resolution that is 10000 times the
provided in the datasheet (0.5376 lux/cnt for the veml6030). The
resulting value must still be multiplied by 100 to make it micro.
This bug was introduced with the original implementation of the driver.
Cc: stable(a)vger.kernel.org
Fixes: 7b779f573c48 ("iio: light: add driver for veml6030 ambient light sensor")
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
I found this almost by chance while testing new supported devices. The
decimal part was always suspiciously small, and when I compared samples
to the expected value according to the datasheet, it became clear what was
going on.
Example with a veml7700 (same resolution as the veml6030):
Resolution for gain = 1/8, IT = 100 ms: 0.5736 lux/cnt.
cat in_illuminance_raw in_illuminance_input
40
21.005040 -> wrong! 40 * 0.5736 is 21.504.
Tested with a veml6035 and a veml7700, the same will happen with the
original veml6030, as the operation is identical for all devices.
---
drivers/iio/light/veml6030.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iio/light/veml6030.c b/drivers/iio/light/veml6030.c
index d6f3b104b0e6..a0bf03e37df7 100644
--- a/drivers/iio/light/veml6030.c
+++ b/drivers/iio/light/veml6030.c
@@ -691,7 +691,7 @@ static int veml6030_read_raw(struct iio_dev *indio_dev,
}
if (mask == IIO_CHAN_INFO_PROCESSED) {
*val = (reg * data->cur_resolution) / 10000;
- *val2 = (reg * data->cur_resolution) % 10000;
+ *val2 = (reg * data->cur_resolution) % 10000 * 100;
return IIO_VAL_INT_PLUS_MICRO;
}
*val = reg;
---
base-commit: 15e7d45e786a62a211dd0098fee7c57f84f8c681
change-id: 20241016-veml6030-fix-processed-micro-616d00d555dc
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Flush the g2h worker explicitly if TLB timeout happens which is
observed on LNL and that points to the recent scheduling issue with
E-cores on LNL.
This is similar to the recent fix:
commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
response timeout") and should be removed once there is E core
scheduling fix.
v2: Add platform check(Himal)
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
Cc: Badal Nilawar <badal.nilawar(a)intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: John Harrison <John.C.Harrison(a)Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.11+
Signed-off-by: Nirmoy Das <nirmoy.das(a)intel.com>
Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
Acked-by: Badal Nilawar <badal.nilawar(a)intel.com>
---
drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index 773de1f08db9..5aba6ed950b7 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -6,6 +6,7 @@
#include "xe_gt_tlb_invalidation.h"
#include "abi/guc_actions_abi.h"
+#include "compat-i915-headers/i915_drv.h"
#include "xe_device.h"
#include "xe_force_wake.h"
#include "xe_gt.h"
@@ -72,6 +73,16 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
struct xe_device *xe = gt_to_xe(gt);
struct xe_gt_tlb_invalidation_fence *fence, *next;
+ /*
+ * This is analogous to e51527233804 ("drm/xe/guc/ct: Flush g2h worker
+ * in case of g2h response timeout")
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is
+ * fixed on LNL Hybrid CPU.
+ */
+ if (IS_LUNARLAKE(xe))
+ flush_work(>->uc.guc.ct.g2h_worker);
+
spin_lock_irq(>->tlb_invalidation.pending_lock);
list_for_each_entry_safe(fence, next,
>->tlb_invalidation.pending_fences, link) {
--
2.46.0
In the original code, the device reset would not be triggered
when the driver is set to multi-master and bus is free.
It needs to be considered with multi-master condition.
Fixes: <f327c686d3ba> ("i2c: aspeed: Reset the i2c controller when timeout occurs")
Signed-off-by: Tommy Huang <tommy_huang(a)aspeedtech.com>
---
drivers/i2c/busses/i2c-aspeed.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/i2c/busses/i2c-aspeed.c b/drivers/i2c/busses/i2c-aspeed.c
index cc5a26637fd5..7639ae3ace67 100644
--- a/drivers/i2c/busses/i2c-aspeed.c
+++ b/drivers/i2c/busses/i2c-aspeed.c
@@ -716,14 +716,15 @@ static int aspeed_i2c_master_xfer(struct i2c_adapter *adap,
if (time_left == 0) {
/*
* In a multi-master setup, if a timeout occurs, attempt
- * recovery. But if the bus is idle, we still need to reset the
- * i2c controller to clear the remaining interrupts.
+ * recovery device. But if the bus is idle,
+ * we still need to reset the i2c controller to clear
+ * the remaining interrupts or reset device abnormal condition.
*/
- if (bus->multi_master &&
- (readl(bus->base + ASPEED_I2C_CMD_REG) &
- ASPEED_I2CD_BUS_BUSY_STS))
- aspeed_i2c_recover_bus(bus);
- else
+ if ((readl(bus->base + ASPEED_I2C_CMD_REG) &
+ ASPEED_I2CD_BUS_BUSY_STS)){
+ if (bus->multi_master)
+ aspeed_i2c_recover_bus(bus);
+ } else
aspeed_i2c_reset(bus);
/*
--
2.25.1
Hi there,
We are excited to offer you a comprehensive email list of school districts that includes key contact information such as phone numbers, email addresses, mailing addresses, company revenue, size, and web addresses. Our databases also cover related industries such as:
* K-12 schools
* Universities
* Vocational schools and training programs
* Performing arts schools
* Fitness centers and gyms
* Child care services and providers
* Educational publishers and suppliers
If you're interested, we would be happy to provide you with relevant counts and a test file based on your specific requirements.
Thank you for your time and consideration, and please let us know if you have any questions or concerns.
Thanks,
Ava Phillips
To remove from this mailing reply with the subject line " LEAVE US".
tpm2_sessions_init() does not ignore the result of
tpm2_create_null_primary(). Address this by returning -ENODEV to the
caller. Given that upper layers cannot help healing the situation
further, deal with the TPM error here by
Cc: stable(a)vger.kernel.org # v6.10+
Fixes: d2add27cf2b8 ("tpm: Add NULL primary creation")
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
---
v7:
- Add the error message back but fix it up a bit:
1. Remove 'TPM:' given dev_err().
2. s/NULL/null/ as this has nothing to do with the macro in libc.
3. Fix the reasoning: null key creation failed
v6:
- Address:
https://lore.kernel.org/linux-integrity/69c893e7-6b87-4daa-80db-44d1120e80f…
as TPM RC is taken care of at the call site. Add also the missing
documentation for the return values.
v5:
- Do not print klog messages on error, as tpm2_save_context() already
takes care of this.
v4:
- Fixed up stable version.
v3:
- Handle TPM and POSIX error separately and return -ENODEV always back
to the caller.
v2:
- Refined the commit message.
---
drivers/char/tpm/tpm2-sessions.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c
index d3521aadd43e..1e12e0b2492e 100644
--- a/drivers/char/tpm/tpm2-sessions.c
+++ b/drivers/char/tpm/tpm2-sessions.c
@@ -1347,14 +1347,21 @@ static int tpm2_create_null_primary(struct tpm_chip *chip)
*
* Derive and context save the null primary and allocate memory in the
* struct tpm_chip for the authorizations.
+ *
+ * Return:
+ * * 0 - OK
+ * * -errno - A system error
+ * * TPM_RC - A TPM error
*/
int tpm2_sessions_init(struct tpm_chip *chip)
{
int rc;
rc = tpm2_create_null_primary(chip);
- if (rc)
- dev_err(&chip->dev, "TPM: security failed (NULL seed derivation): %d\n", rc);
+ if (rc) {
+ dev_err(&chip->dev, "null primary key creation failed with %d\n", rc);
+ return rc;
+ }
chip->auth = kmalloc(sizeof(*chip->auth), GFP_KERNEL);
if (!chip->auth)
--
2.47.0
The existing code moves VF to the same namespace as the synthetic NIC
during netvsc_register_vf(). But, if the synthetic device is moved to a
new namespace after the VF registration, the VF won't be moved together.
To make the behavior more consistent, add a namespace check for synthetic
NIC's NETDEV_REGISTER event (generated during its move), and move the VF
if it is not in the same namespace.
Cc: stable(a)vger.kernel.org
Fixes: c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device")
Suggested-by: Stephen Hemminger <stephen(a)networkplumber.org>
Signed-off-by: Haiyang Zhang <haiyangz(a)microsoft.com>
Reviewed-by: Simon Horman <horms(a)kernel.org>
---
v3: Use RCT order as suggested by Simon.
v2: Move my fix to synthetic NIC's NETDEV_REGISTER event as suggested by Stephen.
---
drivers/net/hyperv/netvsc_drv.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 153b97f8ec0d..23180f7b67b6 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2798,6 +2798,31 @@ static struct hv_driver netvsc_drv = {
},
};
+/* Set VF's namespace same as the synthetic NIC */
+static void netvsc_event_set_vf_ns(struct net_device *ndev)
+{
+ struct net_device_context *ndev_ctx = netdev_priv(ndev);
+ struct net_device *vf_netdev;
+ int ret;
+
+ vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
+ if (!vf_netdev)
+ return;
+
+ if (!net_eq(dev_net(ndev), dev_net(vf_netdev))) {
+ ret = dev_change_net_namespace(vf_netdev, dev_net(ndev),
+ "eth%d");
+ if (ret)
+ netdev_err(vf_netdev,
+ "Cannot move to same namespace as %s: %d\n",
+ ndev->name, ret);
+ else
+ netdev_info(vf_netdev,
+ "Moved VF to namespace with: %s\n",
+ ndev->name);
+ }
+}
+
/*
* On Hyper-V, every VF interface is matched with a corresponding
* synthetic interface. The synthetic interface is presented first
@@ -2810,6 +2835,11 @@ static int netvsc_netdev_event(struct notifier_block *this,
struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
int ret = 0;
+ if (event_dev->netdev_ops == &device_ops && event == NETDEV_REGISTER) {
+ netvsc_event_set_vf_ns(event_dev);
+ return NOTIFY_DONE;
+ }
+
ret = check_dev_is_matching_vf(event_dev);
if (ret != 0)
return NOTIFY_DONE;
--
2.34.1
On Tue, Mar 12, 2024 at 07:04:10AM -0400, Eric Hagberg wrote:
> On Thu, Mar 7, 2024 at 11:33 AM Steve Wahl <steve.wahl(a)hpe.com> wrote:
> > What Linux Distribution are you running on that machine? My guess
> > would be that this is not distro related; if you are running something
> > quite different from Pavin that would confirm this.
>
> Distro in use is Rocky 8, so it’s pretty clear not to be distro-specific.
>
> > I found an AMD based system to try to reproduce this on.
>
> yeah, it probably requires either a specific cpu or set or devices plus cpu
> to trigger… found that it also affects Dell R7625 servers in addition to
> the R6615s
I agree that it's likely the CPU or particular set of surrounding
devices that trigger the problem.
I have not succeeded in reproducing the problem yet. I tried an AMD
based system lent to me, but it's probably the wrong generation (AMD
EPYC 7251) and I didn't see the problem. I have a line on a system
that's more in line with the systems the bug was reported on that I
should be able to try tomorrow.
I would love to have some direction from the community at large on
this. The fact that nogbpages on the command line causes the same
problem without my patch suggests it's not bad code directly in my
patch, but something in the way kexec reacts to the resulting identity
map. One quick solution would be a kernel command line parameter to
select between the previous identity map creation behavior and the new
behavior. E.g. in addition to "nogbpages", we could have
"somegbpages" and "allgbpages" -- or gbpages=[all, some, none] with
nogbpages a synonym for backwards compatibility.
But I don't want to introduce a new command line parameter if the
actual problem can be understood and fixed. The question is how much
time do I have to persue a direct fix before some other action needs
to be taken?
Thanks,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
When ata_qc_complete() schedules a command for EH using
ata_qc_schedule_eh(), blk_abort_request() will be called, which leads to
req->q->mq_ops->timeout() / scsi_timeout() being called.
scsi_timeout(), if the LLDD has no abort handler (libata has no abort
handler), will set host byte to DID_TIME_OUT, and then call
scsi_eh_scmd_add() to add the command to EH.
Thus, when commands first enter libata's EH strategy_handler, all the
commands that have been added to EH will have DID_TIME_OUT set.
Commit e5dd410acb34 ("ata: libata: Clear DID_TIME_OUT for ATA PT commands
with sense data") clears this bogus DID_TIME_OUT flag for all commands
that reached libata's EH strategy_handler.
libata has its own flag (AC_ERR_TIMEOUT), that it sets for commands that
have not received a completion at the time of entering EH.
ata_eh_worth_retry() has no special handling for AC_ERR_TIMEOUT, so by
default timed out commands will get flag ATA_QCFLAG_RETRY set and will be
retried after the port has been reset (ata_eh_link_autopsy() always
triggers a port reset if any command has AC_ERR_TIMEOUT set).
For commands that have ATA_QCFLAG_RETRY set, but also has an error flag
set (e.g. AC_ERR_TIMEOUT), ata_eh_finish() will not increment
scmd->allowed, so the command will at most be retried (scmd->allowed
number of times, which by default is set to 3).
However, scsi_eh_flush_done_q() will only retry commands for which
scsi_noretry_cmd() returns false.
For commands that has DID_TIME_OUT set, if the command is either
has FAILFAST or if the command is a passthrough command, scsi_noretry_cmd()
will return true. Thus, such commands will never be retried.
Thus, make sure that libata sets SCSI's DID_TIME_OUT flag for commands that
actually timed out (libata's AC_ERR_TIMEOUT flag), such that timed out
commands will once again not be retried if they are also a FAILFAST or
passthrough command.
Cc: stable(a)vger.kernel.org
Fixes: e5dd410acb34 ("ata: libata: Clear DID_TIME_OUT for ATA PT commands with sense data")
Reported-by: Lai, Yi <yi1.lai(a)linux.intel.com>
Closes: https://lore.kernel.org/linux-ide/ZxYz871I3Blsi30F@ly-workstation/
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
---
drivers/ata/libata-eh.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index fa41ea57a978..3b303d4ae37a 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -651,6 +651,7 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
/* the scmd has an associated qc */
if (!(qc->flags & ATA_QCFLAG_EH)) {
/* which hasn't failed yet, timeout */
+ set_host_byte(scmd, DID_TIME_OUT);
qc->err_mask |= AC_ERR_TIMEOUT;
qc->flags |= ATA_QCFLAG_EH;
nr_timedout++;
--
2.47.0
The patch titled
Subject: mm/page_alloc: fix NUMA stats update for cpu-less nodes
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-page_alloc-fix-numa-stats-update-for-cpu-less-nodes.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Dongjoo Seo <dongjoo.linux.dev(a)gmail.com>
Subject: mm/page_alloc: fix NUMA stats update for cpu-less nodes
Date: Wed, 23 Oct 2024 10:50:37 -0700
In the case of memoryless node, when a process prefers a node with no
memory(e.g., because it is running on a CPU local to that node), the
kernel treats a nearby node with memory as the preferred node. As a
result, such allocations do not increment the numa_foreign counter on the
memoryless node, leading to skewed NUMA_HIT, NUMA_MISS, and NUMA_FOREIGN
stats for the nearest node.
This patch corrects this issue by:
1. Checking if the zone or preferred zone is CPU-less before updating
the NUMA stats.
2. Ensuring NUMA_HIT is only updated if the zone is not CPU-less.
3. Ensuring NUMA_FOREIGN is only updated if the preferred zone is not
CPU-less.
Example Before and After Patch:
- Before Patch:
node0 node1 node2
numa_hit 86333181 114338269 5108
numa_miss 5199455 0 56844591
numa_foreign 32281033 29763013 0
interleave_hit 91 91 0
local_node 86326417 114288458 0
other_node 5206219 49768 56849702
- After Patch:
node0 node1 node2
numa_hit 2523058 9225528 0
numa_miss 150213 10226 21495942
numa_foreign 17144215 4501270 0
interleave_hit 91 94 0
local_node 2493918 9208226 0
other_node 179351 27528 21495942
Similarly, in the context of cpuless nodes, this patch ensures that NUMA
statistics are accurately updated by adding checks to prevent the
miscounting of memory allocations when the involved nodes have no CPUs.
This ensures more precise tracking of memory access patterns accross all
nodes, regardless of whether they have CPUs or not, improving the overall
reliability of NUMA stat. The reason is that page allocation from
dev_dax, cpuset, memcg .. comes with preferred allocating zone in cpuless
node and its hard to track the zone info for miss information.
Link: https://lkml.kernel.org/r/20241023175037.9125-1-dongjoo.linux.dev@gmail.com
Signed-off-by: Dongjoo Seo <dongjoo.linux.dev(a)gmail.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Fan Ni <nifan(a)outlook.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Adam Manzanares <a.manzanares(a)samsung.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-fix-numa-stats-update-for-cpu-less-nodes
+++ a/mm/page_alloc.c
@@ -2858,19 +2858,21 @@ static inline void zone_statistics(struc
{
#ifdef CONFIG_NUMA
enum numa_stat_item local_stat = NUMA_LOCAL;
+ bool z_is_cpuless = !node_state(zone_to_nid(z), N_CPU);
+ bool pref_is_cpuless = !node_state(zone_to_nid(preferred_zone), N_CPU);
- /* skip numa counters update if numa stats is disabled */
if (!static_branch_likely(&vm_numa_stat_key))
return;
- if (zone_to_nid(z) != numa_node_id())
+ if (zone_to_nid(z) != numa_node_id() || z_is_cpuless)
local_stat = NUMA_OTHER;
- if (zone_to_nid(z) == zone_to_nid(preferred_zone))
+ if (zone_to_nid(z) == zone_to_nid(preferred_zone) && !z_is_cpuless)
__count_numa_events(z, NUMA_HIT, nr_account);
else {
__count_numa_events(z, NUMA_MISS, nr_account);
- __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
+ if (!pref_is_cpuless)
+ __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
}
__count_numa_events(z, local_stat, nr_account);
#endif
_
Patches currently in -mm which might be from dongjoo.linux.dev(a)gmail.com are
mm-page_alloc-fix-numa-stats-update-for-cpu-less-nodes.patch
On 22. 10. 24, 19:54, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> tty/serial: Make ->dcd_change()+uart_handle_dcd_change() status bool active
This is a cleanup, not needed in stable. (Unless something
context-depends on it.)
--
js
suse labs
The quilt patch titled
Subject: Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM"
has been removed from the -mm tree. Its filename was
revert-selftests-mm-fix-deadlock-for-fork-after-pthread_create-on-arm.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Edward Liaw <edliaw(a)google.com>
Subject: Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM"
Date: Fri, 18 Oct 2024 17:17:22 +0000
Patch series "selftests/mm: revert pthread_barrier change"
On Android arm, pthread_create followed by a fork caused a deadlock in
the case where the fork required work to be completed by the created
thread.
The previous patches incorrectly assumed that the parent would
always initialize the pthread_barrier for the child thread. This
reverts the change and replaces the fix for wp-fork-with-event with the
original use of atomic_bool.
This patch (of 3):
This reverts commit e142cc87ac4ec618f2ccf5f68aedcd6e28a59d9d.
fork_event_consumer may be called by other tests that do not initialize
the pthread_barrier, so this approach is not correct. The subsequent
patch will revert to using atomic_bool instead.
Link: https://lkml.kernel.org/r/20241018171734.2315053-1-edliaw@google.com
Link: https://lkml.kernel.org/r/20241018171734.2315053-2-edliaw@google.com
Fixes: e142cc87ac4e ("fix deadlock for fork after pthread_create on ARM")
Signed-off-by: Edward Liaw <edliaw(a)google.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/uffd-unit-tests.c | 7 -------
1 file changed, 7 deletions(-)
--- a/tools/testing/selftests/mm/uffd-unit-tests.c~revert-selftests-mm-fix-deadlock-for-fork-after-pthread_create-on-arm
+++ a/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -241,9 +241,6 @@ static void *fork_event_consumer(void *d
fork_event_args *args = data;
struct uffd_msg msg = { 0 };
- /* Ready for parent thread to fork */
- pthread_barrier_wait(&ready_for_fork);
-
/* Read until a full msg received */
while (uffd_read_msg(args->parent_uffd, &msg));
@@ -311,12 +308,8 @@ static int pagemap_test_fork(int uffd, b
/* Prepare a thread to resolve EVENT_FORK */
if (with_event) {
- pthread_barrier_init(&ready_for_fork, NULL, 2);
if (pthread_create(&thread, NULL, fork_event_consumer, &args))
err("pthread_create()");
- /* Wait for child thread to start before forking */
- pthread_barrier_wait(&ready_for_fork);
- pthread_barrier_destroy(&ready_for_fork);
}
child = fork();
_
Patches currently in -mm which might be from edliaw(a)google.com are
The quilt patch titled
Subject: nilfs2: fix kernel bug due to missing clearing of checked flag
has been removed from the -mm tree. Its filename was
nilfs2-fix-kernel-bug-due-to-missing-clearing-of-checked-flag.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix kernel bug due to missing clearing of checked flag
Date: Fri, 18 Oct 2024 04:33:10 +0900
Syzbot reported that in directory operations after nilfs2 detects
filesystem corruption and degrades to read-only,
__block_write_begin_int(), which is called to prepare block writes, may
fail the BUG_ON check for accesses exceeding the folio/page size,
triggering a kernel bug.
This was found to be because the "checked" flag of a page/folio was not
cleared when it was discarded by nilfs2's own routine, which causes the
sanity check of directory entries to be skipped when the directory
page/folio is reloaded. So, fix that.
This was necessary when the use of nilfs2's own page discard routine was
applied to more than just metadata files.
Link: https://lkml.kernel.org/r/20241017193359.5051-1-konishi.ryusuke@gmail.com
Fixes: 8c26c4e2694a ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Reported-by: syzbot+d6ca2daf692c7a82f959(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d6ca2daf692c7a82f959
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/page.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/nilfs2/page.c~nilfs2-fix-kernel-bug-due-to-missing-clearing-of-checked-flag
+++ a/fs/nilfs2/page.c
@@ -400,6 +400,7 @@ void nilfs_clear_folio_dirty(struct foli
folio_clear_uptodate(folio);
folio_clear_mappedtodisk(folio);
+ folio_clear_checked(folio);
head = folio_buffers(folio);
if (head) {
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
nilfs2-fix-potential-deadlock-with-newly-created-symlinks.patch
The quilt patch titled
Subject: ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow
has been removed from the -mm tree. Its filename was
ocfs2-pass-u64-to-ocfs2_truncate_inline-maybe-overflow.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Edward Adam Davis <eadavis(a)qq.com>
Subject: ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow
Date: Wed, 16 Oct 2024 19:43:47 +0800
Syzbot reported a kernel BUG in ocfs2_truncate_inline. There are two
reasons for this: first, the parameter value passed is greater than
ocfs2_max_inline_data_with_xattr, second, the start and end parameters of
ocfs2_truncate_inline are "unsigned int".
So, we need to add a sanity check for byte_start and byte_len right before
ocfs2_truncate_inline() in ocfs2_remove_inode_range(), if they are greater
than ocfs2_max_inline_data_with_xattr return -EINVAL.
Link: https://lkml.kernel.org/r/tencent_D48DB5122ADDAEDDD11918CFB68D93258C07@qq.c…
Fixes: 1afc32b95233 ("ocfs2: Write support for inline data")
Signed-off-by: Edward Adam Davis <eadavis(a)qq.com>
Reported-by: syzbot+81092778aac03460d6b7(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=81092778aac03460d6b7
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Gang He <ghe(a)suse.com>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/file.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/fs/ocfs2/file.c~ocfs2-pass-u64-to-ocfs2_truncate_inline-maybe-overflow
+++ a/fs/ocfs2/file.c
@@ -1784,6 +1784,14 @@ int ocfs2_remove_inode_range(struct inod
return 0;
if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+ int id_count = ocfs2_max_inline_data_with_xattr(inode->i_sb, di);
+
+ if (byte_start > id_count || byte_start + byte_len > id_count) {
+ ret = -EINVAL;
+ mlog_errno(ret);
+ goto out;
+ }
+
ret = ocfs2_truncate_inline(inode, di_bh, byte_start,
byte_start + byte_len, 0);
if (ret) {
_
Patches currently in -mm which might be from eadavis(a)qq.com are
The quilt patch titled
Subject: mm: shmem: fix data-race in shmem_getattr()
has been removed from the -mm tree. Its filename was
mm-shmem-fix-data-race-in-shmem_getattr.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jeongjun Park <aha310510(a)gmail.com>
Subject: mm: shmem: fix data-race in shmem_getattr()
Date: Mon, 9 Sep 2024 21:35:58 +0900
I got the following KCSAN report during syzbot testing:
==================================================================
BUG: KCSAN: data-race in generic_fillattr / inode_set_ctime_current
write to 0xffff888102eb3260 of 4 bytes by task 6565 on cpu 1:
inode_set_ctime_to_ts include/linux/fs.h:1638 [inline]
inode_set_ctime_current+0x169/0x1d0 fs/inode.c:2626
shmem_mknod+0x117/0x180 mm/shmem.c:3443
shmem_create+0x34/0x40 mm/shmem.c:3497
lookup_open fs/namei.c:3578 [inline]
open_last_lookups fs/namei.c:3647 [inline]
path_openat+0xdbc/0x1f00 fs/namei.c:3883
do_filp_open+0xf7/0x200 fs/namei.c:3913
do_sys_openat2+0xab/0x120 fs/open.c:1416
do_sys_open fs/open.c:1431 [inline]
__do_sys_openat fs/open.c:1447 [inline]
__se_sys_openat fs/open.c:1442 [inline]
__x64_sys_openat+0xf3/0x120 fs/open.c:1442
x64_sys_call+0x1025/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:258
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x76/0x7e
read to 0xffff888102eb3260 of 4 bytes by task 3498 on cpu 0:
inode_get_ctime_nsec include/linux/fs.h:1623 [inline]
inode_get_ctime include/linux/fs.h:1629 [inline]
generic_fillattr+0x1dd/0x2f0 fs/stat.c:62
shmem_getattr+0x17b/0x200 mm/shmem.c:1157
vfs_getattr_nosec fs/stat.c:166 [inline]
vfs_getattr+0x19b/0x1e0 fs/stat.c:207
vfs_statx_path fs/stat.c:251 [inline]
vfs_statx+0x134/0x2f0 fs/stat.c:315
vfs_fstatat+0xec/0x110 fs/stat.c:341
__do_sys_newfstatat fs/stat.c:505 [inline]
__se_sys_newfstatat+0x58/0x260 fs/stat.c:499
__x64_sys_newfstatat+0x55/0x70 fs/stat.c:499
x64_sys_call+0x141f/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:263
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x76/0x7e
value changed: 0x2755ae53 -> 0x27ee44d3
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 3498 Comm: udevd Not tainted 6.11.0-rc6-syzkaller-00326-gd1f2d51b711a-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
==================================================================
When calling generic_fillattr(), if you don't hold read lock, data-race
will occur in inode member variables, which can cause unexpected
behavior.
Since there is no special protection when shmem_getattr() calls
generic_fillattr(), data-race occurs by functions such as shmem_unlink()
or shmem_mknod(). This can cause unexpected results, so commenting it out
is not enough.
Therefore, when calling generic_fillattr() from shmem_getattr(), it is
appropriate to protect the inode using inode_lock_shared() and
inode_unlock_shared() to prevent data-race.
Link: https://lkml.kernel.org/r/20240909123558.70229-1-aha310510@gmail.com
Fixes: 44a30220bc0a ("shmem: recalculate file inode when fstat")
Signed-off-by: Jeongjun Park <aha310510(a)gmail.com>
Reported-by: syzbot <syzkaller(a)googlegroup.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/shmem.c~mm-shmem-fix-data-race-in-shmem_getattr
+++ a/mm/shmem.c
@@ -1166,7 +1166,9 @@ static int shmem_getattr(struct mnt_idma
stat->attributes_mask |= (STATX_ATTR_APPEND |
STATX_ATTR_IMMUTABLE |
STATX_ATTR_NODUMP);
+ inode_lock_shared(inode);
generic_fillattr(idmap, request_mask, inode, stat);
+ inode_unlock_shared(inode);
if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
stat->blksize = HPAGE_PMD_SIZE;
_
Patches currently in -mm which might be from aha310510(a)gmail.com are
The quilt patch titled
Subject: x86/traps: move kmsan check after instrumentation_begin
has been removed from the -mm tree. Its filename was
x86-traps-move-kmsan-check-after-instrumentation_begin.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Sabyrzhan Tasbolatov <snovitoll(a)gmail.com>
Subject: x86/traps: move kmsan check after instrumentation_begin
Date: Wed, 16 Oct 2024 20:24:07 +0500
During x86_64 kernel build with CONFIG_KMSAN, the objtool warns following:
AR built-in.a
AR vmlinux.a
LD vmlinux.o
vmlinux.o: warning: objtool: handle_bug+0x4: call to
kmsan_unpoison_entry_regs() leaves .noinstr.text section
OBJCOPY modules.builtin.modinfo
GEN modules.builtin
MODPOST Module.symvers
CC .vmlinux.export.o
Moving kmsan_unpoison_entry_regs() _after_ instrumentation_begin() fixes
the warning.
There is decode_bug(regs->ip, &imm) is left before KMSAN unpoisoining, but
it has the return condition and if we include it after
instrumentation_begin() it results the warning "return with
instrumentation enabled", hence, I'm concerned that regs will not be KMSAN
unpoisoned if `ud_type == BUG_NONE` is true.
Link: https://lkml.kernel.org/r/20241016152407.3149001-1-snovitoll@gmail.com
Fixes: ba54d194f8da ("x86/traps: avoid KMSAN bugs originating from handle_bug()")
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll(a)gmail.com>
Reviewed-by: Alexander Potapenko <glider(a)google.com>
Cc: Borislav Petkov (AMD) <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/kernel/traps.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- a/arch/x86/kernel/traps.c~x86-traps-move-kmsan-check-after-instrumentation_begin
+++ a/arch/x86/kernel/traps.c
@@ -261,12 +261,6 @@ static noinstr bool handle_bug(struct pt
int ud_type;
u32 imm;
- /*
- * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug()
- * is a rare case that uses @regs without passing them to
- * irqentry_enter().
- */
- kmsan_unpoison_entry_regs(regs);
ud_type = decode_bug(regs->ip, &imm);
if (ud_type == BUG_NONE)
return handled;
@@ -276,6 +270,12 @@ static noinstr bool handle_bug(struct pt
*/
instrumentation_begin();
/*
+ * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug()
+ * is a rare case that uses @regs without passing them to
+ * irqentry_enter().
+ */
+ kmsan_unpoison_entry_regs(regs);
+ /*
* Since we're emulating a CALL with exceptions, restore the interrupt
* state to what it was at the exception site.
*/
_
Patches currently in -mm which might be from snovitoll(a)gmail.com are
mm-kasan-kmsan-copy_from-to_kernel_nofault.patch
kasan-move-checks-to-do_strncpy_from_user.patch
kasan-migrate-copy_user_test-to-kunit.patch
kasan-delete-config_kasan_module_test.patch
kasan-delete-config_kasan_module_test-fix.patch
The quilt patch titled
Subject: mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves
has been removed from the -mm tree. Its filename was
mm-page_alloc-let-gfp_atomic-order-0-allocs-access-highatomic-reserves.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Matt Fleming <mfleming(a)cloudflare.com>
Subject: mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves
Date: Fri, 11 Oct 2024 13:07:37 +0100
Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to
fail even though free pages are available in the highatomic reserves.
GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock()
since it's only run from reclaim.
Given that such allocations will pass the watermarks in
__zone_watermark_unusable_free(), it makes sense to fallback to highatomic
reserves the same way that ALLOC_OOM can.
This fixes order-0 page allocation failures observed on Cloudflare's fleet
when handling network packets:
kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC),
nodemask=(null),cpuset=/,mems_allowed=0-7
CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1
Hardware name: MACHINE
Call Trace:
<IRQ>
dump_stack_lvl+0x3c/0x50
warn_alloc+0x13a/0x1c0
__alloc_pages_slowpath.constprop.0+0xc9d/0xd10
__alloc_pages+0x327/0x340
__napi_alloc_skb+0x16d/0x1f0
bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en]
bnxt_rx_pkt+0x201/0x15e0 [bnxt_en]
__bnxt_poll_work+0x156/0x2b0 [bnxt_en]
bnxt_poll+0xd9/0x1c0 [bnxt_en]
__napi_poll+0x2b/0x1b0
bpf_trampoline_6442524138+0x7d/0x1000
__napi_poll+0x5/0x1b0
net_rx_action+0x342/0x740
handle_softirqs+0xcf/0x2b0
irq_exit_rcu+0x6c/0x90
sysvec_apic_timer_interrupt+0x72/0x90
</IRQ>
[mfleming(a)cloudflare.com: update comment]
Link: https://lkml.kernel.org/r/20241015125158.3597702-1-matt@readmodwrite.com
Link: https://lkml.kernel.org/r/20241011120737.3300370-1-matt@readmodwrite.com
Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytz…
Fixes: 1d91df85f399 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs")
Signed-off-by: Matt Fleming <mfleming(a)cloudflare.com>
Suggested-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-let-gfp_atomic-order-0-allocs-access-highatomic-reserves
+++ a/mm/page_alloc.c
@@ -2893,12 +2893,12 @@ struct page *rmqueue_buddy(struct zone *
page = __rmqueue(zone, order, migratetype, alloc_flags);
/*
- * If the allocation fails, allow OOM handling access
- * to HIGHATOMIC reserves as failing now is worse than
- * failing a high-order atomic allocation in the
- * future.
+ * If the allocation fails, allow OOM handling and
+ * order-0 (atomic) allocs access to HIGHATOMIC
+ * reserves as failing now is worse than failing a
+ * high-order atomic allocation in the future.
*/
- if (!page && (alloc_flags & ALLOC_OOM))
+ if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
_
Patches currently in -mm which might be from mfleming(a)cloudflare.com are
The quilt patch titled
Subject: fork: only invoke khugepaged, ksm hooks if no error
has been removed from the -mm tree. Its filename was
fork-only-invoke-khugepaged-ksm-hooks-if-no-error.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: fork: only invoke khugepaged, ksm hooks if no error
Date: Tue, 15 Oct 2024 18:56:06 +0100
There is no reason to invoke these hooks early against an mm that is in an
incomplete state.
The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.
Their placement early in dup_mmap() only appears to have been meaningful
for early error checking, and since functionally it'd require a very small
allocation to fail (in practice 'too small to fail') that'd only occur in
the most dire circumstances, meaning the fork would fail or be OOM'd in
any case.
Since both khugepaged and KSM tracking are there to provide optimisations
to memory performance rather than critical functionality, it doesn't
really matter all that much if, under such dire memory pressure, we fail
to register an mm with these.
As a result, we follow the example of commit d2081b2bf819 ("mm:
khugepaged: make khugepaged_enter() void function") and make ksm_fork() a
void function also.
We only expose the mm to these functions once we are done with them and
only if no error occurred in the fork operation.
Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.17290143…
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Jann Horn <jannh(a)google.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Linus Torvalds <torvalds(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/ksm.h | 10 ++++------
kernel/fork.c | 7 ++-----
2 files changed, 6 insertions(+), 11 deletions(-)
--- a/include/linux/ksm.h~fork-only-invoke-khugepaged-ksm-hooks-if-no-error
+++ a/include/linux/ksm.h
@@ -54,12 +54,11 @@ static inline long mm_ksm_zero_pages(str
return atomic_long_read(&mm->ksm_zero_pages);
}
-static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+static inline void ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
+ /* Adding mm to ksm is best effort on fork. */
if (test_bit(MMF_VM_MERGEABLE, &oldmm->flags))
- return __ksm_enter(mm);
-
- return 0;
+ __ksm_enter(mm);
}
static inline int ksm_execve(struct mm_struct *mm)
@@ -107,9 +106,8 @@ static inline int ksm_disable(struct mm_
return 0;
}
-static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+static inline void ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
- return 0;
}
static inline int ksm_execve(struct mm_struct *mm)
--- a/kernel/fork.c~fork-only-invoke-khugepaged-ksm-hooks-if-no-error
+++ a/kernel/fork.c
@@ -653,11 +653,6 @@ static __latent_entropy int dup_mmap(str
mm->exec_vm = oldmm->exec_vm;
mm->stack_vm = oldmm->stack_vm;
- retval = ksm_fork(mm, oldmm);
- if (retval)
- goto out;
- khugepaged_fork(mm, oldmm);
-
/* Use __mt_dup() to efficiently build an identical maple tree. */
retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
if (unlikely(retval))
@@ -760,6 +755,8 @@ loop_out:
vma_iter_free(&vmi);
if (!retval) {
mt_set_in_rcu(vmi.mas.tree);
+ ksm_fork(mm, oldmm);
+ khugepaged_fork(mm, oldmm);
} else if (mpnt) {
/*
* The entire maple tree has already been duplicated. If the
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems-fix.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
mm-pagewalk-add-the-ability-to-install-ptes.patch
mm-add-pte_marker_guard-pte-marker.patch
mm-madvise-implement-lightweight-guard-page-mechanism.patch
tools-testing-update-tools-uapi-header-for-mman-commonh.patch
selftests-mm-add-self-tests-for-guard-page-feature.patch
The quilt patch titled
Subject: fork: do not invoke uffd on fork if error occurs
has been removed from the -mm tree. Its filename was
fork-do-not-invoke-uffd-on-fork-if-error-occurs.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: fork: do not invoke uffd on fork if error occurs
Date: Tue, 15 Oct 2024 18:56:05 +0100
Patch series "fork: do not expose incomplete mm on fork".
During fork we may place the virtual memory address space into an
inconsistent state before the fork operation is complete.
In addition, we may encounter an error during the fork operation that
indicates that the virtual memory address space is invalidated.
As a result, we should not be exposing it in any way to external machinery
that might interact with the mm or VMAs, machinery that is not designed to
deal with incomplete state.
We specifically update the fork logic to defer khugepaged and ksm to the
end of the operation and only to be invoked if no error arose, and
disallow uffd from observing fork events should an error have occurred.
This patch (of 2):
Currently on fork we expose the virtual address space of a process to
userland unconditionally if uffd is registered in VMAs, regardless of
whether an error arose in the fork.
This is performed in dup_userfaultfd_complete() which is invoked
unconditionally, and performs two duties - invoking registered handlers
for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
userfaultfd_fork_ctx objects established in dup_userfaultfd().
This is problematic, because the virtual address space may not yet be
correctly initialised if an error arose.
The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.
We address this by, on fork error, ensuring that we roll back state that
we would otherwise expect to clean up through the event being handled by
userland and perform the memory freeing duty otherwise performed by
dup_userfaultfd_complete().
We do this by implementing a new function, dup_userfaultfd_fail(), which
performs the same loop, only decrementing reference counts.
Note that we perform mmgrab() on the parent and child mm's, however
userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
zero, so we will avoid memory leaks correctly here.
Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.17290143…
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Linus Torvalds <torvalds(a)linuxfoundation.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/userfaultfd.c | 28 ++++++++++++++++++++++++++++
include/linux/userfaultfd_k.h | 5 +++++
kernel/fork.c | 5 ++++-
3 files changed, 37 insertions(+), 1 deletion(-)
--- a/fs/userfaultfd.c~fork-do-not-invoke-uffd-on-fork-if-error-occurs
+++ a/fs/userfaultfd.c
@@ -692,6 +692,34 @@ void dup_userfaultfd_complete(struct lis
}
}
+void dup_userfaultfd_fail(struct list_head *fcs)
+{
+ struct userfaultfd_fork_ctx *fctx, *n;
+
+ /*
+ * An error has occurred on fork, we will tear memory down, but have
+ * allocated memory for fctx's and raised reference counts for both the
+ * original and child contexts (and on the mm for each as a result).
+ *
+ * These would ordinarily be taken care of by a user handling the event,
+ * but we are no longer doing so, so manually clean up here.
+ *
+ * mm tear down will take care of cleaning up VMA contexts.
+ */
+ list_for_each_entry_safe(fctx, n, fcs, list) {
+ struct userfaultfd_ctx *octx = fctx->orig;
+ struct userfaultfd_ctx *ctx = fctx->new;
+
+ atomic_dec(&octx->mmap_changing);
+ VM_BUG_ON(atomic_read(&octx->mmap_changing) < 0);
+ userfaultfd_ctx_put(octx);
+ userfaultfd_ctx_put(ctx);
+
+ list_del(&fctx->list);
+ kfree(fctx);
+ }
+}
+
void mremap_userfaultfd_prep(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx *vm_ctx)
{
--- a/include/linux/userfaultfd_k.h~fork-do-not-invoke-uffd-on-fork-if-error-occurs
+++ a/include/linux/userfaultfd_k.h
@@ -249,6 +249,7 @@ static inline bool vma_can_userfault(str
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);
+void dup_userfaultfd_fail(struct list_head *);
extern void mremap_userfaultfd_prep(struct vm_area_struct *,
struct vm_userfaultfd_ctx *);
@@ -351,6 +352,10 @@ static inline void dup_userfaultfd_compl
{
}
+static inline void dup_userfaultfd_fail(struct list_head *l)
+{
+}
+
static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx *ctx)
{
--- a/kernel/fork.c~fork-do-not-invoke-uffd-on-fork-if-error-occurs
+++ a/kernel/fork.c
@@ -775,7 +775,10 @@ out:
mmap_write_unlock(mm);
flush_tlb_mm(oldmm);
mmap_write_unlock(oldmm);
- dup_userfaultfd_complete(&uf);
+ if (!retval)
+ dup_userfaultfd_complete(&uf);
+ else
+ dup_userfaultfd_fail(&uf);
fail_uprobe_end:
uprobe_end_dup_mmap();
return retval;
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems-fix.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
mm-pagewalk-add-the-ability-to-install-ptes.patch
mm-add-pte_marker_guard-pte-marker.patch
mm-madvise-implement-lightweight-guard-page-mechanism.patch
tools-testing-update-tools-uapi-header-for-mman-commonh.patch
selftests-mm-add-self-tests-for-guard-page-feature.patch
From: Gustavo Sousa <gustavo.sousa(a)intel.com>
commit 9fc97277eb2d17492de636b68cf7d2f5c4f15c1b upstream.
Starting with Xe_LPD+, although FIA is still used to readout Type-C pin
assignment, part of Type-C support is moved to PICA and programming
PORT_TX_DFLEXDPMLE1(*) registers is not applicable anymore like it was
for previous display IPs (e.g. see BSpec 49190).
v2:
- Mention Bspec 49190 as a reference of instructions for previous
IPs. (Shekhar Chauhan)
- s/Xe_LPDP/Xe_LPD+/ in the commit message. (Matt Roper)
- Update commit message to be more accurate to the changes in the IP.
(Imre Deak)
Bspec: 65750, 65448
Reviewed-by: Shekhar Chauhan <shekhar.chauhan(a)intel.com>
Reviewed-by: Imre Deak <imre.deak(a)intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240625202652.315936-1-gusta…
Signed-off-by: Gustavo Sousa <gustavo.sousa(a)intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
drivers/gpu/drm/i915/display/intel_tc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/display/intel_tc.c b/drivers/gpu/drm/i915/display/intel_tc.c
index 9887967b2ca5c..6f2ee7dbc43b3 100644
--- a/drivers/gpu/drm/i915/display/intel_tc.c
+++ b/drivers/gpu/drm/i915/display/intel_tc.c
@@ -393,6 +393,9 @@ void intel_tc_port_set_fia_lane_count(struct intel_digital_port *dig_port,
bool lane_reversal = dig_port->saved_port_bits & DDI_BUF_PORT_REVERSAL;
u32 val;
+ if (DISPLAY_VER(i915) >= 14)
+ return;
+
drm_WARN_ON(&i915->drm,
lane_reversal && tc->mode != TC_PORT_LEGACY);
--
2.47.0
The patch titled
Subject: lib: string_helpers: fix potential snprintf() output truncation
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
lib-string_helpers-fix-potential-snprintf-output-truncation.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
Subject: lib: string_helpers: fix potential snprintf() output truncation
Date: Mon, 21 Oct 2024 11:14:17 +0200
The output of ".%03u" with the unsigned int in range [0, 4294966295] may
get truncated if the target buffer is not 12 bytes.
Link: https://lkml.kernel.org/r/20241021091417.37796-1-brgl@bgdev.pl
Fixes: 3c9f3681d0b4 ("[SCSI] lib: add generic helper to print sizes rounded to the correct SI range")
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
Reviewed-by: Andy Shevchenko <andy(a)kernel.org>
Cc: James E.J. Bottomley <James.Bottomley(a)HansenPartnership.com>
Cc: Kees Cook <kees(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/string_helpers.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/lib/string_helpers.c~lib-string_helpers-fix-potential-snprintf-output-truncation
+++ a/lib/string_helpers.c
@@ -57,7 +57,7 @@ int string_get_size(u64 size, u64 blk_si
static const unsigned int rounding[] = { 500, 50, 5 };
int i = 0, j;
u32 remainder = 0, sf_cap;
- char tmp[8];
+ char tmp[12];
const char *unit;
tmp[0] = '\0';
_
Patches currently in -mm which might be from bartosz.golaszewski(a)linaro.org are
lib-string_helpers-fix-potential-snprintf-output-truncation.patch
The patch titled
Subject: mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-multi-gen-lru-remove-mm_leaf_old-and-mm_nonleaf_total-stats.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
Date: Sat, 19 Oct 2024 01:29:38 +0000
Patch series "mm: multi-gen LRU: Have secondary MMUs participate in
MM_WALK".
Today, the MM_WALK capability causes MGLRU to clear the young bit from
PMDs and PTEs during the page table walk before eviction, but MGLRU does
not call the clear_young() MMU notifier in this case. By not calling this
notifier, the MM walk takes less time/CPU, but it causes pages that are
accessed mostly through KVM / secondary MMUs to appear younger than they
should be.
We do call the clear_young() notifier today, but only when attempting to
evict the page, so we end up clearing young/accessed information less
frequently for secondary MMUs than for mm PTEs, and therefore they appear
younger and are less likely to be evicted. Therefore, memory that is
*not* being accessed mostly by KVM will be evicted *more* frequently,
worsening performance.
ChromeOS observed a tab-open latency regression when enabling MGLRU with a
setup that involved running a VM:
Tab-open latency histogram (ms)
Version p50 mean p95 p99 max
base 1315 1198 2347 3454 10319
mglru 2559 1311 7399 12060 43758
fix 1119 926 2470 4211 6947
This series replaces the final non-selftest patchs from this series[1],
which introduced a similar change (and a new MMU notifier) with KVM
optimizations. I'll send a separate series (to Sean and Paolo) for the
KVM optimizations.
This series also makes proactive reclaim with MGLRU possible for KVM
memory. I have verified that this functions correctly with the selftest
from [1], but given that that test is a KVM selftest, I'll send it with
the rest of the KVM optimizations later. Andrew, let me know if you'd
like to take the test now anyway.
[1]: https://lore.kernel.org/linux-mm/20240926013506.860253-18-jthoughton@google…
This patch (of 2):
The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very helpful
and become more complicated to properly compute when adding
test/clear_young() notifiers in MGLRU's mm walk.
Link: https://lkml.kernel.org/r/20241019012940.3656292-1-jthoughton@google.com
Link: https://lkml.kernel.org/r/20241019012940.3656292-2-jthoughton@google.com
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Signed-off-by: James Houghton <jthoughton(a)google.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: David Matlack <dmatlack(a)google.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: David Stevens <stevensd(a)google.com>
Cc: Oliver Upton <oliver.upton(a)linux.dev>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Wei Xu <weixugc(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmzone.h | 2 --
mm/vmscan.c | 14 +++++---------
2 files changed, 5 insertions(+), 11 deletions(-)
--- a/include/linux/mmzone.h~mm-multi-gen-lru-remove-mm_leaf_old-and-mm_nonleaf_total-stats
+++ a/include/linux/mmzone.h
@@ -458,9 +458,7 @@ struct lru_gen_folio {
enum {
MM_LEAF_TOTAL, /* total leaf entries */
- MM_LEAF_OLD, /* old leaf entries */
MM_LEAF_YOUNG, /* young leaf entries */
- MM_NONLEAF_TOTAL, /* total non-leaf entries */
MM_NONLEAF_FOUND, /* non-leaf entries found in Bloom filters */
MM_NONLEAF_ADDED, /* non-leaf entries added to Bloom filters */
NR_MM_STATS
--- a/mm/vmscan.c~mm-multi-gen-lru-remove-mm_leaf_old-and-mm_nonleaf_total-stats
+++ a/mm/vmscan.c
@@ -3399,7 +3399,6 @@ restart:
continue;
if (!pte_young(ptent)) {
- walk->mm_stats[MM_LEAF_OLD]++;
continue;
}
@@ -3552,7 +3551,6 @@ restart:
walk->mm_stats[MM_LEAF_TOTAL]++;
if (!pmd_young(val)) {
- walk->mm_stats[MM_LEAF_OLD]++;
continue;
}
@@ -3564,8 +3562,6 @@ restart:
continue;
}
- walk->mm_stats[MM_NONLEAF_TOTAL]++;
-
if (!walk->force_scan && should_clear_pmd_young()) {
if (!pmd_young(val))
continue;
@@ -5254,11 +5250,11 @@ static void lru_gen_seq_show_full(struct
for (tier = 0; tier < MAX_NR_TIERS; tier++) {
seq_printf(m, " %10d", tier);
for (type = 0; type < ANON_AND_FILE; type++) {
- const char *s = " ";
+ const char *s = "xxx";
unsigned long n[3] = {};
if (seq == max_seq) {
- s = "RT ";
+ s = "RTx";
n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
} else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
@@ -5280,14 +5276,14 @@ static void lru_gen_seq_show_full(struct
seq_puts(m, " ");
for (i = 0; i < NR_MM_STATS; i++) {
- const char *s = " ";
+ const char *s = "xxxx";
unsigned long n = 0;
if (seq == max_seq && NR_HIST_GENS == 1) {
- s = "LOYNFA";
+ s = "TYFA";
n = READ_ONCE(mm_state->stats[hist][i]);
} else if (seq != max_seq && NR_HIST_GENS > 1) {
- s = "loynfa";
+ s = "tyfa";
n = READ_ONCE(mm_state->stats[hist][i]);
}
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-allow-set-clear-page_type-again.patch
mm-multi-gen-lru-remove-mm_leaf_old-and-mm_nonleaf_total-stats.patch
mm-multi-gen-lru-use-pteppmdp_clear_young_notify.patch
The patch titled
Subject: mm: allow set/clear page_type again
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-allow-set-clear-page_type-again.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm: allow set/clear page_type again
Date: Sat, 19 Oct 2024 22:22:12 -0600
Some page flags (page->flags) were converted to page types
(page->page_types). A recent example is PG_hugetlb.
From the exclusive writer's perspective, e.g., a thread doing
__folio_set_hugetlb(), there is a difference between the page flag and
type APIs: the former allows the same non-atomic operation to be repeated
whereas the latter does not. For example, calling __folio_set_hugetlb()
twice triggers VM_BUG_ON_FOLIO(), since the second call expects the type
(PG_hugetlb) not to be set previously.
Using add_hugetlb_folio() as an example, it calls __folio_set_hugetlb() in
the following error-handling path. And when that happens, it triggers the
aforementioned VM_BUG_ON_FOLIO().
if (folio_test_hugetlb(folio)) {
rc = hugetlb_vmemmap_restore_folio(h, folio);
if (rc) {
spin_lock_irq(&hugetlb_lock);
add_hugetlb_folio(h, folio, false);
...
It is possible to make hugeTLB comply with the new requirements from the
page type API. However, a straightforward fix would be to just allow the
same page type to be set or cleared again inside the API, to avoid any
changes to its callers.
Link: https://lkml.kernel.org/r/20241020042212.296781-1-yuzhao@google.com
Fixes: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/page-flags.h | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/include/linux/page-flags.h~mm-allow-set-clear-page_type-again
+++ a/include/linux/page-flags.h
@@ -975,12 +975,16 @@ static __always_inline bool folio_test_#
} \
static __always_inline void __folio_set_##fname(struct folio *folio) \
{ \
+ if (folio_test_##fname(folio)) \
+ return; \
VM_BUG_ON_FOLIO(data_race(folio->page.page_type) != UINT_MAX, \
folio); \
folio->page.page_type = (unsigned int)PGTY_##lname << 24; \
} \
static __always_inline void __folio_clear_##fname(struct folio *folio) \
{ \
+ if (folio->page.page_type == UINT_MAX) \
+ return; \
VM_BUG_ON_FOLIO(!folio_test_##fname(folio), folio); \
folio->page.page_type = UINT_MAX; \
}
@@ -993,11 +997,15 @@ static __always_inline int Page##uname(c
} \
static __always_inline void __SetPage##uname(struct page *page) \
{ \
+ if (Page##uname(page)) \
+ return; \
VM_BUG_ON_PAGE(data_race(page->page_type) != UINT_MAX, page); \
page->page_type = (unsigned int)PGTY_##lname << 24; \
} \
static __always_inline void __ClearPage##uname(struct page *page) \
{ \
+ if (page->page_type == UINT_MAX) \
+ return; \
VM_BUG_ON_PAGE(!Page##uname(page), page); \
page->page_type = UINT_MAX; \
}
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-allow-set-clear-page_type-again.patch
The patch titled
Subject: nilfs2: fix potential deadlock with newly created symlinks
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
nilfs2-fix-potential-deadlock-with-newly-created-symlinks.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix potential deadlock with newly created symlinks
Date: Sun, 20 Oct 2024 13:51:28 +0900
Syzbot reported that page_symlink(), called by nilfs_symlink(), triggers
memory reclamation involving the filesystem layer, which can result in
circular lock dependencies among the reader/writer semaphore
nilfs->ns_segctor_sem, s_writers percpu_rwsem (intwrite) and the
fs_reclaim pseudo lock.
This is because after commit 21fc61c73c39 ("don't put symlink bodies in
pagecache into highmem"), the gfp flags of the page cache for symbolic
links are overwritten to GFP_KERNEL via inode_nohighmem().
This is not a problem for symlinks read from the backing device, because
the __GFP_FS flag is dropped after inode_nohighmem() is called. However,
when a new symlink is created with nilfs_symlink(), the gfp flags remain
overwritten to GFP_KERNEL. Then, memory allocation called from
page_symlink() etc. triggers memory reclamation including the FS layer,
which may call nilfs_evict_inode() or nilfs_dirty_inode(). And these can
cause a deadlock if they are called while nilfs->ns_segctor_sem is held:
Fix this issue by dropping the __GFP_FS flag from the page cache GFP flags
of newly created symlinks in the same way that nilfs_new_inode() and
__nilfs_read_inode() do, as a workaround until we adopt nofs allocation
scope consistently or improve the locking constraints.
Link: https://lkml.kernel.org/r/20241020050003.4308-1-konishi.ryusuke@gmail.com
Fixes: 21fc61c73c39 ("don't put symlink bodies in pagecache into highmem")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Reported-by: syzbot+9ef37ac20608f4836256(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9ef37ac20608f4836256
Tested-by: syzbot+9ef37ac20608f4836256(a)syzkaller.appspotmail.com
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/namei.c | 3 +++
1 file changed, 3 insertions(+)
--- a/fs/nilfs2/namei.c~nilfs2-fix-potential-deadlock-with-newly-created-symlinks
+++ a/fs/nilfs2/namei.c
@@ -157,6 +157,9 @@ static int nilfs_symlink(struct mnt_idma
/* slow symlink */
inode->i_op = &nilfs_symlink_inode_operations;
inode_nohighmem(inode);
+ mapping_set_gfp_mask(inode->i_mapping,
+ mapping_gfp_constraint(inode->i_mapping,
+ ~__GFP_FS));
inode->i_mapping->a_ops = &nilfs_aops;
err = page_symlink(inode, symname, l);
if (err)
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
nilfs2-fix-kernel-bug-due-to-missing-clearing-of-checked-flag.patch
nilfs2-fix-potential-deadlock-with-newly-created-symlinks.patch
The patch titled
Subject: mm/damon/vaddr: fix issue in damon_va_evenly_split_region()
has been added to the -mm mm-unstable branch. Its filename is
mm-damon-vaddr-fix-issue-in-damon_va_evenly_split_region.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Zheng Yejian <zhengyejian(a)huaweicloud.com>
Subject: mm/damon/vaddr: fix issue in damon_va_evenly_split_region()
Date: Tue, 22 Oct 2024 16:39:26 +0800
Patch series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()". v2.
According to the logic of damon_va_evenly_split_region(), currently
following split case would not meet the expectation:
Suppose DAMON_MIN_REGION=0x1000,
Case: Split [0x0, 0x3000) into 2 pieces, then the result would be
acutually 3 regions:
[0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000)
but NOT the expected 2 regions:
[0x0, 0x1000), [0x1000, 0x3000) !!!
The root cause is that when calculating size of each split piece in
damon_va_evenly_split_region():
`sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);`
both the dividing and the ALIGN_DOWN may cause loss of precision, then
each time split one piece of size 'sz_piece' from origin 'start' to 'end'
would cause more pieces are split out than expected!!!
To fix it, count for each piece split and make sure no more than
'nr_pieces'. In addition, add above case into damon_test_split_evenly().
And add 'nr_piece == 1' check in damon_va_evenly_split_region() for better
code readability and add a corresponding kunit testcase.
This patch (of 2):
According to the logic of damon_va_evenly_split_region(), currently
following split case would not meet the expectation:
Suppose DAMON_MIN_REGION=0x1000,
Case: Split [0x0, 0x3000) into 2 pieces, then the result would be
acutually 3 regions:
[0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000)
but NOT the expected 2 regions:
[0x0, 0x1000), [0x1000, 0x3000) !!!
The root cause is that when calculating size of each split piece in
damon_va_evenly_split_region():
`sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);`
both the dividing and the ALIGN_DOWN may cause loss of precision,
then each time split one piece of size 'sz_piece' from origin 'start' to
'end' would cause more pieces are split out than expected!!!
To fix it, count for each piece split and make sure no more than
'nr_pieces'. In addition, add above case into damon_test_split_evenly().
After this patch, damon-operations test passed:
# ./tools/testing/kunit/kunit.py run damon-operations
[...]
============== damon-operations (6 subtests) ===============
[PASSED] damon_test_three_regions_in_vmas
[PASSED] damon_test_apply_three_regions1
[PASSED] damon_test_apply_three_regions2
[PASSED] damon_test_apply_three_regions3
[PASSED] damon_test_apply_three_regions4
[PASSED] damon_test_split_evenly
================ [PASSED] damon-operations =================
Link: https://lkml.kernel.org/r/20241022083927.3592237-1-zhengyejian@huaweicloud.…
Link: https://lkml.kernel.org/r/20241022083927.3592237-2-zhengyejian@huaweicloud.…
Fixes: 3f49584b262c ("mm/damon: implement primitives for the virtual memory address spaces")
Signed-off-by: Zheng Yejian <zhengyejian(a)huaweicloud.com>
Reviewed-by: SeongJae Park <sj(a)kernel.org>
Cc: Fernand Sieber <sieberf(a)amazon.com>
Cc: Leonard Foerster <foersleo(a)amazon.de>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Ye Weihua <yeweihua4(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/tests/vaddr-kunit.h | 1 +
mm/damon/vaddr.c | 4 ++--
2 files changed, 3 insertions(+), 2 deletions(-)
--- a/mm/damon/tests/vaddr-kunit.h~mm-damon-vaddr-fix-issue-in-damon_va_evenly_split_region
+++ a/mm/damon/tests/vaddr-kunit.h
@@ -300,6 +300,7 @@ static void damon_test_split_evenly(stru
damon_test_split_evenly_fail(test, 0, 100, 0);
damon_test_split_evenly_succ(test, 0, 100, 10);
damon_test_split_evenly_succ(test, 5, 59, 5);
+ damon_test_split_evenly_succ(test, 0, 3, 2);
damon_test_split_evenly_fail(test, 5, 6, 2);
}
--- a/mm/damon/vaddr.c~mm-damon-vaddr-fix-issue-in-damon_va_evenly_split_region
+++ a/mm/damon/vaddr.c
@@ -67,6 +67,7 @@ static int damon_va_evenly_split_region(
unsigned long sz_orig, sz_piece, orig_end;
struct damon_region *n = NULL, *next;
unsigned long start;
+ unsigned int i;
if (!r || !nr_pieces)
return -EINVAL;
@@ -80,8 +81,7 @@ static int damon_va_evenly_split_region(
r->ar.end = r->ar.start + sz_piece;
next = damon_next_region(r);
- for (start = r->ar.end; start + sz_piece <= orig_end;
- start += sz_piece) {
+ for (start = r->ar.end, i = 1; i < nr_pieces; start += sz_piece, i++) {
n = damon_new_region(start, start + sz_piece);
if (!n)
return -ENOMEM;
_
Patches currently in -mm which might be from zhengyejian(a)huaweicloud.com are
mm-damon-vaddr-fix-issue-in-damon_va_evenly_split_region.patch
mm-damon-vaddr-add-nr_piece-==-1-check-in-damon_va_evenly_split_region.patch
From: Peter Wang <peter.wang(a)mediatek.com>
When ufshcd_rtc_work calls ufshcd_rpm_put_sync and the pm's
usage_count is 0, it will enter the runtime suspend callback.
However, the runtime suspend callback will wait to flush
ufshcd_rtc_work, causing a deadlock.
Replacing ufshcd_rpm_put_sync with ufshcd_rpm_put can avoid
the deadlock.
Fixes: 6bf999e0eb41 ("scsi: ufs: core: Add UFS RTC support")
Cc: <stable(a)vger.kernel.org> #6.11.x
Signed-off-by: Peter Wang <peter.wang(a)mediatek.com>
Reviewed-by: Bart Van Assche <bvanassche(a)acm.org>
---
drivers/ufs/core/ufshcd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index a63dcf48e59d..f5846598d80e 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -8219,7 +8219,7 @@ static void ufshcd_update_rtc(struct ufs_hba *hba)
err = ufshcd_query_attr(hba, UPIU_QUERY_OPCODE_WRITE_ATTR, QUERY_ATTR_IDN_SECONDS_PASSED,
0, 0, &val);
- ufshcd_rpm_put_sync(hba);
+ ufshcd_rpm_put(hba);
if (err)
dev_err(hba->dev, "%s: Failed to update rtc %d\n", __func__, err);
--
2.18.0
From: Peter Wang <peter.wang(a)mediatek.com>
When ufshcd_rtc_work calls ufshcd_rpm_put_sync and the pm's
usage_count is 0, it will enter the runtime suspend callback.
However, the runtime suspend callback will wait to flush
ufshcd_rtc_work, causing a deadlock.
Replacing ufshcd_rpm_put_sync with ufshcd_rpm_put can avoid
the deadlock.
Fixes: 6bf999e0eb41 ("scsi: ufs: core: Add UFS RTC support")
Cc: <stable(a)vger.kernel.org> 6.11.x
Signed-off-by: Peter Wang <peter.wang(a)mediatek.com>
---
drivers/ufs/core/ufshcd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index a63dcf48e59d..f5846598d80e 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -8219,7 +8219,7 @@ static void ufshcd_update_rtc(struct ufs_hba *hba)
err = ufshcd_query_attr(hba, UPIU_QUERY_OPCODE_WRITE_ATTR, QUERY_ATTR_IDN_SECONDS_PASSED,
0, 0, &val);
- ufshcd_rpm_put_sync(hba);
+ ufshcd_rpm_put(hba);
if (err)
dev_err(hba->dev, "%s: Failed to update rtc %d\n", __func__, err);
--
2.18.0
The patch titled
Subject: kasan: remove vmalloc_percpu test
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
kasan-remove-vmalloc_percpu-test.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Andrey Konovalov <andreyknvl(a)gmail.com>
Subject: kasan: remove vmalloc_percpu test
Date: Tue, 22 Oct 2024 18:07:06 +0200
Commit 1a2473f0cbc0 ("kasan: improve vmalloc tests") added the
vmalloc_percpu KASAN test with the assumption that __alloc_percpu always
uses vmalloc internally, which is tagged by KASAN.
However, __alloc_percpu might allocate memory from the first per-CPU
chunk, which is not allocated via vmalloc(). As a result, the test might
fail.
Remove the test until proper KASAN annotation for the per-CPU allocated
are added; tracked in https://bugzilla.kernel.org/show_bug.cgi?id=215019.
Link: https://lkml.kernel.org/r/20241022160706.38943-1-andrey.konovalov@linux.dev
Fixes: 1a2473f0cbc0 ("kasan: improve vmalloc tests")
Signed-off-by: Andrey Konovalov <andreyknvl(a)gmail.com>
Reported-by: Samuel Holland <samuel.holland(a)sifive.com>
Link: https://lore.kernel.org/all/4a245fff-cc46-44d1-a5f9-fd2f1c3764ae@sifive.com/
Reported-by: Sabyrzhan Tasbolatov <snovitoll(a)gmail.com>
Link: https://lore.kernel.org/all/CACzwLxiWzNqPBp4C1VkaXZ2wDwvY3yZeetCi1TLGFipKW7…
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Sabyrzhan Tasbolatov <snovitoll(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/kasan_test_c.c | 27 ---------------------------
1 file changed, 27 deletions(-)
--- a/mm/kasan/kasan_test_c.c~kasan-remove-vmalloc_percpu-test
+++ a/mm/kasan/kasan_test_c.c
@@ -1810,32 +1810,6 @@ static void vm_map_ram_tags(struct kunit
free_pages((unsigned long)p_ptr, 1);
}
-static void vmalloc_percpu(struct kunit *test)
-{
- char __percpu *ptr;
- int cpu;
-
- /*
- * This test is specifically crafted for the software tag-based mode,
- * the only tag-based mode that poisons percpu mappings.
- */
- KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_SW_TAGS);
-
- ptr = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
-
- for_each_possible_cpu(cpu) {
- char *c_ptr = per_cpu_ptr(ptr, cpu);
-
- KUNIT_EXPECT_GE(test, (u8)get_tag(c_ptr), (u8)KASAN_TAG_MIN);
- KUNIT_EXPECT_LT(test, (u8)get_tag(c_ptr), (u8)KASAN_TAG_KERNEL);
-
- /* Make sure that in-bounds accesses don't crash the kernel. */
- *c_ptr = 0;
- }
-
- free_percpu(ptr);
-}
-
/*
* Check that the assigned pointer tag falls within the [KASAN_TAG_MIN,
* KASAN_TAG_KERNEL) range (note: excluding the match-all tag) for tag-based
@@ -2023,7 +1997,6 @@ static struct kunit_case kasan_kunit_tes
KUNIT_CASE(vmalloc_oob),
KUNIT_CASE(vmap_tags),
KUNIT_CASE(vm_map_ram_tags),
- KUNIT_CASE(vmalloc_percpu),
KUNIT_CASE(match_all_not_assigned),
KUNIT_CASE(match_all_ptr_tag),
KUNIT_CASE(match_all_mem_tag),
_
Patches currently in -mm which might be from andreyknvl(a)gmail.com are
kasan-remove-vmalloc_percpu-test.patch
linux-stable-mirror(a)lists.linaro.org
`A8Y7D �0C
gE5w0B�1A�C7LinkedInSD1�01~D9`A8v84�A2S55�F7l42
N70[B6YD3T0D
e36NF6NBAu35[50�AENF6
IPW30W40
e70�CF
NA7TC1cCF�F0:
N70[B6VFD[B6
Andre Torres
sales.oiuy(a)outlook.com
78.54.***.***
***pcs
General goods,mol...
Spain
zCBS73VDEY0Dm88`6F
� 2024 LinkedIn Corporation, 1000 West Maude Avenue, Sunnyvale, CA 94085.
LinkedIn and the LinkedIn logo are registered trademarks of LinkedIn.
The patch titled
Subject: tools/mm: -Werror fixes in page-types/slabinfo
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
tools-mm-werror-fixes-in-page-types-slabinfo.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Wladislav Wiebe <wladislav.kw(a)gmail.com>
Subject: tools/mm: -Werror fixes in page-types/slabinfo
Date: Tue, 22 Oct 2024 19:21:13 +0200
Commit e6d2c436ff693 ("tools/mm: allow users to provide additional
cflags/ldflags") passes now CFLAGS to Makefile. With this, build systems
with default -Werror enabled found:
slabinfo.c:1300:25: error: ignoring return value of 'chdir'
declared with attribute 'warn_unused_result' [-Werror=unused-result]
������������������������������������������������ chdir("..");
������������������������������������������������ ^~~~~~~~~~~
page-types.c:397:35: error: format '%lu' expects argument of type
'long unsigned int', but argument 2 has type 'uint64_t'
{aka 'long long unsigned int'} [-Werror=format=]
������������������������������������������������ printf("%lu\t", mapcnt0);
���������������������������������������������������������������� ~~^�������� ~~~~~~~
..
Fix page-types by using PRIu64 for uint64_t prints and check in slabinfo
for return code on chdir("..").
Link: https://lkml.kernel.org/r/c1ceb507-94bc-461c-934d-c19b77edd825@gmail.com
Fixes: e6d2c436ff693 ("tools/mm: allow users to provide additional cflags/ldflags")
Signed-off-by: Wladislav Wiebe <wladislav.kw(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Herton R. Krzesinski <herton(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/mm/page-types.c | 9 +++++----
tools/mm/slabinfo.c | 4 +++-
2 files changed, 8 insertions(+), 5 deletions(-)
--- a/tools/mm/page-types.c~tools-mm-werror-fixes-in-page-types-slabinfo
+++ a/tools/mm/page-types.c
@@ -22,6 +22,7 @@
#include <time.h>
#include <setjmp.h>
#include <signal.h>
+#include <inttypes.h>
#include <sys/types.h>
#include <sys/errno.h>
#include <sys/fcntl.h>
@@ -391,9 +392,9 @@ static void show_page_range(unsigned lon
if (opt_file)
printf("%lx\t", voff);
if (opt_list_cgroup)
- printf("@%llu\t", (unsigned long long)cgroup0);
+ printf("@%" PRIu64 "\t", cgroup0);
if (opt_list_mapcnt)
- printf("%lu\t", mapcnt0);
+ printf("%" PRIu64 "\t", mapcnt0);
printf("%lx\t%lx\t%s\n",
index, count, page_flag_name(flags0));
}
@@ -419,9 +420,9 @@ static void show_page(unsigned long voff
if (opt_file)
printf("%lx\t", voffset);
if (opt_list_cgroup)
- printf("@%llu\t", (unsigned long long)cgroup);
+ printf("@%" PRIu64 "\t", cgroup)
if (opt_list_mapcnt)
- printf("%lu\t", mapcnt);
+ printf("%" PRIu64 "\t", mapcnt);
printf("%lx\t%s\n", offset, page_flag_name(flags));
}
--- a/tools/mm/slabinfo.c~tools-mm-werror-fixes-in-page-types-slabinfo
+++ a/tools/mm/slabinfo.c
@@ -1297,7 +1297,9 @@ static void read_slab_dir(void)
slab->cpu_partial_free = get_obj("cpu_partial_free");
slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
- chdir("..");
+ if (chdir(".."))
+ fatal("Unable to chdir from slab ../%s\n",
+ slab->name);
if (slab->name[0] == ':')
alias_targets++;
slab++;
_
Patches currently in -mm which might be from wladislav.kw(a)gmail.com are
tools-mm-werror-fixes-in-page-types-slabinfo.patch
The patch titled
Subject: ipc: fix memleak if msg_init_ns failed in create_ipc_ns
has been added to the -mm mm-nonmm-unstable branch. Its filename is
ipc-fix-memleak-if-msg_init_ns-failed-in-create_ipc_ns.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-nonmm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Wupeng Ma <mawupeng1(a)huawei.com>
Subject: ipc: fix memleak if msg_init_ns failed in create_ipc_ns
Date: Wed, 23 Oct 2024 17:31:29 +0800
Percpu memory allocation may failed during create_ipc_ns however this
fail is not handled properly since ipc sysctls and mq sysctls is not
released properly. Fix this by release these two resource when failure.
Here is the kmemleak stack when percpu failed:
unreferenced object 0xffff88819de2a600 (size 512):
comm "shmem_2nstest", pid 120711, jiffies 4300542254
hex dump (first 32 bytes):
60 aa 9d 84 ff ff ff ff fc 18 48 b2 84 88 ff ff `.........H.....
04 00 00 00 a4 01 00 00 20 e4 56 81 ff ff ff ff ........ .V.....
backtrace (crc be7cba35):
[<ffffffff81b43f83>] __kmalloc_node_track_caller_noprof+0x333/0x420
[<ffffffff81a52e56>] kmemdup_noprof+0x26/0x50
[<ffffffff821b2f37>] setup_mq_sysctls+0x57/0x1d0
[<ffffffff821b29cc>] copy_ipcs+0x29c/0x3b0
[<ffffffff815d6a10>] create_new_namespaces+0x1d0/0x920
[<ffffffff815d7449>] copy_namespaces+0x2e9/0x3e0
[<ffffffff815458f3>] copy_process+0x29f3/0x7ff0
[<ffffffff8154b080>] kernel_clone+0xc0/0x650
[<ffffffff8154b6b1>] __do_sys_clone+0xa1/0xe0
[<ffffffff843df8ff>] do_syscall_64+0xbf/0x1c0
[<ffffffff846000b0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
Link: https://lkml.kernel.org/r/20241023093129.3074301-1-mawupeng1@huawei.com
Fixes: 72d1e611082e ("ipc/msg: mitigate the lock contention with percpu counter")
Signed-off-by: Ma Wupeng <mawupeng1(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
ipc/namespace.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/ipc/namespace.c~ipc-fix-memleak-if-msg_init_ns-failed-in-create_ipc_ns
+++ a/ipc/namespace.c
@@ -83,13 +83,15 @@ static struct ipc_namespace *create_ipc_
err = msg_init_ns(ns);
if (err)
- goto fail_put;
+ goto fail_ipc;
sem_init_ns(ns);
shm_init_ns(ns);
return ns;
+fail_ipc:
+ retire_ipc_sysctls(ns);
fail_mq:
retire_mq_sysctls(ns);
_
Patches currently in -mm which might be from mawupeng1(a)huawei.com are
ipc-fix-memleak-if-msg_init_ns-failed-in-create_ipc_ns.patch
Hi,
I have tested 6.11.4 with these additional fixes for the xe and i915
drivers:
f0ffa657e9f3913c7921cbd4d876343401f15f52
4551d60299b5ddc2655b6b365a4b92634e14e04f
ecabb5e6ce54711c28706fc794d77adb3ecd0605
2009e808bc3e0df6d4d83e2271bc25ae63a4ac05
e4ac526c440af8aa94d2bdfe6066339dd93b4db2
ab0d6ef864c5fa820e894ee1a07f861e63851664
7929ffce0f8b9c76cb5c2a67d1966beaed20ab61
da9a73b7b25eab574cb9c984fcce0b5e240bdd2c
014125c64d09e58e90dde49fbb57d802a13e2559
4cce34b3835b6f7dc52ee2da95c96b6364bb72e5
a8efd8ce280996fe29f2564f705e96e18da3fa62
f15e5587448989a55cf8b4feaad0df72ca3aa6a0
a9556637a23311dea96f27fa3c3e5bfba0b38ae4
c7085d08c7e53d9aef0cdd4b20798356f6f5d469
eb53e5b933b9ff315087305b3dc931af3067d19c
3e307d6c28e7bc7d94b5699d0ed7fe07df6db094
d34f4f058edf1235c103ca9c921dc54820d14d40
31b42af516afa1e184d1a9f9dd4096c54044269a
7fbad577c82c5dd6db7217855c26f51554e53d85
b2013783c4458a1fe8b25c0b249d2e878bcf6999
c55f79f317ab428ae6d005965bc07e37496f209f
9fc97277eb2d17492de636b68cf7d2f5c4f15c1b
I have them applied locally and could submit that if preferred, but
there were no conflicts (since it also brings some additional patches as
required for fixes to apply), so it should be trivial.
All of these patches are already in upstream. Some of them are brought
as dependency. The ones mentioning "performance changes" are knobs to
follow the hw spec and could be considered as fixes too. These patches
are also enabled downstream in Ubuntu 24.10 in order to allow the new
Lunar Lake and Battlemage to work correctly. They have more patches not
included here, but I may follow up with more depending on the acceptance
of these patches.
thanks
Lucas De Marchi
The patch titled
Subject: mm: resolve faulty mmap_region() error path behaviour
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: resolve faulty mmap_region() error path behaviour
Date: Wed, 23 Oct 2024 21:38:29 +0100
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().
Doing this allows us to perform a number of checks up front before we do
any real work, and allows us to unwind the writable unmap check
unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
validation unconditionally also.
We move a number of things here:
1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.
2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.
We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
doing so does not invalidate the seal check we just performed, and we in
any case always decrement the counter in the wrapper.
We perform a debug assert to ensure a driver does not attempt to do the
opposite.
3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.
With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.
This eliminates an entire class of errors, makes the code easier to reason
about and more robust.
Link: https://lkml.kernel.org/r/6e8deda970b982e1e8ffd876e3cef342c292fbb5.17297152…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 121 ++++++++++++++++++++++++++++------------------------
1 file changed, 66 insertions(+), 55 deletions(-)
--- a/mm/mmap.c~mm-resolve-faulty-mmap_region-error-path-behaviour
+++ a/mm/mmap.c
@@ -1357,20 +1357,18 @@ int do_munmap(struct mm_struct *mm, unsi
return do_vmi_munmap(&vmi, mm, start, len, uf, false);
}
-unsigned long mmap_region(struct file *file, unsigned long addr,
+static unsigned long __mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
pgoff_t pglen = PHYS_PFN(len);
- struct vm_area_struct *merge;
unsigned long charged = 0;
struct vma_munmap_struct vms;
struct ma_state mas_detach;
struct maple_tree mt_detach;
unsigned long end = addr + len;
- bool writable_file_mapping = false;
int error;
VMA_ITERATOR(vmi, mm, addr);
VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
@@ -1444,28 +1442,26 @@ unsigned long mmap_region(struct file *f
vm_flags_init(vma, vm_flags);
vma->vm_page_prot = vm_get_page_prot(vm_flags);
+ if (vma_iter_prealloc(&vmi, vma)) {
+ error = -ENOMEM;
+ goto free_vma;
+ }
+
if (file) {
vma->vm_file = get_file(file);
error = mmap_file(file, vma);
if (error)
- goto unmap_and_free_vma;
-
- if (vma_is_shared_maywrite(vma)) {
- error = mapping_map_writable(file->f_mapping);
- if (error)
- goto close_and_free_vma;
-
- writable_file_mapping = true;
- }
+ goto unmap_and_free_file_vma;
+ /* Drivers cannot alter the address of the VMA. */
+ WARN_ON_ONCE(addr != vma->vm_start);
/*
- * Expansion is handled above, merging is handled below.
- * Drivers should not alter the address of the VMA.
+ * Drivers should not permit writability when previously it was
+ * disallowed.
*/
- if (WARN_ON((addr != vma->vm_start))) {
- error = -EINVAL;
- goto close_and_free_vma;
- }
+ VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
+ !(vm_flags & VM_MAYWRITE) &&
+ (vma->vm_flags & VM_MAYWRITE));
vma_iter_config(&vmi, addr, end);
/*
@@ -1473,6 +1469,8 @@ unsigned long mmap_region(struct file *f
* vma again as we may succeed this time.
*/
if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
+ struct vm_area_struct *merge;
+
vmg.flags = vma->vm_flags;
/* If this fails, state is reset ready for a reattempt. */
merge = vma_merge_new_range(&vmg);
@@ -1490,7 +1488,7 @@ unsigned long mmap_region(struct file *f
vma = merge;
/* Update vm_flags to pick up the change. */
vm_flags = vma->vm_flags;
- goto unmap_writable;
+ goto file_expanded;
}
vma_iter_config(&vmi, addr, end);
}
@@ -1499,26 +1497,15 @@ unsigned long mmap_region(struct file *f
} else if (vm_flags & VM_SHARED) {
error = shmem_zero_setup(vma);
if (error)
- goto free_vma;
+ goto free_iter_vma;
} else {
vma_set_anonymous(vma);
}
- if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
- error = -EACCES;
- goto close_and_free_vma;
- }
-
- /* Allow architectures to sanity-check the vm_flags */
- if (!arch_validate_flags(vma->vm_flags)) {
- error = -EINVAL;
- goto close_and_free_vma;
- }
-
- if (vma_iter_prealloc(&vmi, vma)) {
- error = -ENOMEM;
- goto close_and_free_vma;
- }
+#ifdef CONFIG_SPARC64
+ /* TODO: Fix SPARC ADI! */
+ WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+#endif
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
@@ -1532,10 +1519,7 @@ unsigned long mmap_region(struct file *f
*/
khugepaged_enter_vma(vma, vma->vm_flags);
- /* Once vma denies write, undo our temporary denial count */
-unmap_writable:
- if (writable_file_mapping)
- mapping_unmap_writable(file->f_mapping);
+file_expanded:
file = vma->vm_file;
ksm_add_vma(vma);
expanded:
@@ -1568,23 +1552,17 @@ expanded:
vma_set_page_prot(vma);
- validate_mm(mm);
return addr;
-close_and_free_vma:
- vma_close(vma);
-
- if (file || vma->vm_file) {
-unmap_and_free_vma:
- fput(vma->vm_file);
- vma->vm_file = NULL;
-
- vma_iter_set(&vmi, vma->vm_end);
- /* Undo any partial mapping done by a device driver. */
- unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
- }
- if (writable_file_mapping)
- mapping_unmap_writable(file->f_mapping);
+unmap_and_free_file_vma:
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+
+ vma_iter_set(&vmi, vma->vm_end);
+ /* Undo any partial mapping done by a device driver. */
+ unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+free_iter_vma:
+ vma_iter_free(&vmi);
free_vma:
vm_area_free(vma);
unacct_error:
@@ -1594,10 +1572,43 @@ unacct_error:
abort_munmap:
vms_abort_munmap_vmas(&vms, &mas_detach);
gather_failed:
- validate_mm(mm);
return error;
}
+unsigned long mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf)
+{
+ unsigned long ret;
+ bool writable_file_mapping = false;
+
+ /* Check to see if MDWE is applicable. */
+ if (map_deny_write_exec(vm_flags, vm_flags))
+ return -EACCES;
+
+ /* Allow architectures to sanity-check the vm_flags. */
+ if (!arch_validate_flags(vm_flags))
+ return -EINVAL;
+
+ /* Map writable and ensure this isn't a sealed memfd. */
+ if (file && is_shared_maywrite(vm_flags)) {
+ int error = mapping_map_writable(file->f_mapping);
+
+ if (error)
+ return error;
+ writable_file_mapping = true;
+ }
+
+ ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+
+ /* Clear our write mapping regardless of error. */
+ if (writable_file_mapping)
+ mapping_unmap_writable(file->f_mapping);
+
+ validate_mm(current->mm);
+ return ret;
+}
+
static int __vm_munmap(unsigned long start, size_t len, bool unlock)
{
int ret;
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
fork-do-not-invoke-uffd-on-fork-if-error-occurs.patch
fork-only-invoke-khugepaged-ksm-hooks-if-no-error.patch
mm-vma-add-expand-only-vma-merge-mode-and-optimise-do_brk_flags.patch
tools-testing-add-expand-only-mode-vma-test.patch
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
The patch titled
Subject: mm: refactor map_deny_write_exec()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-refactor-map_deny_write_exec.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: refactor map_deny_write_exec()
Date: Wed, 23 Oct 2024 21:38:28 +0100
Refactor the map_deny_write_exec() to not unnecessarily require a VMA
parameter but rather to accept VMA flags parameters, which allows us to
use this function early in mmap_region() in a subsequent commit.
While we're here, we refactor the function to be more readable and add
some additional documentation.
Link: https://lkml.kernel.org/r/e82d23cebaa87fe5b767fecf9e730e0059066e63.17297152…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Jann Horn <jannh(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mman.h | 21 ++++++++++++++++++---
mm/mmap.c | 2 +-
mm/mprotect.c | 2 +-
mm/vma.h | 2 +-
4 files changed, 21 insertions(+), 6 deletions(-)
--- a/include/linux/mman.h~mm-refactor-map_deny_write_exec
+++ a/include/linux/mman.h
@@ -188,16 +188,31 @@ static inline bool arch_memory_deny_writ
*
* d) mmap(PROT_READ | PROT_EXEC)
* mmap(PROT_READ | PROT_EXEC | PROT_BTI)
+ *
+ * This is only applicable if the user has set the Memory-Deny-Write-Execute
+ * (MDWE) protection mask for the current process.
+ *
+ * @old specifies the VMA flags the VMA originally possessed, and @new the ones
+ * we propose to set.
+ *
+ * Return: false if proposed change is OK, true if not ok and should be denied.
*/
-static inline bool map_deny_write_exec(struct vm_area_struct *vma, unsigned long vm_flags)
+static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
{
+ /* If MDWE is disabled, we have nothing to deny. */
if (!test_bit(MMF_HAS_MDWE, ¤t->mm->flags))
return false;
- if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
+ /* If the new VMA is not executable, we have nothing to deny. */
+ if (!(new & VM_EXEC))
+ return false;
+
+ /* Under MDWE we do not accept newly writably executable VMAs... */
+ if (new & VM_WRITE)
return true;
- if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
+ /* ...nor previously non-executable VMAs becoming executable. */
+ if (!(old & VM_EXEC))
return true;
return false;
--- a/mm/mmap.c~mm-refactor-map_deny_write_exec
+++ a/mm/mmap.c
@@ -1504,7 +1504,7 @@ unsigned long mmap_region(struct file *f
vma_set_anonymous(vma);
}
- if (map_deny_write_exec(vma, vma->vm_flags)) {
+ if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
error = -EACCES;
goto close_and_free_vma;
}
--- a/mm/mprotect.c~mm-refactor-map_deny_write_exec
+++ a/mm/mprotect.c
@@ -810,7 +810,7 @@ static int do_mprotect_pkey(unsigned lon
break;
}
- if (map_deny_write_exec(vma, newflags)) {
+ if (map_deny_write_exec(vma->vm_flags, newflags)) {
error = -EACCES;
break;
}
--- a/mm/vma.h~mm-refactor-map_deny_write_exec
+++ a/mm/vma.h
@@ -42,7 +42,7 @@ struct vma_munmap_struct {
int vma_count; /* Number of vmas that will be removed */
bool unlock; /* Unlock after the munmap */
bool clear_ptes; /* If there are outstanding PTE to be cleared */
- /* 1 byte hole */
+ /* 2 byte hole */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
fork-do-not-invoke-uffd-on-fork-if-error-occurs.patch
fork-only-invoke-khugepaged-ksm-hooks-if-no-error.patch
mm-vma-add-expand-only-vma-merge-mode-and-optimise-do_brk_flags.patch
tools-testing-add-expand-only-mode-vma-test.patch
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
The patch titled
Subject: mm: unconditionally close VMAs on error
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-unconditionally-close-vmas-on-error.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: unconditionally close VMAs on error
Date: Wed, 23 Oct 2024 21:38:27 +0100
Incorrect invokation of VMA callbacks when the VMA is no longer in a
consistent state is bug prone and risky to perform.
With regards to the important vm_ops->close() callback We have gone to
great lengths to try to track whether or not we ought to close VMAs.
Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.
We introduce a new function to do so - vma_close() - and simplify existing
vms logic which tracked whether we needed to close or not.
This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
Link: https://lkml.kernel.org/r/72a81a6fb997508db644313a5fdf4d239f620da6.17297152…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Jann Horn <jannh(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/internal.h | 18 ++++++++++++++++++
mm/mmap.c | 5 ++---
mm/nommu.c | 3 +--
mm/vma.c | 14 +++++---------
mm/vma.h | 4 +---
5 files changed, 27 insertions(+), 17 deletions(-)
--- a/mm/internal.h~mm-unconditionally-close-vmas-on-error
+++ a/mm/internal.h
@@ -135,6 +135,24 @@ static inline int mmap_file(struct file
return err;
}
+/*
+ * If the VMA has a close hook then close it, and since closing it might leave
+ * it in an inconsistent state which makes the use of any hooks suspect, clear
+ * them down by installing dummy empty hooks.
+ */
+static inline void vma_close(struct vm_area_struct *vma)
+{
+ if (vma->vm_ops && vma->vm_ops->close) {
+ vma->vm_ops->close(vma);
+
+ /*
+ * The mapping is in an inconsistent state, and no further hooks
+ * may be invoked upon it.
+ */
+ vma->vm_ops = &vma_dummy_vm_ops;
+ }
+}
+
#ifdef CONFIG_MMU
/* Flags for folio_pte_batch(). */
--- a/mm/mmap.c~mm-unconditionally-close-vmas-on-error
+++ a/mm/mmap.c
@@ -1572,8 +1572,7 @@ expanded:
return addr;
close_and_free_vma:
- if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (file || vma->vm_file) {
unmap_and_free_vma:
@@ -1933,7 +1932,7 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
+ remove_vma(vma, /* unreachable = */ true);
count++;
cond_resched();
vma = vma_next(&vmi);
--- a/mm/nommu.c~mm-unconditionally-close-vmas-on-error
+++ a/mm/nommu.c
@@ -589,8 +589,7 @@ static int delete_vma_from_mm(struct vm_
*/
static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
{
- if (vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
put_nommu_region(vma->vm_region);
--- a/mm/vma.c~mm-unconditionally-close-vmas-on-error
+++ a/mm/vma.c
@@ -323,11 +323,10 @@ static bool can_vma_merge_right(struct v
/*
* Close a vm structure and free it.
*/
-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
+void remove_vma(struct vm_area_struct *vma, bool unreachable)
{
might_sleep();
- if (!closed && vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
@@ -1115,9 +1114,7 @@ void vms_clean_up_area(struct vma_munmap
vms_clear_ptes(vms, mas_detach, true);
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- if (vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
- vms->closed_vm_ops = true;
+ vma_close(vma);
}
/*
@@ -1160,7 +1157,7 @@ void vms_complete_munmap_vmas(struct vma
/* Remove and clean up vmas */
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- remove_vma(vma, /* = */ false, vms->closed_vm_ops);
+ remove_vma(vma, /* unreachable = */ false);
vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
@@ -1684,8 +1681,7 @@ struct vm_area_struct *copy_vma(struct v
return new_vma;
out_vma_link:
- if (new_vma->vm_ops && new_vma->vm_ops->close)
- new_vma->vm_ops->close(new_vma);
+ vma_close(new_vma);
if (new_vma->vm_file)
fput(new_vma->vm_file);
--- a/mm/vma.h~mm-unconditionally-close-vmas-on-error
+++ a/mm/vma.h
@@ -42,7 +42,6 @@ struct vma_munmap_struct {
int vma_count; /* Number of vmas that will be removed */
bool unlock; /* Unlock after the munmap */
bool clear_ptes; /* If there are outstanding PTE to be cleared */
- bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */
/* 1 byte hole */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
@@ -198,7 +197,6 @@ static inline void init_vma_munmap(struc
vms->unmap_start = FIRST_USER_ADDRESS;
vms->unmap_end = USER_PGTABLES_CEILING;
vms->clear_ptes = false;
- vms->closed_vm_ops = false;
}
#endif
@@ -269,7 +267,7 @@ int do_vmi_munmap(struct vma_iterator *v
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
+void remove_vma(struct vm_area_struct *vma, bool unreachable);
void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct vm_area_struct *next);
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
fork-do-not-invoke-uffd-on-fork-if-error-occurs.patch
fork-only-invoke-khugepaged-ksm-hooks-if-no-error.patch
mm-vma-add-expand-only-vma-merge-mode-and-optimise-do_brk_flags.patch
tools-testing-add-expand-only-mode-vma-test.patch
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
The patch titled
Subject: mm: avoid unsafe VMA hook invocation when error arises on mmap hook
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: avoid unsafe VMA hook invocation when error arises on mmap hook
Date: Wed, 23 Oct 2024 21:38:26 +0100
Patch series "fix error handling in mmap_region() and refactor", v2.
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently observed
issues with resource leaks and observable inconsistent state.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the VMA
is in an inconsistent state.
The first four patches are intended for backporting to correct the
possibility of people encountering corrupted state while invoking mmap()
which is otherwise at risk of happening.
After this we go further, refactoring the code, placing it in mm/vma.c in
order to make it eventually userland testable, and significantly
simplifying the logic to avoid this issue arising in future.
This patch (of 4):
After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced,
meaning that we need complicated handling to ensure we do not incorrectly
call these hooks.
We can avoid the whole issue by treating the VMA as suspect the moment
that the file->f_ops->mmap() function reports an error by replacing
whatever VMA operations were installed with a dummy empty set of VMA
operations.
We do so through a new helper function internal to mm - mmap_file() -
which is both more logically named than the existing call_mmap() function
and correctly isolates handling of the vm_op reassignment to mm.
All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.
It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:
ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
coda_file_operations -> mmap -> coda_file_mmap()
shm_file_operations -> shm_mmap()
shm_file_operations_huge -> shm_mmap()
dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
-> i915_gem_dmabuf_mmap()
None of these callers interact with vm_ops or mappings in a problematic way
on error, quickly exiting out.
Link: https://lkml.kernel.org/r/cover.1729715266.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/69f3c04df1ece2b7d402a29451ec19290ff429a4.17297152…
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Jann Horn <jannh(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/internal.h | 27 +++++++++++++++++++++++++++
mm/mmap.c | 6 +++---
mm/nommu.c | 4 ++--
3 files changed, 32 insertions(+), 5 deletions(-)
--- a/mm/internal.h~mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook
+++ a/mm/internal.h
@@ -108,6 +108,33 @@ static inline void *folio_raw_mapping(co
return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
}
+/*
+ * This is a file-backed mapping, and is about to be memory mapped - invoke its
+ * mmap hook and safely handle error conditions. On error, VMA hooks will be
+ * mutated.
+ *
+ * @file: File which backs the mapping.
+ * @vma: VMA which we are mapping.
+ *
+ * Returns: 0 if success, error otherwise.
+ */
+static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
+{
+ int err = call_mmap(file, vma);
+
+ if (likely(!err))
+ return 0;
+
+ /*
+ * OK, we tried to call the file hook for mmap(), but an error
+ * arose. The mapping is in an inconsistent state and we most not invoke
+ * any further hooks on it.
+ */
+ vma->vm_ops = &vma_dummy_vm_ops;
+
+ return err;
+}
+
#ifdef CONFIG_MMU
/* Flags for folio_pte_batch(). */
--- a/mm/mmap.c~mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook
+++ a/mm/mmap.c
@@ -1421,7 +1421,7 @@ unsigned long mmap_region(struct file *f
/*
* clear PTEs while the vma is still in the tree so that rmap
* cannot race with the freeing later in the truncate scenario.
- * This is also needed for call_mmap(), which is why vm_ops
+ * This is also needed for mmap_file(), which is why vm_ops
* close function is called.
*/
vms_clean_up_area(&vms, &mas_detach);
@@ -1446,7 +1446,7 @@ unsigned long mmap_region(struct file *f
if (file) {
vma->vm_file = get_file(file);
- error = call_mmap(file, vma);
+ error = mmap_file(file, vma);
if (error)
goto unmap_and_free_vma;
@@ -1469,7 +1469,7 @@ unsigned long mmap_region(struct file *f
vma_iter_config(&vmi, addr, end);
/*
- * If vm_flags changed after call_mmap(), we should try merge
+ * If vm_flags changed after mmap_file(), we should try merge
* vma again as we may succeed this time.
*/
if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
--- a/mm/nommu.c~mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook
+++ a/mm/nommu.c
@@ -885,7 +885,7 @@ static int do_mmap_shared_file(struct vm
{
int ret;
- ret = call_mmap(vma->vm_file, vma);
+ ret = mmap_file(vma->vm_file, vma);
if (ret == 0) {
vma->vm_region->vm_top = vma->vm_region->vm_end;
return 0;
@@ -918,7 +918,7 @@ static int do_mmap_private(struct vm_are
* happy.
*/
if (capabilities & NOMMU_MAP_DIRECT) {
- ret = call_mmap(vma->vm_file, vma);
+ ret = mmap_file(vma->vm_file, vma);
/* shouldn't return success if we're not sharing */
if (WARN_ON_ONCE(!is_nommu_shared_mapping(vma->vm_flags)))
ret = -ENOSYS;
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
fork-do-not-invoke-uffd-on-fork-if-error-occurs.patch
fork-only-invoke-khugepaged-ksm-hooks-if-no-error.patch
mm-vma-add-expand-only-vma-merge-mode-and-optimise-do_brk_flags.patch
tools-testing-add-expand-only-mode-vma-test.patch
mm-avoid-unsafe-vma-hook-invocation-when-error-arises-on-mmap-hook.patch
mm-unconditionally-close-vmas-on-error.patch
mm-refactor-map_deny_write_exec.patch
mm-resolve-faulty-mmap_region-error-path-behaviour.patch
tools-testing-add-additional-vma_internalh-stubs.patch
mm-isolate-mmap-internal-logic-to-mm-vmac.patch
mm-refactor-__mmap_region.patch
mm-defer-second-attempt-at-merge-on-mmap.patch
selftests-mm-add-pkey_sighandler_xx-hugetlb_dio-to-gitignore.patch
mm-refactor-mm_access-to-not-return-null.patch
mm-refactor-mm_access-to-not-return-null-fix.patch
mm-madvise-unrestrict-process_madvise-for-current-process.patch
maple_tree-do-not-hash-pointers-on-dump-in-debug-mode.patch
tools-testing-fix-phys_addr_t-size-on-64-bit-systems.patch
With the transition of pd-mapper into the kernel, the timing was altered
such that on some targets the initial rpmsg_send() requests from
pmic_glink clients would be attempted before the firmware had announced
intents, and the firmware reject intent requests.
Fix this
Signed-off-by: Bjorn Andersson <bjorn.andersson(a)oss.qualcomm.com>
---
Bjorn Andersson (2):
rpmsg: glink: Handle rejected intent request better
soc: qcom: pmic_glink: Handle GLINK intent allocation rejections
drivers/rpmsg/qcom_glink_native.c | 10 +++++++---
drivers/soc/qcom/pmic_glink.c | 18 +++++++++++++++---
2 files changed, 22 insertions(+), 6 deletions(-)
---
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
change-id: 20241022-pmic-glink-ecancelled-d899a9ca0358
Best regards,
--
Bjorn Andersson <bjorn.andersson(a)oss.qualcomm.com>
From: Jeff Xu <jeffxu(a)chromium.org>
Two fixes for madvise(MADV_DONTNEED) when sealed.
For PROT_NONE mappings, the previous blocking of
madvise(MADV_DONTNEED) is unnecessary. As PROT_NONE already prohibits
memory access, madvise(MADV_DONTNEED) should be allowed to proceed in
order to free the page.
For file-backed, private, read-only memory mappings, we previously did
not block the madvise(MADV_DONTNEED). This was based on
the assumption that the memory's content, being file-backed, could be
retrieved from the file if accessed again. However, this assumption
failed to consider scenarios where a mapping is initially created as
read-write, modified, and subsequently changed to read-only. The newly
introduced VM_WASWRITE flag addresses this oversight.
Reported-by: Pedro Falcato <pedro.falcato(a)gmail.com>
Link:https://lore.kernel.org/all/CABi2SkW2XzuZ2-TunWOVzTEX1qc29LhjfNQ3hD4Ny…
Fixes: 8be7258aad44 ("mseal: add mseal syscall")
Cc: <stable(a)vger.kernel.org> # 6.11.y: 4d1b3416659b: mm: move can_modify_vma to mm/vma.h
Cc: <stable(a)vger.kernel.org> # 6.11.y: 4a2dd02b0916: mm/mprotect: replace can_modify_mm with can_modify_vma
Cc: <stable(a)vger.kernel.org> # 6.11.y: 23c57d1fa2b9: mseal: replace can_modify_mm_madv with a vma variant
Cc: <stable(a)vger.kernel.org> # 6.11.y
Signed-off-by: Jeff Xu <jeffxu(a)chromium.org>
---
include/linux/mm.h | 2 ++
mm/mprotect.c | 3 +++
mm/mseal.c | 42 ++++++++++++++++++++++++++++++++++++------
3 files changed, 41 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4c32003c8404..b402eca2565a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -430,6 +430,8 @@ extern unsigned int kobjsize(const void *objp);
#ifdef CONFIG_64BIT
/* VM is sealed, in vm_flags */
#define VM_SEALED _BITUL(63)
+/* VM was writable */
+#define VM_WASWRITE _BITUL(62)
#endif
/* Bits set in the VMA until the stack is in its final location */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d06107d..6397135ca526 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -821,6 +821,9 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
break;
}
+ if ((vma->vm_flags & VM_WRITE) && !(newflags & VM_WRITE))
+ newflags |= VM_WASWRITE;
+
error = security_file_mprotect(vma, reqprot, prot);
if (error)
break;
diff --git a/mm/mseal.c b/mm/mseal.c
index ece977bd21e1..28f28487be17 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -36,12 +36,8 @@ static bool is_madv_discard(int behavior)
return false;
}
-static bool is_ro_anon(struct vm_area_struct *vma)
+static bool anon_is_ro(struct vm_area_struct *vma)
{
- /* check anonymous mapping. */
- if (vma->vm_file || vma->vm_flags & VM_SHARED)
- return false;
-
/*
* check for non-writable:
* PROT=RO or PKRU is not writeable.
@@ -53,6 +49,22 @@ static bool is_ro_anon(struct vm_area_struct *vma)
return false;
}
+static bool vma_is_prot_none(struct vm_area_struct *vma)
+{
+ if ((vma->vm_flags & VM_ACCESS_FLAGS) == VM_NONE)
+ return true;
+
+ return false;
+}
+
+static bool vma_was_writable_turn_readonly(struct vm_area_struct *vma)
+{
+ if (!(vma->vm_flags & VM_WRITE) && vma->vm_flags & VM_WASWRITE)
+ return true;
+
+ return false;
+}
+
/*
* Check if a vma is allowed to be modified by madvise.
*/
@@ -61,7 +73,25 @@ bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
if (!is_madv_discard(behavior))
return true;
- if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
+ /* not sealed */
+ if (likely(can_modify_vma(vma)))
+ return true;
+
+ /* PROT_NONE mapping */
+ if (vma_is_prot_none(vma))
+ return true;
+
+ /* file-backed private mapping */
+ if (vma->vm_file) {
+ /* read-only but was writeable */
+ if (vma_was_writable_turn_readonly(vma))
+ return false;
+
+ return true;
+ }
+
+ /* anonymous mapping is read-only */
+ if (anon_is_ro(vma))
return false;
/* Allow by default. */
--
2.47.0.rc1.288.g06298d1525-goog
tpm2_load_null() has weak and broken error handling:
- The return value of tpm2_create_primary() is ignored.
- Leaks TPM return codes from tpm2_load_context() to the caller.
- If the key name comparison succeeds returns previous error
instead of zero to the caller.
Implement a proper error rollback.
Cc: stable(a)vger.kernel.org # v6.10+
Fixes: eb24c9788cd9 ("tpm: disable the TPM if NULL name changes")
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
--
v6:
- Address Stefan's remark:
https://lore.kernel.org/linux-integrity/def4ec2d-584b-405f-9d5e-99267013c3c…
v5:
- Fix the TPM error code leak from tpm2_load_context().
v4:
- No changes.
v3:
- Update log messages. Previously the log message incorrectly stated
on load failure that integrity check had been failed, even tho the
check is done *after* the load operation.
v2:
- Refined the commit message.
- Reverted tpm2_create_primary() changes. They are not required if
tmp_null_key is used as the parameter.
---
drivers/char/tpm/tpm2-sessions.c | 43 +++++++++++++++++---------------
1 file changed, 23 insertions(+), 20 deletions(-)
diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c
index 1e12e0b2492e..bdac11964b55 100644
--- a/drivers/char/tpm/tpm2-sessions.c
+++ b/drivers/char/tpm/tpm2-sessions.c
@@ -915,33 +915,36 @@ static int tpm2_parse_start_auth_session(struct tpm2_auth *auth,
static int tpm2_load_null(struct tpm_chip *chip, u32 *null_key)
{
- int rc;
unsigned int offset = 0; /* dummy offset for null seed context */
u8 name[SHA256_DIGEST_SIZE + 2];
+ u32 tmp_null_key;
+ int rc;
rc = tpm2_load_context(chip, chip->null_key_context, &offset,
- null_key);
- if (rc != -EINVAL)
- return rc;
+ &tmp_null_key);
+ if (rc != -EINVAL) {
+ if (!rc)
+ *null_key = tmp_null_key;
+ goto err;
+ }
- /* an integrity failure may mean the TPM has been reset */
- dev_err(&chip->dev, "NULL key integrity failure!\n");
- /* check the null name against what we know */
- tpm2_create_primary(chip, TPM2_RH_NULL, NULL, name);
- if (memcmp(name, chip->null_key_name, sizeof(name)) == 0)
- /* name unchanged, assume transient integrity failure */
- return rc;
- /*
- * Fatal TPM failure: the NULL seed has actually changed, so
- * the TPM must have been illegally reset. All in-kernel TPM
- * operations will fail because the NULL primary can't be
- * loaded to salt the sessions, but disable the TPM anyway so
- * userspace programmes can't be compromised by it.
- */
- dev_err(&chip->dev, "NULL name has changed, disabling TPM due to interference\n");
+ rc = tpm2_create_primary(chip, TPM2_RH_NULL, &tmp_null_key, name);
+ if (rc)
+ goto err;
+
+ /* Return the null key if the name has not been changed: */
+ if (memcmp(name, chip->null_key_name, sizeof(name)) == 0) {
+ *null_key = tmp_null_key;
+ return 0;
+ }
+
+ /* Deduce from the name change TPM interference: */
+ dev_err(&chip->dev, "the null key integrity check failedh\n");
+ tpm2_flush_context(chip, tmp_null_key);
chip->flags |= TPM_CHIP_FLAG_DISABLE;
- return rc;
+err:
+ return rc ? -ENODEV : 0;
}
/**
--
2.47.0
From: Oliver Neukum <oneukum(a)suse.com>
[ Upstream commit e5876b088ba03a62124266fa20d00e65533c7269 ]
ipheth_sndbulk_callback() can submit carrier_work
as a part of its error handling. That means that
the driver must make sure that the work is cancelled
after it has made sure that no more URB can terminate
with an error condition.
Hence the order of actions in ipheth_close() needs
to be inverted.
Signed-off-by: Oliver Neukum <oneukum(a)suse.com>
Signed-off-by: Foster Snowhill <forst(a)pen.gy>
Tested-by: Georgi Valkov <gvalkov(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/net/usb/ipheth.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/usb/ipheth.c b/drivers/net/usb/ipheth.c
index 73ad78f47763c..7814856636907 100644
--- a/drivers/net/usb/ipheth.c
+++ b/drivers/net/usb/ipheth.c
@@ -353,8 +353,8 @@ static int ipheth_close(struct net_device *net)
{
struct ipheth_device *dev = netdev_priv(net);
- cancel_delayed_work_sync(&dev->carrier_work);
netif_stop_queue(net);
+ cancel_delayed_work_sync(&dev->carrier_work);
return 0;
}
--
2.43.0
commit f956052e00de211b5c9ebaa1958366c23f82ee9e upstream.
font.data may not initialize all memory spaces depending on the implementation
of vc->vc_sw->con_font_get. This may cause info-leak, so to prevent this, it
is safest to modify it to initialize the allocated memory space to 0, and it
generally does not affect the overall performance of the system.
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+955da2d57931604ee691(a)syzkaller.appspotmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jeongjun Park <aha310510(a)gmail.com>
Link: https://lore.kernel.org/r/20241010174619.59662-1-aha310510@gmail.com
---
drivers/tty/vt/vt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 5f1183b0b89d..800979e8d5b6 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -4398,7 +4398,7 @@ static int con_font_get(struct vc_data *vc, struct console_font_op *op)
int c;
if (op->data) {
- font.data = kmalloc(max_font_size, GFP_KERNEL);
+ font.data = kzalloc(max_font_size, GFP_KERNEL);
if (!font.data)
return -ENOMEM;
} else
--
From: Dominique Martinet <asmadeus(a)codewreck.org>
[ Upstream commit 38d222b3163f7b7d737e5d999ffc890a12870e36 ]
It's possible for v9fs_fid_find "find by dentry" branch to not turn up
anything despite having an entry set (because e.g. uid doesn't match),
in which case the calling code will generally make an extra lookup
to the server.
In this case we might have had better luck looking by inode, so fall
back to look up by inode if we have one and the lookup by dentry failed.
Message-Id: <20240523210024.1214386-1-asmadeus(a)codewreck.org>
Reviewed-by: Christian Schoenebeck <linux_oss(a)crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus(a)codewreck.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/9p/fid.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/9p/fid.c b/fs/9p/fid.c
index de009a33e0e26..f84412290a30c 100644
--- a/fs/9p/fid.c
+++ b/fs/9p/fid.c
@@ -131,10 +131,9 @@ static struct p9_fid *v9fs_fid_find(struct dentry *dentry, kuid_t uid, int any)
}
}
spin_unlock(&dentry->d_lock);
- } else {
- if (dentry->d_inode)
- ret = v9fs_fid_find_inode(dentry->d_inode, false, uid, any);
}
+ if (!ret && dentry->d_inode)
+ ret = v9fs_fid_find_inode(dentry->d_inode, false, uid, any);
return ret;
}
--
2.43.0
From: Dominique Martinet <asmadeus(a)codewreck.org>
[ Upstream commit 38d222b3163f7b7d737e5d999ffc890a12870e36 ]
It's possible for v9fs_fid_find "find by dentry" branch to not turn up
anything despite having an entry set (because e.g. uid doesn't match),
in which case the calling code will generally make an extra lookup
to the server.
In this case we might have had better luck looking by inode, so fall
back to look up by inode if we have one and the lookup by dentry failed.
Message-Id: <20240523210024.1214386-1-asmadeus(a)codewreck.org>
Reviewed-by: Christian Schoenebeck <linux_oss(a)crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus(a)codewreck.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/9p/fid.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/9p/fid.c b/fs/9p/fid.c
index de009a33e0e26..f84412290a30c 100644
--- a/fs/9p/fid.c
+++ b/fs/9p/fid.c
@@ -131,10 +131,9 @@ static struct p9_fid *v9fs_fid_find(struct dentry *dentry, kuid_t uid, int any)
}
}
spin_unlock(&dentry->d_lock);
- } else {
- if (dentry->d_inode)
- ret = v9fs_fid_find_inode(dentry->d_inode, false, uid, any);
}
+ if (!ret && dentry->d_inode)
+ ret = v9fs_fid_find_inode(dentry->d_inode, false, uid, any);
return ret;
}
--
2.43.0
This problem reported by Clement LE GOFFIC manifest when
using CONFIG_KASAN_IN_VMALLOC and VMAP_STACK:
https://lore.kernel.org/linux-arm-kernel/a1a1d062-f3a2-4d05-9836-3b098de9db…
After some analysis it seems we are missing to sync the
VMALLOC shadow memory in top level PGD to all CPUs.
Add some code to perform this sync, and the bug appears
to go away.
As suggested by Ard, also perform a dummy read from the
shadow memory of the new VMAP_STACK in the low level
assembly.
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
Changes in v4:
- Since Kasan is not using header stubs, it is necessary to avoid
kasan_*() calls using ifdef when compiling without KASAN.
- Lift a line aligning the end of vmalloc from Melon Liu's
very similar patch so we have feature parity, credit Melon
as co-developer.
- Include the atomic_read_acquire() patch in the series due
to context requirements.
- Verify that the after the patch the kernel still builds and boots
without Kasan.
- Link to v3: https://lore.kernel.org/r/20241017-arm-kasan-vmalloc-crash-v3-0-d2a34cd5b66…
Changes in v3:
- Collect Mark Rutlands ACK on patch 1
- Change the simplified assembly add r2, ip, lsr #n to the canonical
add r2, r2, ip, lsr #n in patch 2.
- Link to v2: https://lore.kernel.org/r/20241016-arm-kasan-vmalloc-crash-v2-0-0a52fd086ee…
Changes in v2:
- Implement the two helper functions suggested by Russell
making the KASAN PGD copying less messy.
- Link to v1: https://lore.kernel.org/r/20241015-arm-kasan-vmalloc-crash-v1-0-dbb23592ca8…
---
Linus Walleij (3):
ARM: ioremap: Sync PGDs for VMALLOC shadow
ARM: entry: Do a dummy read from VMAP shadow
mm: Pair atomic_set_release() with _read_acquire()
arch/arm/kernel/entry-armv.S | 8 ++++++++
arch/arm/mm/ioremap.c | 35 ++++++++++++++++++++++++++++++-----
2 files changed, 38 insertions(+), 5 deletions(-)
---
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
change-id: 20241015-arm-kasan-vmalloc-crash-fcbd51416457
Best regards,
--
Linus Walleij <linus.walleij(a)linaro.org>