The patch titled
Subject: lib: alloc_tag_module_unload must wait for pending kfree_rcu calls
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
lib-alloc_tag_module_unload-must-wait-for-pending-kfree_rcu-calls.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Florian Westphal <fw(a)strlen.de>
Subject: lib: alloc_tag_module_unload must wait for pending kfree_rcu calls
Date: Mon, 7 Oct 2024 22:52:24 +0200
Ben Greear reports the following splat:
------------[ cut here ]------------
net/netfilter/nf_nat_core.c:1114 module nf_nat func:nf_nat_register_fn has 256 allocated at module unload
WARNING: CPU: 1 PID: 10421 at lib/alloc_tag.c:168 alloc_tag_module_unload+0x22b/0x3f0
Modules linked in: nf_nat(-) btrfs ufs qnx4 hfsplus hfs minix vfat msdos fat
...
Hardware name: Default string Default string/SKYBAY, BIOS 5.12 08/04/2020
RIP: 0010:alloc_tag_module_unload+0x22b/0x3f0
codetag_unload_module+0x19b/0x2a0
? codetag_load_module+0x80/0x80
The nf_nat module exit path calls kfree_rcu on those addresses, but the
free operation is likely still pending by the time alloc_tag checks for
leaks.
Waiting for outstanding kfree_rcu operations to complete before checking
resolves this warning.
Reproducer:
unshare -n iptables-nft -t nat -A PREROUTING -p tcp
grep nf_nat /proc/allocinfo # will list 4 allocations
rmmod nft_chain_nat
rmmod nf_nat # will WARN.
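To make the ordering concrete, here is a minimal sketch (hypothetical
"demo" names, not the actual nf_nat code) of a module exit path whose
frees are still queued when the codetag unload hook runs:

#include <linux/module.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_obj {
	struct rcu_head rcu;
	int data;
};

static struct demo_obj *demo_obj_ptr;

static void __exit demo_exit(void)
{
	/*
	 * Only queues the free; the memory is not returned yet, so the
	 * per-module allocation counter is still non-zero here.
	 */
	kfree_rcu(demo_obj_ptr, rcu);
}

Without a barrier, alloc_tag_module_unload() can run before the queued
free completes and reports the allocation as a leak; calling
kvfree_rcu_barrier() in codetag_unload_module(), as in the hunk below,
waits for all pending kvfree_rcu()/kfree_rcu() requests first.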
Link: https://lkml.kernel.org/r/20241007205236.11847-1-fw@strlen.de
Fixes: a473573964e5 ("lib: code tagging module support")
Signed-off-by: Florian Westphal <fw(a)strlen.de>
Reported-by: Ben Greear <greearb(a)candelatech.com>
Closes: https://lore.kernel.org/netdev/bdaaef9d-4364-4171-b82b-bcfc12e207eb@candela…
Cc: Uladzislau Rezki <urezki(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/codetag.c | 2 ++
1 file changed, 2 insertions(+)
--- a/lib/codetag.c~lib-alloc_tag_module_unload-must-wait-for-pending-kfree_rcu-calls
+++ a/lib/codetag.c
@@ -228,6 +228,8 @@ bool codetag_unload_module(struct module
if (!mod)
return true;
+ kvfree_rcu_barrier();
+
mutex_lock(&codetag_lock);
list_for_each_entry(cttype, &codetag_types, link) {
struct codetag_module *found = NULL;
_
Patches currently in -mm which might be from fw(a)strlen.de are
lib-alloc_tag_module_unload-must-wait-for-pending-kfree_rcu-calls.patch
The patch titled
Subject: mm/mremap: fix move_normal_pmd/retract_page_tables race
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mremap-fix-move_normal_pmd-retract_page_tables-race.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm/mremap: fix move_normal_pmd/retract_page_tables race
Date: Mon, 07 Oct 2024 23:42:04 +0200
In mremap(), move_page_tables() looks at the type of the PMD entry and the
specified address range to figure out by which method the next chunk of
page table entries should be moved.
At that point, the mmap_lock is held in write mode, but no rmap locks are
held yet. For PMD entries that point to page tables and are fully covered
by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called,
which first takes rmap locks, then does move_normal_pmd().
move_normal_pmd() takes the necessary page table locks at source and
destination, then moves an entire page table from the source to the
destination.
The problem is: The rmap locks, which protect against concurrent page
table removal by retract_page_tables() in the THP code, are only taken
after the PMD entry has been read and it has been decided how to move it.
So we can race as follows (with two processes that have mappings of the
same tmpfs file that is stored on a tmpfs mount with huge=advise); note
that process A accesses page tables through the MM while process B does it
through the file rmap:
process A                     process B
=========                     =========
mremap
  mremap_to
    move_vma
      move_page_tables
        get_old_pmd
        alloc_new_pmd
                      *** PREEMPT ***
                              madvise(MADV_COLLAPSE)
                                do_madvise
                                  madvise_walk_vmas
                                    madvise_vma_behavior
                                      madvise_collapse
                                        hpage_collapse_scan_file
                                          collapse_file
                                            retract_page_tables
                                              i_mmap_lock_read(mapping)
                                              pmdp_collapse_flush
                                              i_mmap_unlock_read(mapping)
        move_pgt_entry(NORMAL_PMD, ...)
          take_rmap_locks
          move_normal_pmd
          drop_rmap_locks
When this happens, move_normal_pmd() can end up creating bogus PMD entries
in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`. The effect
depends on arch-specific and machine-specific details; on x86, you can end
up with physical page 0 mapped as a page table, which is likely
exploitable for user->kernel privilege escalation.
Fix the race by letting process A recheck that the PMD still points to a
page table after the rmap locks have been taken. Otherwise, we bail and
let the caller fall back to the PTE-level copying path, which will then
bail immediately at the pmd_none() check.
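Reduced to a fragment of move_normal_pmd() (the real hunk is further
down), this is the usual revalidate-after-locking pattern:

	old_ptl = pmd_lock(mm, old_pmd);
	new_ptl = pmd_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	pmd = *old_pmd;
	/*
	 * The PMD entry was sampled in move_page_tables() before the rmap
	 * locks were taken; retract_page_tables() may have cleared it or a
	 * leaf entry may have been installed in the meantime.  Re-read it
	 * under the page table lock and bail if it no longer points to a
	 * page table, instead of moving a bogus entry.
	 */
	if (unlikely(!pmd_present(pmd) || pmd_leaf(pmd)))
		goto out_unlock;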
Bug reachability: Reaching this bug requires that you can create
shmem/file THP mappings - anonymous THP uses different code that doesn't
zap stuff under rmap locks. File THP is gated on an experimental config
flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need
shmem THP to hit this bug. As far as I know, getting shmem THP normally
requires that you can mount your own tmpfs with the right mount flags,
which would require creating your own user+mount namespace; though I don't
know if some distros maybe enable shmem THP by default or something like
that.
Bug impact: This issue can likely be used for user->kernel privilege
escalation when it is reachable.
Link: https://lkml.kernel.org/r/20241007-move_normal_pmd-vs-collapse-fix-2-v1-1-5…
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Closes: https://project-zero.issues.chromium.org/371047675
Co-developed-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Joel Fernandes <joel(a)joelfernandes.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mremap.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
--- a/mm/mremap.c~mm-mremap-fix-move_normal_pmd-retract_page_tables-race
+++ a/mm/mremap.c
@@ -238,6 +238,7 @@ static bool move_normal_pmd(struct vm_ar
{
spinlock_t *old_ptl, *new_ptl;
struct mm_struct *mm = vma->vm_mm;
+ bool res = false;
pmd_t pmd;
if (!arch_supports_page_table_move())
@@ -277,19 +278,25 @@ static bool move_normal_pmd(struct vm_ar
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
- /* Clear the pmd */
pmd = *old_pmd;
+
+ /* Racing with collapse? */
+ if (unlikely(!pmd_present(pmd) || pmd_leaf(pmd)))
+ goto out_unlock;
+ /* Clear the pmd */
pmd_clear(old_pmd);
+ res = true;
VM_BUG_ON(!pmd_none(*new_pmd));
pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
+out_unlock:
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
- return true;
+ return res;
}
#else
static inline bool move_normal_pmd(struct vm_area_struct *vma,
_
Patches currently in -mm which might be from jannh(a)google.com are
mm-enforce-a-minimal-stack-gap-even-against-inaccessible-vmas.patch
mm-mremap-fix-move_normal_pmd-retract_page_tables-race.patch
The quilt patch titled
Subject: mm/mremap: prevent racing change of old pmd type
has been removed from the -mm tree. Its filename was
mm-mremap-prevent-racing-change-of-old-pmd-type.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm/mremap: prevent racing change of old pmd type
Date: Wed, 02 Oct 2024 23:07:06 +0200
Prevent move_normal_pmd() in mremap() from racing with
retract_page_tables() in MADVISE_COLLAPSE such that
pmd_populate(mm, new_pmd, pmd_pgtable(pmd))
operates on an empty source pmd, causing creation of a new pmd which maps
physical address 0 as a page table.
This bug is only reachable if either CONFIG_READ_ONLY_THP_FOR_FS is set or
THP shmem is usable. (Unprivileged namespaces can be used to set up a
tmpfs that can contain THP shmem pages with "huge=advise".)
If userspace triggers this bug *in multiple processes*, this could likely
be used to create stale TLB entries pointing to freed pages or cause
kernel UAF by breaking an invariant the rmap code relies on.
Fix it by moving the rmap locking up so that it covers the span from
reading the PMD entry to moving the page table.
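The shape of this (since superseded) approach, as a sketch over the
mm/mremap.c helpers and ignoring the PUD/THP and move_ptes() details:

	take_rmap_locks(vma);
	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
		if (!old_pmd)
			continue;
		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
		if (!new_pmd)
			break;
		/*
		 * *old_pmd is stable against retract_page_tables() here, so
		 * reading its type and moving the page table cannot race.
		 */
		move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
			       old_pmd, new_pmd);
	}
	drop_rmap_locks(vma);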
Link: https://lkml.kernel.org/r/20241002-move_normal_pmd-vs-collapse-fix-v1-1-782…
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mremap.c | 68 +++++++++++++++++++++++++++-----------------------
1 file changed, 38 insertions(+), 30 deletions(-)
--- a/mm/mremap.c~mm-mremap-prevent-racing-change-of-old-pmd-type
+++ a/mm/mremap.c
@@ -136,17 +136,17 @@ static pte_t move_soft_dirty_pte(pte_t p
static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
unsigned long old_addr, unsigned long old_end,
struct vm_area_struct *new_vma, pmd_t *new_pmd,
- unsigned long new_addr, bool need_rmap_locks)
+ unsigned long new_addr)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
bool force_flush = false;
unsigned long len = old_end - old_addr;
- int err = 0;
/*
- * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
+ * When need_rmap_locks is true in the caller, we are holding the
+ * i_mmap_rwsem and anon_vma
* locks to ensure that rmap will always observe either the old or the
* new ptes. This is the easiest way to avoid races with
* truncate_pagecache(), page migration, etc...
@@ -163,23 +163,18 @@ static int move_ptes(struct vm_area_stru
* serialize access to individual ptes, but only rmap traversal
* order guarantees that we won't miss both the old and new ptes).
*/
- if (need_rmap_locks)
- take_rmap_locks(vma);
/*
* We don't have to worry about the ordering of src and dst
* pte locks because exclusive mmap_lock prevents deadlock.
*/
old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
- if (!old_pte) {
- err = -EAGAIN;
- goto out;
- }
+ if (!old_pte)
+ return -EAGAIN;
new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
if (!new_pte) {
pte_unmap_unlock(old_pte, old_ptl);
- err = -EAGAIN;
- goto out;
+ return -EAGAIN;
}
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -217,10 +212,7 @@ static int move_ptes(struct vm_area_stru
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
-out:
- if (need_rmap_locks)
- drop_rmap_locks(vma);
- return err;
+ return 0;
}
#ifndef arch_supports_page_table_move
@@ -447,17 +439,14 @@ static __always_inline unsigned long get
/*
* Attempts to speedup the move by moving entry at the level corresponding to
* pgt_entry. Returns true if the move was successful, else false.
+ * rmap locks are held by the caller.
*/
static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
unsigned long old_addr, unsigned long new_addr,
- void *old_entry, void *new_entry, bool need_rmap_locks)
+ void *old_entry, void *new_entry)
{
bool moved = false;
- /* See comment in move_ptes() */
- if (need_rmap_locks)
- take_rmap_locks(vma);
-
switch (entry) {
case NORMAL_PMD:
moved = move_normal_pmd(vma, old_addr, new_addr, old_entry,
@@ -483,9 +472,6 @@ static bool move_pgt_entry(enum pgt_entr
break;
}
- if (need_rmap_locks)
- drop_rmap_locks(vma);
-
return moved;
}
@@ -550,6 +536,7 @@ unsigned long move_page_tables(struct vm
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
pud_t *old_pud, *new_pud;
+ int move_res;
if (!len)
return 0;
@@ -573,6 +560,12 @@ unsigned long move_page_tables(struct vm
old_addr, old_end);
mmu_notifier_invalidate_range_start(&range);
+ /*
+ * Hold rmap locks to ensure the type of the old PUD/PMD entry doesn't
+ * change under us due to khugepaged or folio splitting.
+ */
+ take_rmap_locks(vma);
+
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
/*
@@ -590,14 +583,14 @@ unsigned long move_page_tables(struct vm
if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
if (extent == HPAGE_PUD_SIZE) {
move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
- old_pud, new_pud, need_rmap_locks);
+ old_pud, new_pud);
/* We ignore and continue on error? */
continue;
}
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
- old_pud, new_pud, true))
+ old_pud, new_pud))
continue;
}
@@ -613,7 +606,7 @@ again:
pmd_devmap(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
- old_pmd, new_pmd, need_rmap_locks))
+ old_pmd, new_pmd))
continue;
split_huge_pmd(vma, old_pmd, old_addr);
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
@@ -623,17 +616,32 @@ again:
* moving at the PMD level if possible.
*/
if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
- old_pmd, new_pmd, true))
+ old_pmd, new_pmd))
continue;
}
if (pmd_none(*old_pmd))
continue;
- if (pte_alloc(new_vma->vm_mm, new_pmd))
+
+ /*
+ * Temporarily drop the rmap locks while we do a potentially
+ * slow move_ptes() operation, unless move_ptes() wants them
+ * held (see comment inside there).
+ */
+ if (!need_rmap_locks)
+ drop_rmap_locks(vma);
+ if (pte_alloc(new_vma->vm_mm, new_pmd)) {
+ if (!need_rmap_locks)
+ take_rmap_locks(vma);
break;
- if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
- new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
+ }
+ move_res = move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+ new_vma, new_pmd, new_addr);
+ if (!need_rmap_locks)
+ take_rmap_locks(vma);
+ if (move_res < 0)
goto again;
}
+ drop_rmap_locks(vma);
mmu_notifier_invalidate_range_end(&range);
_
Patches currently in -mm which might be from jannh(a)google.com are
mm-enforce-a-minimal-stack-gap-even-against-inaccessible-vmas.patch
mm-mremap-fix-move_normal_pmd-retract_page_tables-race.patch
Commit c938ab4da0eb ("net: phy: Manual remove LEDs to ensure correct
ordering") correctly fixed a problem with using devm_ but missed
removing the LED entry from the LEDs list.
This causes a kernel panic in a specific scenario where the port for the
PHY is torn down and up and the kmod for the PHY is removed.
On setting the port down the first time, the associated LEDs are
correctly unregistered. The kmod for the PHY is then removed.
The kmod is loaded again and the port is brought up, so the associated
LEDs are registered again.
On putting the port down a second time after these steps, the LED list
now has 4 elements: the first 2 already unregistered previously and the
2 new ones registered again.
This causes a kernel panic, as the first 2 elements should have been
removed.
Fix this by correctly removing the element when the LED is unregistered.
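For clarity, the unregister path after the fix looks roughly like this;
list_for_each_entry_safe() is needed because the current entry is
unlinked inside the loop:

static void phy_leds_unregister(struct phy_device *phydev)
{
	struct phy_led *phyled, *tmp;

	list_for_each_entry_safe(phyled, tmp, &phydev->leds, list) {
		led_classdev_unregister(&phyled->led_cdev);
		/* Drop the entry so a later unregister never sees it again. */
		list_del(&phyled->list);
	}
}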
Reported-by: Daniel Golle <daniel(a)makrotopia.org>
Tested-by: Daniel Golle <daniel(a)makrotopia.org>
Cc: stable(a)vger.kernel.org
Fixes: c938ab4da0eb ("net: phy: Manual remove LEDs to ensure correct ordering")
Signed-off-by: Christian Marangi <ansuelsmth(a)gmail.com>
Reviewed-by: Andrew Lunn <andrew(a)lunn.ch>
---
Changes v2:
- Drop second patch
- Add Reviewed-by tag
drivers/net/phy/phy_device.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 560e338b307a..499797646580 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -3326,10 +3326,11 @@ static __maybe_unused int phy_led_hw_is_supported(struct led_classdev *led_cdev,
static void phy_leds_unregister(struct phy_device *phydev)
{
- struct phy_led *phyled;
+ struct phy_led *phyled, *tmp;
- list_for_each_entry(phyled, &phydev->leds, list) {
+ list_for_each_entry_safe(phyled, tmp, &phydev->leds, list) {
led_classdev_unregister(&phyled->led_cdev);
+ list_del(&phyled->list);
}
}
--
2.45.2
The patch titled
Subject: mm: enforce a minimal stack gap even against inaccessible VMAs
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-enforce-a-minimal-stack-gap-even-against-inaccessible-vmas.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm: enforce a minimal stack gap even against inaccessible VMAs
Date: Tue, 08 Oct 2024 00:55:39 +0200
As explained in the comment block this change adds, we can't tell what
userspace's intent is when the stack grows towards an inaccessible VMA.
I have a (highly contrived) C testcase for 32-bit x86 userspace with glibc
that mixes malloc(), pthread creation, and recursion in just the right way
such that the main stack overflows into malloc() arena memory.
I don't know of any specific scenario where this is actually exploitable,
but it seems like it could be a security problem for sufficiently unlucky
userspace.
I believe we should ensure that, as long as code is compiled with
something like -fstack-check, a stack overflow in it can never cause the
main stack to overflow into adjacent heap memory.
My fix effectively reverts the behavior for !vma_is_accessible() VMAs to
the behavior before commit 1be7107fbe18 ("mm: larger stack guard gap,
between vmas"), so I think it should be a fairly safe change even in case
A.
Link: https://lkml.kernel.org/r/20241008-stack-gap-inaccessible-v1-1-848d4d891f21…
Fixes: 561b5e0709e4 ("mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: Ben Hutchings <ben(a)decadent.org.uk>
Cc: Helge Deller <deller(a)gmx.de>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Willy Tarreau <w(a)1wt.eu>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 46 insertions(+), 7 deletions(-)
--- a/mm/mmap.c~mm-enforce-a-minimal-stack-gap-even-against-inaccessible-vmas
+++ a/mm/mmap.c
@@ -1064,10 +1064,12 @@ static int expand_upwards(struct vm_area
gap_addr = TASK_SIZE;
next = find_vma_intersection(mm, vma->vm_end, gap_addr);
- if (next && vma_is_accessible(next)) {
- if (!(next->vm_flags & VM_GROWSUP))
+ if (next && !(next->vm_flags & VM_GROWSUP)) {
+ /* see comments in expand_downwards() */
+ if (vma_is_accessible(prev))
+ return -ENOMEM;
+ if (address == next->vm_start)
return -ENOMEM;
- /* Check that both stack segments have the same anon_vma? */
}
if (next)
@@ -1155,10 +1157,47 @@ int expand_downwards(struct vm_area_stru
/* Enforce stack_guard_gap */
prev = vma_prev(&vmi);
/* Check that both stack segments have the same anon_vma? */
- if (prev) {
- if (!(prev->vm_flags & VM_GROWSDOWN) &&
- vma_is_accessible(prev) &&
- (address - prev->vm_end < stack_guard_gap))
+ if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
+ (address - prev->vm_end < stack_guard_gap)) {
+ /*
+ * If the previous VMA is accessible, this is the normal case
+ * where the main stack is growing down towards some unrelated
+ * VMA. Enforce the full stack guard gap.
+ */
+ if (vma_is_accessible(prev))
+ return -ENOMEM;
+
+ /*
+ * If the previous VMA is not accessible, we have a problem:
+ * We can't tell what userspace's intent is.
+ *
+ * Case A:
+ * Maybe userspace wants to use the previous VMA as a
+ * "guard region" at the bottom of the main stack, in which case
+ * userspace wants us to grow the stack until it is adjacent to
+ * the guard region. Apparently some Java runtime environments
+ * and Rust do that?
+ * That is kind of ugly, and in that case userspace really ought
+ * to ensure that the stack is fully expanded immediately, but
+ * we have to handle this case.
+ *
+ * Case B:
+ * But maybe the previous VMA is entirely unrelated to the stack
+ * and is only *temporarily* PROT_NONE. For example, glibc
+ * malloc arenas create a big PROT_NONE region and then
+ * progressively mark parts of it as writable.
+ * In that case, we must not let the stack become adjacent to
+ * the previous VMA. Otherwise, after the region later becomes
+ * writable, a stack overflow will cause the stack to grow into
+ * the previous VMA, and we won't have any stack gap to protect
+ * against this.
+ *
+ * As an ugly tradeoff, enforce a single-page gap.
+ * A single page will hopefully be small enough to not be
+ * noticed in case A, while providing the same level of
+ * protection in case B that normal userspace threads get.
+ */
+ if (address == prev->vm_end)
return -ENOMEM;
}
_
Patches currently in -mm which might be from jannh(a)google.com are
mm-mremap-prevent-racing-change-of-old-pmd-type.patch
mm-enforce-a-minimal-stack-gap-even-against-inaccessible-vmas.patch
We have recently noticed the exact same KASAN splat as in commit
6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket
creation fails"). The problem is that commit did not fully address the
problem, as some pf->create implementations do not use sk_common_release
in their error paths.
For example, we can use the same reproducer as in the above commit, but
changing ping to arping. arping uses AF_PACKET socket and if packet_create
fails, it will just sk_free the allocated sk object.
While we could chase all the pf->create implementations and make sure they
NULL the freed sk object on error from the socket, we can't guarantee
future protocols will not make the same mistake.
So it is easier to just explicitly NULL the sk pointer upon return from
pf->create in __sock_create. We do know that pf->create always releases the
allocated sk object on error, so if the pointer is not NULL, it is
definitely dangling.
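For illustration, the failure mode looks roughly like this; demo_create,
demo_proto and demo_setup are hypothetical stand-ins for a pf->create
implementation such as packet_create(), not its actual code:

static int demo_create(struct net *net, struct socket *sock, int protocol,
		       int kern)
{
	struct sock *sk;

	sk = sk_alloc(net, PF_PACKET, GFP_KERNEL, &demo_proto, kern);
	if (!sk)
		return -ENOBUFS;
	sock->sk = sk;

	if (demo_setup(sk) < 0) {
		sk_free(sk);	/* sock->sk still points at freed memory */
		return -EINVAL;
	}
	return 0;
}

Clearing sock->sk in __sock_create() after a failed pf->create(), as in
the hunk below, makes the cleanup independent of how each protocol's
error path releases the sock.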
Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails")
Signed-off-by: Ignat Korchagin <ignat(a)cloudflare.com>
Cc: stable(a)vger.kernel.org
---
net/core/sock.c | 3 ---
net/socket.c | 7 ++++++-
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 039be95c40cf..e6e04081949c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3819,9 +3819,6 @@ void sk_common_release(struct sock *sk)
sk->sk_prot->unhash(sk);
- if (sk->sk_socket)
- sk->sk_socket->sk = NULL;
-
/*
* In this point socket cannot receive new packets, but it is possible
* that some packets are in flight because some CPU runs receiver and
diff --git a/net/socket.c b/net/socket.c
index 601ad74930ef..042451f01c65 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1574,8 +1574,13 @@ int __sock_create(struct net *net, int family, int type, int protocol,
rcu_read_unlock();
err = pf->create(net, sock, protocol, kern);
- if (err < 0)
+ if (err < 0) {
+ /* ->create should release the allocated sock->sk object on error
+ * but it may leave the dangling pointer
+ */
+ sock->sk = NULL;
goto out_module_put;
+ }
/*
* Now to bump the refcnt of the [loadable] module that owns this
--
2.39.5
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 45bb63ed20e02ae146336412889fe5450316a84f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024100728-graves-septic-4380@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
45bb63ed20e0 ("nfsd: fix delegation_blocked() to block correctly for at least 30 seconds")
b3f255ef6bff ("nfsd: use ktime_get_seconds() for timestamps")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45bb63ed20e02ae146336412889fe5450316a84f Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb(a)suse.de>
Date: Mon, 9 Sep 2024 15:06:36 +1000
Subject: [PATCH] nfsd: fix delegation_blocked() to block correctly for at
least 30 seconds
The pair of bloom filters used by delegation_blocked() was intended to
block delegations on given filehandles for between 30 and 60 seconds. A
new filehandle would be recorded in the "new" bit set. That would then
be switched to the "old" bit set between 0 and 30 seconds later, and it
would remain in the "old" bit set for 30 seconds.
Unfortunately the code intended to clear the old bit set once it reached
30 seconds old, preparing it to be the next new bit set, instead cleared
the *new* bit set before switching it to be the old bit set. This means
that the "old" bit set is always empty and delegations are blocked
between 0 and 30 seconds.
This patch updates bd->new before clearing the set with that index,
instead of afterwards.
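Reduced to the swap logic (field names as in the blocked_delegations
bookkeeping in fs/nfsd/nfs4state.c), the corrected order is:

	if (ktime_get_seconds() - bd->swap_time > 30) {
		bd->entries -= bd->old_entries;
		bd->old_entries = bd->entries;
		/* Retire the current "new" filter to the "old" role first... */
		bd->new = 1 - bd->new;
		/* ...then clear the set being recycled as the next "new" one. */
		memset(bd->set[bd->new], 0, sizeof(bd->set[0]));
		bd->swap_time = ktime_get_seconds();
	}

Clearing before flipping, as the old code did, wipes the freshly filled
filter and leaves the "old" filter permanently empty.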
Reported-by: Olga Kornievskaia <okorniev(a)redhat.com>
Cc: stable(a)vger.kernel.org
Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
Signed-off-by: NeilBrown <neilb(a)suse.de>
Reviewed-by: Benjamin Coddington <bcodding(a)redhat.com>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index cb5a9ab451c5..ac1859c7cc9d 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
* When a delegation is recalled, the filehandle is stored in the "new"
* filter.
* Every 30 seconds we swap the filters and clear the "new" one,
- * unless both are empty of course.
+ * unless both are empty of course. This results in delegations for a
+ * given filehandle being blocked for between 30 and 60 seconds.
*
* Each filter is 256 bits. We hash the filehandle to 32bit and use the
* low 3 bytes as hash-table indices.
@@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
if (ktime_get_seconds() - bd->swap_time > 30) {
bd->entries -= bd->old_entries;
bd->old_entries = bd->entries;
+ bd->new = 1-bd->new;
memset(bd->set[bd->new], 0,
sizeof(bd->set[0]));
- bd->new = 1-bd->new;
bd->swap_time = ktime_get_seconds();
}
spin_unlock(&blocked_delegations_lock);
commit 45bb63ed20e02ae146336412889fe5450316a84f
The pair of bloom filters used by delegation_blocked() was intended to
block delegations on given filehandles for between 30 and 60 seconds. A
new filehandle would be recorded in the "new" bit set. That would then
be switched to the "old" bit set between 0 and 30 seconds later, and it
would remain in the "old" bit set for 30 seconds.
Unfortunately the code intended to clear the old bit set once it reached
30 seconds old, preparing it to be the next new bit set, instead cleared
the *new* bit set before switching it to be the old bit set. This means
that the "old" bit set is always empty and delegations are blocked
between 0 and 30 seconds.
This patch updates bd->new before clearing the set with that index,
instead of afterwards.
Reported-by: Olga Kornievskaia <okorniev(a)redhat.com>
Cc: stable(a)vger.kernel.org
Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
Signed-off-by: NeilBrown <neilb(a)suse.de>
Reviewed-by: Benjamin Coddington <bcodding(a)redhat.com>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
---
fs/nfsd/nfs4state.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 7ac644d64ab1..d45487d82d44 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -743,7 +743,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
* When a delegation is recalled, the filehandle is stored in the "new"
* filter.
* Every 30 seconds we swap the filters and clear the "new" one,
- * unless both are empty of course.
+ * unless both are empty of course. This results in delegations for a
+ * given filehandle being blocked for between 30 and 60 seconds.
*
* Each filter is 256 bits. We hash the filehandle to 32bit and use the
* low 3 bytes as hash-table indices.
@@ -772,9 +773,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
if (seconds_since_boot() - bd->swap_time > 30) {
bd->entries -= bd->old_entries;
bd->old_entries = bd->entries;
+ bd->new = 1-bd->new;
memset(bd->set[bd->new], 0,
sizeof(bd->set[0]));
- bd->new = 1-bd->new;
bd->swap_time = seconds_since_boot();
}
spin_unlock(&blocked_delegations_lock);
base-commit: de2cffe297563c815c840cfa14b77a0868b61e53
--
2.46.0
We have recently noticed the exact same KASAN splat as in commit
6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket
creation fails"). The problem is that commit did not fully address the
problem, as some pf->create implementations do not use sk_common_release
in their error paths.
For example, we can use the same reproducer as in the above commit, but
changing ping to arping. arping uses AF_PACKET socket and if packet_create
fails, it will just sk_free the allocated sk object.
While we could chase all the pf->create implementations and make sure they
NULL the freed sk object on error from the socket, we can't guarantee
future protocols will not make the same mistake.
So it is easier to just explicitly NULL the sk pointer upon return from
pf->create in __sock_create. We do know that pf->create always releases the
allocated sk object on error, so if the pointer is not NULL, it is
definitely dangling.
Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails")
Signed-off-by: Ignat Korchagin <ignat(a)cloudflare.com>
Cc: stable(a)vger.kernel.org
---
net/socket.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/socket.c b/net/socket.c
index 7b046dd3e9a7..19afac3c2de9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1575,8 +1575,13 @@ int __sock_create(struct net *net, int family, int type, int protocol,
rcu_read_unlock();
err = pf->create(net, sock, protocol, kern);
- if (err < 0)
+ if (err < 0) {
+ /* ->create should release the allocated sock->sk object on error
+ * but it may leave the dangling pointer
+ */
+ sock->sk = NULL;
goto out_module_put;
+ }
/*
* Now to bump the refcnt of the [loadable] module that owns this
--
2.39.5