On Fri, Oct 11, 2024 at 01:55:42PM -0700, Suren Baghdasaryan wrote:
On Fri, Oct 11, 2024 at 11:12 AM Jann Horn <jannh@google.com> wrote:
On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise.
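[For context, a hypothetical userspace usage sketch, not taken from the patch; it assumes headers that define the new MADV_GUARD_POISON/MADV_GUARD_UNPOISON advice values from this series:]

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 16 * 4096;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		/* Any access to the first page should now raise a fatal signal. */
		if (madvise(p, 4096, MADV_GUARD_POISON))
			perror("madvise(MADV_GUARD_POISON)");

		/* Remove the guard again; the page is usable afterwards. */
		if (madvise(p, 4096, MADV_GUARD_UNPOISON))
			perror("madvise(MADV_GUARD_UNPOISON)");

		return 0;
	}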
[...]
 arch/alpha/include/uapi/asm/mman.h     | 3 +
 arch/mips/include/uapi/asm/mman.h      | 3 +
 arch/parisc/include/uapi/asm/mman.h    | 3 +
 arch/xtensa/include/uapi/asm/mman.h    | 3 +
 include/uapi/asm-generic/mman-common.h | 3 +
I kinda wonder if we could start moving the parts of those headers that are the same for all architectures to include/uapi/linux/mman.h instead... but that's maybe out of scope for this series.
[...]
diff --git a/mm/madvise.c b/mm/madvise.c
index e871a72a6c32..7216e10723ae 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+	case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */
What does poisoning need a write lock for? anon_vma_prepare() doesn't need it (it only needs mmap_lock held for reading), zap_page_range_single() doesn't need it, and pagewalk also doesn't need it as long as the range being walked is covered by a VMA, which it is...
I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment saying "We might need to install an anon_vma" - is that referring to an older version of the patch where the anon_vma_prepare() call was inside the pagewalk callback or something like that? Either way, anon_vma_prepare() doesn't need write locks (it can't, it has to work from the page fault handling path).
I was wondering about that too and I can't find any reason for write-locking the mm for this operation. PGWALK_WRLOCK should also be changed to PGWALK_RDLOCK as we are not modifying the VMA.
Indeed, as I said to Jann you're right and I was in error to use this, will change!
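For concreteness, the agreed change would presumably look something along these lines (illustrative sketch only, not the final patch; existing cases and the walk callbacks elided):

	static int madvise_need_mmap_write(int behavior)
	{
		switch (behavior) {
		/* ... existing read-locked cases ... */
		case MADV_GUARD_POISON:
		case MADV_GUARD_UNPOISON:
			return 0;
		default:
			/* be safe, default to 1. list exceptions explicitly */
			return 1;
		}
	}

	static const struct mm_walk_ops guard_poison_walk_ops = {
		/* ... callbacks as in the posted patch ... */
		.walk_lock	= PGWALK_RDLOCK,	/* was PGWALK_WRLOCK */
	};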
BTW, I'm testing your patchset on Android and so far it is stable!
Thanks!
As there is no significant conceptual pushback to this series, I will un-RFC and post a version with fixes for the issues Jann raised, as well as a fix for some xtensa et al. issues with header includes.
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
[...]
+static long madvise_guard_poison(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	long err;
+	bool retried = false;
+
+	*prev = vma;
+	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
+		return -EINVAL;
+
+	/*
+	 * Optimistically try to install the guard poison pages first. If any
+	 * non-guard pages are encountered, give up and zap the range before
+	 * trying again.
+	 */
+	while (true) {
+		unsigned long num_installed = 0;
+
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_poison_walk_ops,
+					 &num_installed);
+		/*
+		 * If we install poison markers, then the range is no longer
+		 * empty from a page table perspective and therefore it's
+		 * appropriate to have an anon_vma.
+		 *
+		 * This ensures that on fork, we copy page tables correctly.
+		 */
+		if (err >= 0 && num_installed > 0) {
+			int err_anon = anon_vma_prepare(vma);
I'd move this up, to before we create poison PTEs. There's no harm in attaching an anon_vma to the VMA even if the rest of the operation fails; and I think it would be weird to have error paths that don't attach an anon_vma even though they did install some poison markers. (A rough combined sketch follows the quoted function below.)
+			if (err_anon)
+				err = err_anon;
+		}
+
+		if (err <= 0)
+			return err;
+
+		if (!retried)
+			/*
+			 * OK some of the range have non-guard pages mapped, zap
+			 * them. This leaves existing guard pages in place.
+			 */
+			zap_page_range_single(vma, start, end - start, NULL);
+		else
+			/*
+			 * If we reach here, then there is a racing fault that
+			 * has populated the PTE after we zapped. Give up and
+			 * let the user know to try again.
+			 */
+			return -EAGAIN;
Hmm, yeah, it would be nice if we could avoid telling userspace to loop on -EAGAIN but I guess we don't have any particularly good options here? Well, we could bail out with -EINTR if a (fatal?) signal is pending and otherwise keep looping... if we'd tell userspace "try again on -EAGAIN", we might as well do that in the kernel...
(Personally I would put curly braces around these branches because they occupy multiple lines, though the coding style doesn't explicitly say that, so I guess maybe it's a matter of personal preference... adding curly braces here would match what is done, for example, in relocate_vma_down().)
+
+		retried = true;
+	}
+}
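Taking both comments above together, a purely illustrative sketch of the reworked function might look like this (it assumes the walk callback keeps its installed-marker counter as posted; not the final patch):

	static long madvise_guard_poison(struct vm_area_struct *vma,
					 struct vm_area_struct **prev,
					 unsigned long start, unsigned long end)
	{
		long err;

		*prev = vma;
		if (!is_valid_guard_vma(vma, /* allow_locked = */false))
			return -EINVAL;

		/*
		 * Installing markers makes the range non-empty from a page
		 * table perspective, so attach an anon_vma up front; this only
		 * needs mmap_lock held for read and is harmless if we fail
		 * later on.
		 */
		err = anon_vma_prepare(vma);
		if (err)
			return err;

		while (true) {
			unsigned long num_installed = 0;

			/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
			err = walk_page_range_mm(vma->vm_mm, start, end,
						 &guard_poison_walk_ops,
						 &num_installed);
			if (err <= 0)
				return err;

			/*
			 * Part of the range has non-guard pages mapped
			 * (possibly due to a racing fault): zap them, leaving
			 * existing guard markers in place, and try again
			 * unless a fatal signal is pending.
			 */
			if (fatal_signal_pending(current))
				return -EINTR;

			zap_page_range_single(vma, start, end - start, NULL);
			cond_resched();
		}
	}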
+static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
+				    unsigned long next, struct mm_walk *walk)
+{
+	pte_t ptent = ptep_get(pte);
+
+	if (is_guard_pte_marker(ptent)) {
+		/* Simply clear the PTE marker. */
+		pte_clear_not_present_full(walk->mm, addr, pte, true);
I think that last parameter probably should be "false"? The sparc code calls it "fullmm", which is a term the MM code uses when talking about operations that remove all mappings in the entire mm_struct because the process has died, which allows using some faster special-case version of TLB shootdown or something along those lines.
+		update_mmu_cache(walk->vma, addr, pte);
+	}
+
+	return 0;
+}
+static const struct mm_walk_ops guard_unpoison_walk_ops = {
+	.pte_entry		= guard_unpoison_pte_entry,
+	.walk_lock		= PGWALK_RDLOCK,
+};
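For reference, the corrected call implied by the fullmm remark above would simply be (only the final argument changes):

		/* This is not a full-mm teardown, so pass full == false. */
		pte_clear_not_present_full(walk->mm, addr, pte, false);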
It is a _little_ weird that unpoisoning creates page tables when they don't already exist, which will also prevent creating THP entries on fault in such areas afterwards... but I guess it doesn't really matter given that poisoning has that effect, too, and you probably usually won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned before... so I guess this is not an actionable comment.