On Fri, Oct 11, 2024 at 01:55:42PM -0700, Suren Baghdasaryan wrote:
On Fri, Oct 11, 2024 at 11:12 AM Jann Horn <jannh@google.com> wrote:
On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise.
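[For context, a hypothetical userspace usage sketch, not taken from the patch; it assumes headers that define the new MADV_GUARD_POISON/MADV_GUARD_UNPOISON advice values from this series:]

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 16 * 4096;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		/* Any access to the first page should now raise a fatal signal. */
		if (madvise(p, 4096, MADV_GUARD_POISON))
			perror("madvise(MADV_GUARD_POISON)");

		/* Remove the guard again; the page is usable afterwards. */
		if (madvise(p, 4096, MADV_GUARD_UNPOISON))
			perror("madvise(MADV_GUARD_UNPOISON)");

		return 0;
	}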
[...]
 arch/alpha/include/uapi/asm/mman.h     | 3 +
 arch/mips/include/uapi/asm/mman.h      | 3 +
 arch/parisc/include/uapi/asm/mman.h    | 3 +
 arch/xtensa/include/uapi/asm/mman.h    | 3 +
 include/uapi/asm-generic/mman-common.h | 3 +
I kinda wonder if we could start moving the parts of those headers that are the same for all architectures to include/uapi/linux/mman.h instead... but that's maybe out of scope for this series.
[...]
diff --git a/mm/madvise.c b/mm/madvise.c
index e871a72a6c32..7216e10723ae 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+	case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */
What does poisoning need a write lock for? anon_vma_prepare() doesn't need it (it only needs mmap_lock held for reading), zap_page_range_single() doesn't need it, and pagewalk also doesn't need it as long as the range being walked is covered by a VMA, which it is...
I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment saying "We might need to install an anon_vma" - is that referring to an older version of the patch where the anon_vma_prepare() call was inside the pagewalk callback or something like that? Either way, anon_vma_prepare() doesn't need write locks (it can't, it has to work from the page fault handling path).
I was wondering about that too and I can't find any reason for write-locking the mm for this operation. PGWALK_WRLOCK should also be changed to PGWALK_RDLOCK as we are not modifying the VMA.
Indeed, as I said to Jann you're right and I was in error to use this, will change!
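For concreteness, the agreed change would presumably look something along these lines (illustrative sketch only, not the final patch; existing cases and the walk callbacks elided):

	static int madvise_need_mmap_write(int behavior)
	{
		switch (behavior) {
		/* ... existing read-locked cases ... */
		case MADV_GUARD_POISON:
		case MADV_GUARD_UNPOISON:
			return 0;
		default:
			/* be safe, default to 1. list exceptions explicitly */
			return 1;
		}
	}

	static const struct mm_walk_ops guard_poison_walk_ops = {
		/* ... callbacks as in the posted patch ... */
		.walk_lock	= PGWALK_RDLOCK,	/* was PGWALK_WRLOCK */
	};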
BTW, I'm testing your patchset on Android and so far it is stable!
Thanks!
As there is no significant conceptual pushback to this series, I will un-RFC and post a version with fixes for the issues Jann raised, as well as a fix for some xtensa et al. issues with header includes.
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
[...]
+static long madvise_guard_poison(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	long err;
+	bool retried = false;
+
+	*prev = vma;
+	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
+		return -EINVAL;
+
+	/*
+	 * Optimistically try to install the guard poison pages first. If any
+	 * non-guard pages are encountered, give up and zap the range before
+	 * trying again.
+	 */
+	while (true) {
+		unsigned long num_installed = 0;
+
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_poison_walk_ops,
+					 &num_installed);
+		/*
+		 * If we install poison markers, then the range is no longer
+		 * empty from a page table perspective and therefore it's
+		 * appropriate to have an anon_vma.
+		 *
+		 * This ensures that on fork, we copy page tables correctly.
+		 */
+		if (err >= 0 && num_installed > 0) {
+			int err_anon = anon_vma_prepare(vma);
I'd move this up, to before we create poison PTEs. There's no harm in attaching an anon_vma to the VMA even if the rest of the operation fails; and I think it would be weird to have error paths that don't attach an anon_vma even though they did install some poison markers. (A rough combined sketch follows the quoted function below.)
+			if (err_anon)
+				err = err_anon;
+		}
+
+		if (err <= 0)
+			return err;
+
+		if (!retried)
+			/*
+			 * OK some of the range have non-guard pages mapped, zap
+			 * them. This leaves existing guard pages in place.
+			 */
+			zap_page_range_single(vma, start, end - start, NULL);
+		else
+			/*
+			 * If we reach here, then there is a racing fault that
+			 * has populated the PTE after we zapped. Give up and
+			 * let the user know to try again.
+			 */
+			return -EAGAIN;
Hmm, yeah, it would be nice if we could avoid telling userspace to loop on -EAGAIN but I guess we don't have any particularly good options here? Well, we could bail out with -EINTR if a (fatal?) signal is pending and otherwise keep looping... if we'd tell userspace "try again on -EAGAIN", we might as well do that in the kernel...
(Personally I would put curly braces around these branches because they occupy multiple lines, though the coding style doesn't explicitly say that, so I guess maybe it's a matter of personal preference... adding curly braces here would match what is done, for example, in relocate_vma_down().)
+
+		retried = true;
+	}
+}
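Taking both comments above together, a purely illustrative sketch of the reworked function might look like this (it assumes the walk callback keeps its installed-marker counter as posted; not the final patch):

	static long madvise_guard_poison(struct vm_area_struct *vma,
					 struct vm_area_struct **prev,
					 unsigned long start, unsigned long end)
	{
		long err;

		*prev = vma;
		if (!is_valid_guard_vma(vma, /* allow_locked = */false))
			return -EINVAL;

		/*
		 * Installing markers makes the range non-empty from a page
		 * table perspective, so attach an anon_vma up front; this only
		 * needs mmap_lock held for read and is harmless if we fail
		 * later on.
		 */
		err = anon_vma_prepare(vma);
		if (err)
			return err;

		while (true) {
			unsigned long num_installed = 0;

			/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
			err = walk_page_range_mm(vma->vm_mm, start, end,
						 &guard_poison_walk_ops,
						 &num_installed);
			if (err <= 0)
				return err;

			/*
			 * Part of the range has non-guard pages mapped
			 * (possibly due to a racing fault): zap them, leaving
			 * existing guard markers in place, and try again
			 * unless a fatal signal is pending.
			 */
			if (fatal_signal_pending(current))
				return -EINTR;

			zap_page_range_single(vma, start, end - start, NULL);
			cond_resched();
		}
	}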
+static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
+				    unsigned long next, struct mm_walk *walk)
+{
+	pte_t ptent = ptep_get(pte);
+
+	if (is_guard_pte_marker(ptent)) {
+		/* Simply clear the PTE marker. */
+		pte_clear_not_present_full(walk->mm, addr, pte, true);
I think that last parameter probably should be "false"? The sparc code calls it "fullmm", which is a term the MM code uses when talking about operations that remove all mappings in the entire mm_struct because the process has died, which allows using some faster special-case version of TLB shootdown or something along those lines.
+		update_mmu_cache(walk->vma, addr, pte);
+	}
+
+	return 0;
+}
+static const struct mm_walk_ops guard_unpoison_walk_ops = {
+	.pte_entry		= guard_unpoison_pte_entry,
+	.walk_lock		= PGWALK_RDLOCK,
+};
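For reference, the corrected call implied by the fullmm remark above would simply be (only the final argument changes):

		/* This is not a full-mm teardown, so pass full == false. */
		pte_clear_not_present_full(walk->mm, addr, pte, false);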
It is a _little_ weird that unpoisoning creates page tables when they don't already exist, which will also prevent creating THP entries on fault in such areas afterwards... but I guess it doesn't really matter given that poisoning has that effect, too, and you probably usually won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned before... so I guess this is not an actionable comment.