Userland libraries such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process.
The current means of implementing these is a PROT_NONE mmap() mapping, which provides the required semantics but incurs the overhead of a VMA for each such region.
With a great many processes and threads, this rapidly adds up to a significant memory penalty. It also prevents VMA merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlastimil Babka (and before him David Hildenbrand and Jann Horn - perhaps others too; the provenance becomes a little tricky to ascertain, so please forgive any omissions!) - rather than locating guard pages at the VMA layer, we instead place them in the page tables mapping the required ranges.
Early testing of a prototype of this code suggests a 5x speed up in memory mapping invocations (in conjunction with process_madvise()) and a 13% reduction in VMAs on an entirely idle Android system with unoptimised code.
We expect that with optimisation, and on a loaded system with a larger number of guard pages, this improvement could increase significantly, but in any case these numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a range are guard pages, we have a single VMA spanning the entire range of memory a user is permitted to access, including the ranges which are to be 'guarded'.
After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the guard markers being torn down when those mappings are ultimately unmapped; from the point of view of the containing VMAs, everything works simply as if the memory were never faulted in.
This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning.
Any poisoned region of memory can also be 'unpoisoned', that is, have its poison markers removed.
The mechanism is implemented via two new madvise() behaviours - MADV_GUARD_POISON, which poisons ranges, and MADV_GUARD_UNPOISON, which clears this poisoning.
Poisoning can be performed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the poisoned page table mappings.
There is no concept of 'nested' poisoning; multiple attempts to poison a range will, after the first poisoning, have no effect.
Importantly, unpoisoning has no effect on non-poisoned memory, so a user can safely unpoison a range of memory and clear only the poison page table mappings, leaving the rest intact.
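To make the interface concrete, here is a minimal userland sketch of the intended usage. Note the numeric madvise values below are illustrative placeholders only, included so the example is self-contained - real users would pick up MADV_GUARD_POISON/MADV_GUARD_UNPOISON from the UAPI headers updated later in this series:

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON	102	/* Placeholder value for illustration. */
#define MADV_GUARD_UNPOISON	103	/* Placeholder value for illustration. */
#endif

int main(void)
{
	const long page = sysconf(_SC_PAGESIZE);
	/* A single VMA spans the whole range, guard region included. */
	char *buf = mmap(NULL, 10 * page, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	if (buf == MAP_FAILED)
		return EXIT_FAILURE;

	/* Poison the first page - no additional VMA is created. */
	if (madvise(buf, page, MADV_GUARD_POISON))
		return EXIT_FAILURE;

	buf[page] = 'x';	/* Fine - the rest of the range is usable. */
	/* buf[0] = 'x';	   Would deliver a fatal SIGSEGV. */

	/* Unpoison - the page then simply faults in as zeroes on access. */
	if (madvise(buf, page, MADV_GUARD_UNPOISON))
		return EXIT_FAILURE;

	buf[0] = 'x';		/* Now fine. */
	return EXIT_SUCCESS;
}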
The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it.
We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse).
We ensure that zapping performed by, for instance, MADV_DONTNEED, does not remove guard poison markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics).
It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the poisoning if they wish. We simply implement it in such a way as to provide the least surprising behaviour.
An extensive set of self-tests is provided which ensures behaviour is as expected and additionally self-documents the expected behaviour of poisoned ranges.
Suggested-by: Vlastimil Babka vbabka@suse.cz
Suggested-by: Jann Horn jannh@google.com
Suggested-by: David Hildenbrand david@redhat.com
v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not safe in single line if / else or for / while blocks, however working around this results in checkpatch producing invalid warnings, as reported by Shuah.
* Fixing these macros is out of scope for this series, so compromise and instead rewrite test blocks so as to use multiple lines by separating out a decl in most cases. This has the side effect of, for the most part, making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid checkpatch complaining about this, so we explain it, as reported by Shuah.
* Updated commit message to highlight that we skip tests we lack permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON defines and directly include asm-generic/mman.h to get the platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.
v1
* Un-RFC'd as there appear to be no major objections to the approach, but rather debate on implementation.
* Fixed issue with arches which need mmu_context.h and tlbflush.h header imports in pagewalker logic to be able to use update_mmu_cache(), as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use ops->install_pte and why, as well as adding a check_ops_valid() helper function, as suggested by Christoph.
* Pass false in the full parameter of pte_clear_not_present_full(), as suggested by Jann.
* Stopped erroneously requiring a write lock for the poison operation, as suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be consistent with how this is used elsewhere in the kernel, as suggested by Jann.
* Avoid returning -EAGAIN if we are raced on page faults - just keep looping and duck out if a fatal signal is pending or a conditional reschedule is needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/
RFC https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (5):
  mm: pagewalk: add the ability to install PTEs
  mm: add PTE_MARKER_GUARD PTE marker
  mm: madvise: implement lightweight guard page mechanism
  tools: testing: update tools UAPI header for mman-common.h
  selftests/mm: add self tests for guard page feature
 arch/alpha/include/uapi/asm/mman.h           |    3 +
 arch/mips/include/uapi/asm/mman.h            |    3 +
 arch/parisc/include/uapi/asm/mman.h          |    3 +
 arch/xtensa/include/uapi/asm/mman.h          |    3 +
 include/linux/mm_inline.h                    |    2 +-
 include/linux/pagewalk.h                     |   18 +-
 include/linux/swapops.h                      |   26 +-
 include/uapi/asm-generic/mman-common.h       |    3 +
 mm/hugetlb.c                                 |    3 +
 mm/internal.h                                |    6 +
 mm/madvise.c                                 |  168 +++
 mm/memory.c                                  |   18 +-
 mm/mprotect.c                                |    3 +-
 mm/mseal.c                                   |    1 +
 mm/pagewalk.c                                |  200 ++-
 tools/include/uapi/asm-generic/mman-common.h |    3 +
 tools/testing/selftests/mm/.gitignore        |    1 +
 tools/testing/selftests/mm/Makefile          |    1 +
 tools/testing/selftests/mm/guard-pages.c     | 1228 ++++++++++++++++++
 19 files changed, 1627 insertions(+), 66 deletions(-)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c
-- 2.47.0
The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks.
This is useful for traversing existing page table entries, but precludes the ability to establish new ones.
Existing mechanisms for performing a walk which also installs page table entries if necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere.
Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal.
If a THP huge page is encountered, we make use of existing logic to split it. Then once we reach the PTE level, we invoke the install_pte() callback which provides a PTE entry to install. We do not support hugetlb at this stage.
If this function returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges.
If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range.
If install_pte() is not defined, then there is no functional difference to this patch, so all existing logic will work precisely as it did before.
As we only permit the installation of PTEs where a mapping does not already exist, there is no need for TLB management; however, we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs.
We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h.
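To illustrate the intended usage pattern, here is a rough sketch of what a hypothetical internal mm caller might look like. The demo_* names are invented for the example, and the marker installed (the pre-existing PTE_MARKER_POISONED) is merely a convenient stand-in - it is not what later patches in this series actually do:

/* Hand back the PTE we want installed at each empty entry. */
static int demo_install_pte(unsigned long addr, unsigned long next,
			    pte_t *ptep, struct mm_walk *walk)
{
	*ptep = swp_entry_to_pte(make_pte_marker_entry(PTE_MARKER_POISONED));
	return 0;
}

static const struct mm_walk_ops demo_walk_ops = {
	.install_pte	= demo_install_pte,
	.walk_lock	= PGWALK_RDLOCK,
};

/* Only reachable from mm internals via the mm/internal.h declaration. */
static int demo_populate(struct mm_struct *mm, unsigned long start,
			 unsigned long end)
{
	return walk_page_range_mm(mm, start, end, &demo_walk_ops, NULL);
}

As with walk_page_range(), the caller is expected to hold the mmap lock in at least read mode.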
Reviewed-by: Jann Horn jannh@google.com
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 include/linux/pagewalk.h |  18 +++-
 mm/internal.h            |   6 ++
 mm/pagewalk.c            | 200 ++++++++++++++++++++++++++++-----------
 3 files changed, 165 insertions(+), 59 deletions(-)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index f5eb5a32aeed..9700a29f8afb 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -25,12 +25,15 @@ enum page_walk_lock { * this handler is required to be able to handle * pmd_trans_huge() pmds. They may simply choose to * split_huge_page() instead of handling it explicitly. - * @pte_entry: if set, called for each PTE (lowest-level) entry, - * including empty ones + * @pte_entry: if set, called for each PTE (lowest-level) entry + * including empty ones, except if @install_pte is set. + * If @install_pte is set, @pte_entry is called only for + * existing PTEs. * @pte_hole: if set, called for each hole at all levels, * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. * Any folded depths (where PTRS_PER_P?D is equal to 1) - * are skipped. + * are skipped. If @install_pte is specified, this will + * not trigger for any populated ranges. * @hugetlb_entry: if set, called for each hugetlb entry. This hook * function is called with the vma lock held, in order to * protect against a concurrent freeing of the pte_t* or @@ -51,6 +54,13 @@ enum page_walk_lock { * @pre_vma: if set, called before starting walk on a non-null vma. * @post_vma: if set, called after a walk on a non-null vma, provided * that @pre_vma and the vma walk succeeded. + * @install_pte: if set, missing page table entries are installed and + * thus all levels are always walked in the specified + * range. This callback is then invoked at the PTE level + * (having split any THP pages prior), providing the PTE to + * install. If allocations fail, the walk is aborted. This + * operation is only available for userland memory. Not + * usable for hugetlb ranges. * * p?d_entry callbacks are called even if those levels are folded on a * particular architecture/configuration. @@ -76,6 +86,8 @@ struct mm_walk_ops { int (*pre_vma)(unsigned long start, unsigned long end, struct mm_walk *walk); void (*post_vma)(struct mm_walk *walk); + int (*install_pte)(unsigned long addr, unsigned long next, + pte_t *ptep, struct mm_walk *walk); enum page_walk_lock walk_lock; };
diff --git a/mm/internal.h b/mm/internal.h index 508f7802dd2b..fb1fb0c984e4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -12,6 +12,7 @@ #include <linux/mm.h> #include <linux/mm_inline.h> #include <linux/pagemap.h> +#include <linux/pagewalk.h> #include <linux/rmap.h> #include <linux/swap.h> #include <linux/swapops.h> @@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page) } #endif /* CONFIG_UNACCEPTED_MEMORY */
+/* pagewalk.c */ +int walk_page_range_mm(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + void *private); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 5f9f01532e67..261cd5f2de38 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -3,9 +3,14 @@ #include <linux/highmem.h> #include <linux/sched.h> #include <linux/hugetlb.h> +#include <linux/mmu_context.h> #include <linux/swap.h> #include <linux/swapops.h>
+#include <asm/tlbflush.h> + +#include "internal.h" + /* * We want to know the real level where a entry is located ignoring any * folding of levels which may be happening. For example if p4d is folded then @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr, int err = 0;
for (;;) { - err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); - if (err) - break; + if (ops->install_pte && pte_none(ptep_get(pte))) { + pte_t new_pte; + + err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte, + walk); + if (err) + break; + + set_pte_at(walk->mm, addr, pte, new_pte); + /* Non-present before, so for arches that need it. */ + if (!WARN_ON_ONCE(walk->no_vma)) + update_mmu_cache(walk->vma, addr, pte); + } else { + err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); + if (err) + break; + } if (addr >= end - PAGE_SIZE) break; addr += PAGE_SIZE; @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, again: next = pmd_addr_end(addr, end); if (pmd_none(*pmd)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __pte_alloc(walk->mm, pmd); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; }
walk->action = ACTION_SUBTREE; @@ -116,7 +138,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, */ if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) || walk->action == ACTION_CONTINUE || - !(ops->pte_entry)) + !(ops->pte_entry || ops->install_pte)) continue;
if (walk->vma) @@ -148,11 +170,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, again: next = pud_addr_end(addr, end); if (pud_none(*pud)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __pmd_alloc(walk->mm, pud, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; }
walk->action = ACTION_SUBTREE; @@ -167,7 +192,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) || walk->action == ACTION_CONTINUE || - !(ops->pmd_entry || ops->pte_entry)) + !(ops->pmd_entry || ops->pte_entry || ops->install_pte)) continue;
if (walk->vma) @@ -196,18 +221,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, do { next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(p4d)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __pud_alloc(walk->mm, p4d, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; } if (ops->p4d_entry) { err = ops->p4d_entry(p4d, addr, next, walk); if (err) break; } - if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (ops->pud_entry || ops->pmd_entry || ops->pte_entry || + ops->install_pte) err = walk_pud_range(p4d, addr, next, walk); if (err) break; @@ -231,18 +260,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __p4d_alloc(walk->mm, pgd, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, 0, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; } if (ops->pgd_entry) { err = ops->pgd_entry(pgd, addr, next, walk); if (err) break; } - if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || + ops->pte_entry || ops->install_pte) err = walk_p4d_range(pgd, addr, next, walk); if (err) break; @@ -334,6 +367,11 @@ static int __walk_page_range(unsigned long start, unsigned long end, int err = 0; struct vm_area_struct *vma = walk->vma; const struct mm_walk_ops *ops = walk->ops; + bool is_hugetlb = is_vm_hugetlb_page(vma); + + /* We do not support hugetlb PTE installation. */ + if (ops->install_pte && is_hugetlb) + return -EINVAL;
if (ops->pre_vma) { err = ops->pre_vma(start, end, walk); @@ -341,7 +379,7 @@ static int __walk_page_range(unsigned long start, unsigned long end, return err; }
- if (is_vm_hugetlb_page(vma)) { + if (is_hugetlb) { if (ops->hugetlb_entry) err = walk_hugetlb_range(start, end, walk); } else @@ -380,47 +418,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma, #endif }
-/** - * walk_page_range - walk page table with caller specific callbacks - * @mm: mm_struct representing the target process of page table walk - * @start: start address of the virtual address range - * @end: end address of the virtual address range - * @ops: operation to call during the walk - * @private: private data for callbacks' usage - * - * Recursively walk the page table tree of the process represented by @mm - * within the virtual address range [@start, @end). During walking, we can do - * some caller-specific works for each entry, by setting up pmd_entry(), - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these - * callbacks, the associated entries/pages are just ignored. - * The return values of these callbacks are commonly defined like below: - * - * - 0 : succeeded to handle the current entry, and if you don't reach the - * end address yet, continue to walk. - * - >0 : succeeded to handle the current entry, and return to the caller - * with caller specific value. - * - <0 : failed to handle the current entry, and return to the caller - * with error code. - * - * Before starting to walk page table, some callers want to check whether - * they really want to walk over the current vma, typically by checking - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this - * purpose. - * - * If operations need to be staged before and committed after a vma is walked, - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(), - * since it is intended to handle commit-type operations, can't return any - * errors. - * - * struct mm_walk keeps current values of some common data like vma and pmd, - * which are useful for the access from callbacks. If you want to pass some - * caller-specific data to callbacks, @private should be helpful. +/* + * See the comment for walk_page_range(), this performs the heavy lifting of the + * operation, only sets no restrictions on how the walk proceeds. * - * Locking: - * Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock, - * because these function traverse vma list and/or access to vma's data. + * We usually restrict the ability to install PTEs, but this functionality is + * available to internal memory management code and provided in mm/internal.h. */ -int walk_page_range(struct mm_struct *mm, unsigned long start, +int walk_page_range_mm(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private) { @@ -479,6 +484,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, return err; }
+/* + * Determine if the walk operations specified are permitted to be used for a + * page table walk. + * + * This check is performed on all functions which are parameterised by walk + * operations and exposed in include/linux/pagewalk.h. + * + * Internal memory management code can use the walk_page_range_mm() function to + * be able to use all page walking operations. + */ +static bool check_ops_valid(const struct mm_walk_ops *ops) +{ + /* + * The installation of PTEs is solely under the control of memory + * management logic and subject to many subtle locking, security and + * cache considerations so we cannot permit other users to do so, and + * certainly not for exported symbols. + */ + if (ops->install_pte) + return false; + + return true; +} + +/** + * walk_page_range - walk page table with caller specific callbacks + * @mm: mm_struct representing the target process of page table walk + * @start: start address of the virtual address range + * @end: end address of the virtual address range + * @ops: operation to call during the walk + * @private: private data for callbacks' usage + * + * Recursively walk the page table tree of the process represented by @mm + * within the virtual address range [@start, @end). During walking, we can do + * some caller-specific works for each entry, by setting up pmd_entry(), + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these + * callbacks, the associated entries/pages are just ignored. + * The return values of these callbacks are commonly defined like below: + * + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. + * + * Before starting to walk page table, some callers want to check whether + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this + * purpose. + * + * If operations need to be staged before and committed after a vma is walked, + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(), + * since it is intended to handle commit-type operations, can't return any + * errors. + * + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @private should be helpful. + * + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock, + * because these function traverse vma list and/or access to vma's data. + */ +int walk_page_range(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + void *private) +{ + if (!check_ops_valid(ops)) + return -EINVAL; + + return walk_page_range_mm(mm, start, end, ops, private); +} + /** * walk_page_range_novma - walk a range of pagetables not backed by a vma * @mm: mm_struct representing the target process of page table walk @@ -494,7 +573,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, * walking the kernel pages tables or page tables for firmware. 
* * Note: Be careful to walk the kernel pages tables, the caller may be need to - * take other effective approache (mmap lock may be insufficient) to prevent + * take other effective approaches (mmap lock may be insufficient) to prevent * the intermediate kernel page tables belonging to the specified address range * from being freed (e.g. memory hot-remove). */ @@ -513,6 +592,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
if (start >= end || !walk.mm) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL;
/* * 1) For walking the user virtual address space: @@ -556,6 +637,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start, return -EINVAL; if (start < vma->vm_start || end > vma->vm_end) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock); process_vma_walk_lock(vma, ops->walk_lock); @@ -574,6 +657,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
if (!walk.mm) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock); process_vma_walk_lock(vma, ops->walk_lock); @@ -623,6 +708,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, unsigned long start_addr, end_addr; int err = 0;
+ if (!check_ops_valid(ops)) + return -EINVAL; + lockdep_assert_held(&mapping->i_mmap_rwsem); vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, first_index + nr - 1) {
On 10/20/24 18:20, Lorenzo Stoakes wrote:
The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks.
This is useful for traversing existing page table entries, but precludes the ability to establish new ones.
Existing mechanisms for performing a walk which also installs page table entries if necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere.
Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal.
If a THP huge page is encountered, we make use of existing logic to split it. Then once we reach the PTE level, we invoke the install_pte() callback which provides a PTE entry to install. We do not support hugetlb at this stage.
If this function returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges.
If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range.
If install_pte() is not defined, then there is no functional difference to this patch, so all existing logic will work precisely as it did before.
As we only permit the installation of PTEs where a mapping does not already exist there is no need for TLB management, however we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs.
We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h.
Reviewed-by: Jann Horn jannh@google.com Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
<snip>
/*
- We want to know the real level where a entry is located ignoring any
- folding of levels which may be happening. For example if p4d is folded then
@@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr, int err = 0; for (;;) {
err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
break;
if (ops->install_pte && pte_none(ptep_get(pte))) {
pte_t new_pte;
err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
walk);
if (err)
break;
set_pte_at(walk->mm, addr, pte, new_pte);
While the guard pages install ptes unconditionally, maybe some install_pte handler implementation would sometimes want to skip; should we define an error code that means it's skipped and just continue instead of set_pte_at()? Or leave it until such a handler appears.
/* Non-present before, so for arches that need it. */
if (!WARN_ON_ONCE(walk->no_vma))
update_mmu_cache(walk->vma, addr, pte);
} else {
err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
break;
}
if (addr >= end - PAGE_SIZE)
break;
addr += PAGE_SIZE;
@@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, again: next = pmd_addr_end(addr, end); if (pmd_none(*pmd)) {
if (ops->pte_hole)
if (ops->install_pte)
err = __pte_alloc(walk->mm, pmd);
else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break;
continue;
if (!ops->install_pte)
continue;
}
walk->action = ACTION_SUBTREE; @@ -116,7 +138,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, */ if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) || walk->action == ACTION_CONTINUE ||
!(ops->pte_entry))
!(ops->pte_entry || ops->install_pte)) continue;
BTW, I find it hard to read this condition even before your patch, oh well. But if I read it correctly, does it mean we're going to split a pmd-mapped THP if we have an install_pte handler? But is that really necessary if the pmd splitting results in all ptes populated, and thus the install_pte handler can't do anything with any pte anyway?
if (walk->vma)
@@ -148,11 +170,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, again: next = pud_addr_end(addr, end); if (pud_none(*pud)) {
if (ops->pte_hole)
if (ops->install_pte)
err = __pmd_alloc(walk->mm, pud, addr);
else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break;
continue;
if (!ops->install_pte)
continue;
}
walk->action = ACTION_SUBTREE; @@ -167,7 +192,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) || walk->action == ACTION_CONTINUE ||
!(ops->pmd_entry || ops->pte_entry))
!(ops->pmd_entry || ops->pte_entry || ops->install_pte)) continue;
Ditto?
On Mon, Oct 21, 2024 at 03:27:55PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks.
This is useful for traversing existing page table entries, but precludes the ability to establish new ones.
Existing mechanisms for performing a walk which also installs page table entries if necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere.
Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal.
If a THP huge page is encountered, we make use of existing logic to split it. Then once we reach the PTE level, we invoke the install_pte() callback which provides a PTE entry to install. We do not support hugetlb at this stage.
If this function returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges.
If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range.
If install_pte() is not defined, then there is no functional difference to this patch, so all existing logic will work precisely as it did before.
As we only permit the installation of PTEs where a mapping does not already exist there is no need for TLB management, however we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs.
We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h.
Reviewed-by: Jann Horn jannh@google.com Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
<snip>
/*
- We want to know the real level where a entry is located ignoring any
- folding of levels which may be happening. For example if p4d is folded then
@@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr, int err = 0;
for (;;) {
err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
break;
if (ops->install_pte && pte_none(ptep_get(pte))) {
pte_t new_pte;
err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
walk);
if (err)
break;
set_pte_at(walk->mm, addr, pte, new_pte);
While the guard pages install ptes unconditionally, maybe some install_pte handler implementation would sometimes want to skip; should we define an error code that means it's skipped and just continue instead of set_pte_at()? Or leave it until such a handler appears.
I'm not sure under what circumstances you'd want to skip though precisely? There's nothing populated, and the user is defining the range in which to install a PTE if nothing is populated.
If they wanted more complicated handling they could do multiple, smaller calls. Things are inherently racey with these walks so there's no benefit in doing everything at once.
/* Non-present before, so for arches that need it. */
if (!WARN_ON_ONCE(walk->no_vma))
update_mmu_cache(walk->vma, addr, pte);
} else {
err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
break;
}
if (addr >= end - PAGE_SIZE)
break;
addr += PAGE_SIZE;
@@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, again: next = pmd_addr_end(addr, end); if (pmd_none(*pmd)) {
if (ops->pte_hole)
if (ops->install_pte)
err = __pte_alloc(walk->mm, pmd);
else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break;
continue;
if (!ops->install_pte)
continue;
}
walk->action = ACTION_SUBTREE;
@@ -116,7 +138,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, */ if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) || walk->action == ACTION_CONTINUE ||
!(ops->pte_entry))
!(ops->pte_entry || ops->install_pte)) continue;
BTW, I find it hard to read this condition even before your patch, oh well.
Agreed, this badly needs refactoring, but felt out of scope for this change.
But if I read it correctly, does it mean we're going to split a pmd-mapped THP if we have an install_pte handler? But is that really necessary if the pmd splitting results in all ptes populated, and thus the install_pte handler can't do anything with any pte anyway?
Yes... however nothing else here in the logic has special handling for transhuge pages AND there is already an interface provided to prevent this if you want, which we use in commit 3/5, that is to provide pud, pmd walkers that set walk->action = ACTION_CONTINUE if transhuge.
Having said that, it kind of sucks that we are doing a useless split here. Hmm. In the pte_entry() case you might very well want to split and do something with the PTE. With the install you are only interested if it's non-present...
It's not _completely_ infeasible that a user would want this, but it's very unlikely.
OK so yeah let's add it and clean up this expression while we're at it, will fix on respin.
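i.e. something along these lines (a quick sketch with invented names - the real callbacks this series uses live in patch 3/5):

/*
 * Skip huge entries rather than splitting them - a present leaf entry
 * has no empty PTEs beneath it for install_pte to populate anyway.
 */
static int demo_pud_entry(pud_t *pud, unsigned long addr,
			  unsigned long next, struct mm_walk *walk)
{
	if (pud_leaf(pudp_get(pud)))
		walk->action = ACTION_CONTINUE;

	return 0;
}

static int demo_pmd_entry(pmd_t *pmd, unsigned long addr,
			  unsigned long next, struct mm_walk *walk)
{
	if (pmd_leaf(pmdp_get(pmd)))
		walk->action = ACTION_CONTINUE;

	return 0;
}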
if (walk->vma)
@@ -148,11 +170,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, again: next = pud_addr_end(addr, end); if (pud_none(*pud)) {
if (ops->pte_hole)
if (ops->install_pte)
err = __pmd_alloc(walk->mm, pud, addr);
else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break;
continue;
if (!ops->install_pte)
continue;
}
walk->action = ACTION_SUBTREE;
@@ -167,7 +192,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) || walk->action == ACTION_CONTINUE ||
!(ops->pmd_entry || ops->pte_entry))
!(ops->pmd_entry || ops->pte_entry || ops->install_pte)) continue;
Ditto?
Add a new PTE marker that results in any access causing the accessing process to segfault.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
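For context, a simplified sketch of how the new helper is expected to be consumed by the install_pte() hook introduced in the previous patch (illustrative only - the function name here is invented and the actual madvise() user arrives in the next patch):

/* Install a guard marker at each empty PTE in the walked range. */
static int guard_poison_install_pte(unsigned long addr, unsigned long next,
				    pte_t *ptep, struct mm_walk *walk)
{
	*ptep = swp_entry_to_pte(make_guard_swp_entry());
	return 0;
}

On fault, such an entry is handled by handle_pte_marker() below and results in VM_FAULT_SIGSEGV rather than VM_FAULT_HWPOISON.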
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 include/linux/mm_inline.h |  2 +-
 include/linux/swapops.h   | 26 ++++++++++++++++++++++++--
 mm/hugetlb.c              |  3 +++
 mm/memory.c               | 18 +++++++++++++++---
 4 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 355cf46a01a6..1b6a917fffa4 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -544,7 +544,7 @@ static inline pte_marker copy_pte_marker( { pte_marker srcm = pte_marker_get(entry); /* Always copy error entries. */ - pte_marker dstm = srcm & PTE_MARKER_POISONED; + pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD);
/* Only copy PTE markers if UFFD register matches. */ if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma)) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index cb468e418ea1..4d0606df0791 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -426,9 +426,15 @@ typedef unsigned long pte_marker; * "Poisoned" here is meant in the very general sense of "future accesses are * invalid", instead of referring very specifically to hardware memory errors. * This marker is meant to represent any of various different causes of this. + * + * Note that, when encountered by the faulting logic, PTEs with this marker will + * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error + * logic. */ #define PTE_MARKER_POISONED BIT(1) -#define PTE_MARKER_MASK (BIT(2) - 1) +/* Indicates that, on fault, this PTE will cause a SIGSEGV signal to be sent. */ +#define PTE_MARKER_GUARD BIT(2) +#define PTE_MARKER_MASK (BIT(3) - 1)
static inline swp_entry_t make_pte_marker_entry(pte_marker marker) { @@ -461,9 +467,25 @@ static inline swp_entry_t make_poisoned_swp_entry(void) }
static inline int is_poisoned_swp_entry(swp_entry_t entry) +{ + /* + * We treat guard pages as poisoned too as these have the same semantics + * as poisoned ranges, only with different fault handling. + */ + return is_pte_marker_entry(entry) && + (pte_marker_get(entry) & + (PTE_MARKER_POISONED | PTE_MARKER_GUARD)); +} + +static inline swp_entry_t make_guard_swp_entry(void) +{ + return make_pte_marker_entry(PTE_MARKER_GUARD); +} + +static inline int is_guard_swp_entry(swp_entry_t entry) { return is_pte_marker_entry(entry) && - (pte_marker_get(entry) & PTE_MARKER_POISONED); + (pte_marker_get(entry) & PTE_MARKER_GUARD); }
/* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 906294ac85dc..50e3f6ed73ac 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex; + } else if (marker & PTE_MARKER_GUARD) { + ret = VM_FAULT_SIGSEGV; + goto out_mutex; } }
diff --git a/mm/memory.c b/mm/memory.c index 0f614523b9f4..551455cd453f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details, return !folio_test_anon(folio); }
-static inline bool zap_drop_file_uffd_wp(struct zap_details *details) +static inline bool zap_drop_markers(struct zap_details *details) { if (!details) return false; @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (vma_is_anonymous(vma)) return;
- if (zap_drop_file_uffd_wp(details)) + if (zap_drop_markers(details)) return;
for (;;) { @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * drop the marker if explicitly requested. */ if (!vma_is_anonymous(vma) && - !zap_drop_file_uffd_wp(details)) + !zap_drop_markers(details)) + continue; + } else if (is_guard_swp_entry(entry)) { + /* + * Ordinary zapping should not remove guard PTE + * markers. Only do so if we should remove PTE markers + * in general. + */ + if (!zap_drop_markers(details)) continue; } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { @@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (marker & PTE_MARKER_POISONED) return VM_FAULT_HWPOISON;
+ /* Hitting a guard page is always a fatal condition. */ + if (marker & PTE_MARKER_GUARD) + return VM_FAULT_SIGSEGV; + if (pte_marker_entry_uffd_wp(entry)) return pte_marker_handle_uffd_wp(vmf);
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
Should we distinguish it from other segfaults? Is there a way? I can see memory protection keys use SEGV_PKUERR, but no idea if we have any free values.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
A nit below:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 906294ac85dc..50e3f6ed73ac 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex;
} else if (marker & PTE_MARKER_GUARD) {
ret = VM_FAULT_SIGSEGV;
goto out_mutex;
Given we don't support hugetlb, should we WARN_ON_ONCE() if such unexpected marker appears there?
} }
diff --git a/mm/memory.c b/mm/memory.c index 0f614523b9f4..551455cd453f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details, return !folio_test_anon(folio); } -static inline bool zap_drop_file_uffd_wp(struct zap_details *details) +static inline bool zap_drop_markers(struct zap_details *details) { if (!details) return false; @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (vma_is_anonymous(vma)) return;
- if (zap_drop_file_uffd_wp(details))
- if (zap_drop_markers(details)) return;
for (;;) { @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * drop the marker if explicitly requested. */ if (!vma_is_anonymous(vma) &&
!zap_drop_file_uffd_wp(details))
!zap_drop_markers(details))
continue;
} else if (is_guard_swp_entry(entry)) {
/*
* Ordinary zapping should not remove guard PTE
* markers. Only do so if we should remove PTE markers
* in general.
*/
if (!zap_drop_markers(details))
continue;
} else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) {
@@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (marker & PTE_MARKER_POISONED) return VM_FAULT_HWPOISON;
- /* Hitting a guard page is always a fatal condition. */
- if (marker & PTE_MARKER_GUARD)
return VM_FAULT_SIGSEGV;
- if (pte_marker_entry_uffd_wp(entry)) return pte_marker_handle_uffd_wp(vmf);
On Mon, Oct 21, 2024 at 03:45:31PM +0200, Vlastimil Babka wrote: [snip]
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
Thanks!
A nit below:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 906294ac85dc..50e3f6ed73ac 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex;
} else if (marker & PTE_MARKER_GUARD) {
ret = VM_FAULT_SIGSEGV;
goto out_mutex;
Given we don't support hugetlb, should we WARN_ON_ONCE() if such unexpected marker appears there?
Yes agreed, will add.
[snip]
On Mon, Oct 21, 2024 at 03:45:31PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
Should we distinguish it from other segfaults? Is there a way? I can see memory protection keys use SEGV_PKUERR, but no idea if we have any free values.
Wasn't even aware that existed!!
I'm not sure a process can do anything particularly useful with this information though? Hitting a guard page would indicate a programming error rather than something that might allow meaningful feedback to a user like memory protection keys.
Do you think there's enough value in this to warrant digging in? And indeed I imagine bits are in short supply for this and would need a strong argument to get... so yeah I don't think it's too worthwhile most likely!
Thanks for the suggestion though!
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
A nit below:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 906294ac85dc..50e3f6ed73ac 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex;
} else if (marker & PTE_MARKER_GUARD) {
ret = VM_FAULT_SIGSEGV;
goto out_mutex;
Given we don't support hugetlb, should we WARN_ON_ONCE() if such unexpected marker appears there?
} }
diff --git a/mm/memory.c b/mm/memory.c index 0f614523b9f4..551455cd453f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details, return !folio_test_anon(folio); }
-static inline bool zap_drop_file_uffd_wp(struct zap_details *details) +static inline bool zap_drop_markers(struct zap_details *details) { if (!details) return false; @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (vma_is_anonymous(vma)) return;
- if (zap_drop_file_uffd_wp(details))
if (zap_drop_markers(details)) return;
for (;;) {
@@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * drop the marker if explicitly requested. */ if (!vma_is_anonymous(vma) &&
!zap_drop_file_uffd_wp(details))
!zap_drop_markers(details))
continue;
} else if (is_guard_swp_entry(entry)) {
/*
* Ordinary zapping should not remove guard PTE
* markers. Only do so if we should remove PTE markers
* in general.
*/
if (!zap_drop_markers(details))
continue;
} else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) {
@@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (marker & PTE_MARKER_POISONED) return VM_FAULT_HWPOISON;
- /* Hitting a guard page is always a fatal condition. */
- if (marker & PTE_MARKER_GUARD)
return VM_FAULT_SIGSEGV;
- if (pte_marker_entry_uffd_wp(entry)) return pte_marker_handle_uffd_wp(vmf);
+cc Dave Hansen
On Mon, Oct 21, 2024 at 09:42:53PM +0100, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 03:45:31PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
Should we distinguish it from other segfaults? Is there a way? I can see memory protection keys use SEGV_PKUERR, but no idea if we have any free values.
Wasn't even aware that existed!!
I'm not sure a process can do anything particularly useful with this information though? Hitting a guard page would indicate a programming error rather than something that might allow meaningful feedback to a user like memory protection keys.
Do you think there's enough value in this to warrant digging in? And indeed I imagine bits are in short supply for this and would need a strong argument to get... so yeah I don't think it's too worthwhile most likely!
Thanks for the suggestion though!
To put it on list - Dave Hansen commented on IRC that it would be safer to avoid this for now due to this being an ABI change, and reasonable to perhaps add it later if required, so that seems a sensible way forward.
Thanks!
[snip]
On 10/21/24 14:13, Lorenzo Stoakes wrote:
Do you think there's enough value in this to warrant digging in? And indeed I imagine bits are in short supply for this and would need a strong argument to get... so yeah I don't think it's too worthwhile most likely!
Thanks for the suggestion though!
To put it on list - Dave Hansen commented on IRC that it would be safer to avoid this for now due to this being an ABI change, and reasonable to perhaps add it later if required, so that seems a sensible way forward.
We added SEGV_PKUERR because we really did expect signal handlers to want to do something new and special, including consuming si_pkey. Old signal handlers would probably be pretty confused.
So, yeah, if it's not crystal clear that new signal handler code is needed, then I'd punt on adding a new SEGV_* code for now.
BTW, SEGV_* codes are sequentially assigned. It isn't a bitfield and there are no space constraints. We've only used a dozen or so of them and ->si_code is an int.
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Since I would have personally put MADV_FREE among "or the like" here, it's surprising to me that in fact it's tearing down the guard entries now. Is that intentional? It should be at least mentioned very explicitly. But I'd really argue against it, as MADV_FREE is to me a weaker form of MADV_DONTNEED - the existing pages are not zapped immediately but prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why shouldn't MADV_FREE too?
Seems to me rather currently an artifact of MADV_FREE implementation - if it encounters hwpoison entries it will tear them down because why not, we have detected a hw memory error and are lucky the program wants to discard the pages and not access them, so best use the opportunity and get rid of the PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly could).
But to extend this to guard PTEs which are the result of an explicit userspace action feels wrong, unless the semantics is the same for MADV_DONTNEED. The semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave the same?
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
include/linux/mm_inline.h | 2 +- include/linux/swapops.h | 26 ++++++++++++++++++++++++-- mm/hugetlb.c | 3 +++ mm/memory.c | 18 +++++++++++++++--- 4 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 355cf46a01a6..1b6a917fffa4 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -544,7 +544,7 @@ static inline pte_marker copy_pte_marker( { pte_marker srcm = pte_marker_get(entry); /* Always copy error entries. */
- pte_marker dstm = srcm & PTE_MARKER_POISONED;
- pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD);
/* Only copy PTE markers if UFFD register matches. */ if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma)) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index cb468e418ea1..4d0606df0791 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -426,9 +426,15 @@ typedef unsigned long pte_marker;
- "Poisoned" here is meant in the very general sense of "future accesses are
- invalid", instead of referring very specifically to hardware memory errors.
- This marker is meant to represent any of various different causes of this.
- Note that, when encountered by the faulting logic, PTEs with this marker will
- result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
- logic.
*/
#define PTE_MARKER_POISONED BIT(1) -#define PTE_MARKER_MASK (BIT(2) - 1) +/* Indicates that, on fault, this PTE will cause a SIGSEGV signal to be sent. */ +#define PTE_MARKER_GUARD BIT(2) +#define PTE_MARKER_MASK (BIT(3) - 1) static inline swp_entry_t make_pte_marker_entry(pte_marker marker) { @@ -461,9 +467,25 @@ static inline swp_entry_t make_poisoned_swp_entry(void) } static inline int is_poisoned_swp_entry(swp_entry_t entry) +{
- /*
* We treat guard pages as poisoned too as these have the same semantics
* as poisoned ranges, only with different fault handling.
*/
- return is_pte_marker_entry(entry) &&
(pte_marker_get(entry) &
(PTE_MARKER_POISONED | PTE_MARKER_GUARD));
+}
+static inline swp_entry_t make_guard_swp_entry(void) +{
- return make_pte_marker_entry(PTE_MARKER_GUARD);
+}
+static inline int is_guard_swp_entry(swp_entry_t entry) { return is_pte_marker_entry(entry) &&
(pte_marker_get(entry) & PTE_MARKER_POISONED);
(pte_marker_get(entry) & PTE_MARKER_GUARD);
} /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 906294ac85dc..50e3f6ed73ac 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex;
} else if (marker & PTE_MARKER_GUARD) {
ret = VM_FAULT_SIGSEGV;
goto out_mutex;
}
}
diff --git a/mm/memory.c b/mm/memory.c index 0f614523b9f4..551455cd453f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details, return !folio_test_anon(folio); } -static inline bool zap_drop_file_uffd_wp(struct zap_details *details) +static inline bool zap_drop_markers(struct zap_details *details) { if (!details) return false; @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (vma_is_anonymous(vma)) return;
- if (zap_drop_file_uffd_wp(details))
- if (zap_drop_markers(details)) return;
for (;;) { @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * drop the marker if explicitly requested. */ if (!vma_is_anonymous(vma) &&
!zap_drop_file_uffd_wp(details))
!zap_drop_markers(details))
continue;
} else if (is_guard_swp_entry(entry)) {
/*
* Ordinary zapping should not remove guard PTE
* markers. Only do so if we should remove PTE markers
* in general.
*/
if (!zap_drop_markers(details))
continue;
} else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) {
@@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (marker & PTE_MARKER_POISONED) return VM_FAULT_HWPOISON;
- /* Hitting a guard page is always a fatal condition. */
- if (marker & PTE_MARKER_GUARD)
return VM_FAULT_SIGSEGV;
- if (pte_marker_entry_uffd_wp(entry)) return pte_marker_handle_uffd_wp(vmf);
On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Since I would have personally put MADV_FREE among "or the like" here, it's surprising to me that in fact it's tearing down the guard entries now. Is that intentional? It should be at least mentioned very explicitly. But I'd really argue against it, as MADV_FREE is to me a weaker form of MADV_DONTNEED - the existing pages are not zapped immediately but prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why shouldn't MADV_FREE too?
That is not, as I understand it, what MADV_FREE is, semantically. From the man pages:
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs.
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we don't expect to use it in the near future'.
Seems to me rather currently an artifact of MADV_FREE implementation - if it encounters hwpoison entries it will tear them down because why not, we have detected a hw memory error and are lucky the program wants to discard the pages and not access them, so best use the opportunity and get rid of the PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly could).
Right, but we explicitly do not tear them down in the case of MADV_DONTNEED which matches the description in the manpages that the user _might_ come back to the range, whereas MADV_FREE means they are truly done but just don't want the overhead of actually unmapping at this point.
Seems to me this is more so that MADV_FREE is a not-really-as-efficient version of what Rik wants to do with his MADV_LAZYFREE thing.
But to extend this to guard PTEs which are the result of an explicit userspace action feels wrong, unless the semantics is the same for MADV_DONTNEED. The semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave the same?
My understanding from the above is that MADV_FREE is a softer version of munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a 'revert state to when I first mapped this stuff because I'm done with it for now but might use it later'.
Obviously happy to be corrected if my understanding of this is off-the-mark but this is what motivated me to do it this way!
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
 include/linux/mm_inline.h |  2 +-
 include/linux/swapops.h   | 26 ++++++++++++++++++++++++--
 mm/hugetlb.c              |  3 +++
 mm/memory.c               | 18 +++++++++++++++---
 4 files changed, 43 insertions(+), 6 deletions(-)
On 10/21/24 16:33, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Since I would have personally put MADV_FREE among "or the like" here, it's surprising to me that in fact it's tearing down the guard entries now. Is that intentional? It should be at least mentioned very explicitly. But I'd really argue against it, as MADV_FREE is to me a weaker form of MADV_DONTNEED - the existing pages are not zapped immediately but prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why shouldn't MADV_FREE too?
That is not, as I understand it, what MADV_FREE is, semantically. From the man pages:
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs.
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we don't expect to use it in the near future'.
I think the description gives a wrong impression. What I think matters is what happens (limited to anon private case as MADV_FREE doesn't support any other)
MADV_DONTNEED - pages discarded immediately, further access gives new zero-filled pages
MADV_FREE - pages prioritized for discarding, if that happens before next write, it gets zero-filled page on next access, but a write done soon enough can cancel the upcoming discard.
In that sense, MADV_FREE is a weaker form of DONTNEED, no?
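For reference, a minimal userspace sketch of the behavioural difference being described here - illustration only, using the standard MADV_DONTNEED/MADV_FREE advice values (a glibc recent enough to define MADV_FREE is assumed), and not part of the series:

/* MADV_DONTNEED discards immediately; MADV_FREE only marks pages for reclaim,
 * and a write done soon enough effectively cancels the pending discard. */
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        char *a = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *b = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        a[0] = 'x';
        b[0] = 'y';

        /* Immediate, unconditional discard: the next read faults in a fresh
         * zero-filled page. */
        madvise(a, page, MADV_DONTNEED);
        assert(a[0] == 0);

        /* Lazy discard: the page is only freed if reclaim gets to it before
         * the next write; writing again keeps the data (and without memory
         * pressure nothing is discarded at all). */
        madvise(b, page, MADV_FREE);
        b[0] = 'z';
        assert(b[0] == 'z');

        return 0;
}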
Seems to me rather currently an artifact of MADV_FREE implementation - if it encounters hwpoison entries it will tear them down because why not, we have detected a hw memory error and are lucky the program wants to discard the pages and not access them, so best use the opportunity and get rid of the PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly could).
Right, but we explicitly do not tear them down in the case of MADV_DONTNEED which matches the description in the manpages that the user _might_ come back to the range, whereas MADV_FREE means they are truly done but just don't want the overhead of actually unmapping at this point.
But it's also defined what happens if the user comes back to the range after a MADV_FREE. I think the overhead saved happens in the case of actually coming back soon enough to prevent the discard. With MADV_DONTNEED it's immediate and unconditional.
Seems to me this is more so that MADV_FREE is a not-really-as-efficient version of what Rik wants to do with his MADV_LAZYFREE thing.
I think that further optimizes MADV_FREE, which is already more optimized than MADV_DONTNEED.
But to extend this to guard PTEs which are the result of an explicit userspace action feels wrong, unless the semantics is the same for MADV_DONTNEED. The semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave the same?
My understanding from the above is that MADV_FREE is a softer version of munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a 'revert state to when I first mapped this stuff because I'm done with it for now but might use it later'.
From the implementation I get the opposite understanding. Neither tears down the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately, MADV_FREE effectively too but with a delay depending on memory pressure.
On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote:
On 10/21/24 16:33, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Since I would have personally put MADV_FREE among "or the like" here, it's surprising to me that in fact it's tearing down the guard entries now. Is that intentional? It should be at least mentioned very explicitly. But I'd really argue against it, as MADV_FREE is to me a weaker form of MADV_DONTNEED - the existing pages are not zapped immediately but prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why shouldn't MADV_FREE too?
That is not, as I understand it, what MADV_FREE is, semantically. From the man pages:
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs.
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we don't expect to use it in the near future'.
I think the description gives a wrong impression. What I think matters is what happens (limited to anon private case as MADV_FREE doesn't support any other)
MADV_DONTNEED - pages discarded immediately, further access gives new zero-filled pages
MADV_FREE - pages prioritized for discarding, if that happens before next write, it gets zero-filled page on next access, but a write done soon enough can cancel the upcoming discard.
In that sense, MADV_FREE is a weaker form of DONTNEED, no?
Seems to me rather currently an artifact of MADV_FREE implementation - if it encounters hwpoison entries it will tear them down because why not, we have detected a hw memory error and are lucky the program wants to discard the pages and not access them, so best use the opportunity and get rid of the PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly could).
Right, but we explicitly do not tear them down in the case of MADV_DONTNEED which matches the description in the manpages that the user _might_ come back to the range, whereas MADV_FREE means they are truly done but just don't want the overhead of actually unmapping at this point.
But it's also defined what happens if the user comes back to the range after a MADV_FREE. I think the overhead saved happens in the case of actually coming back soon enough to prevent the discard. With MADV_DONTNEED it's immediate and unconditional.
Seems to me this is more so that MADV_FREE is a not-really-as-efficient version of what Rik wants to do with his MADV_LAZYFREE thing.
I think that further optimizes MADV_FREE, which is already more optimized than MADV_DONTNEED.
But to extend this to guard PTEs which are the result of an explicit userspace action feels wrong, unless the semantics is the same for MADV_DONTNEED. The semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave the same?
My understanding from the above is that MADV_FREE is a softer version of munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a 'revert state to when I first mapped this stuff because I'm done with it for now but might use it later'.
From the implementation I get the opposite understanding. Neither tears down the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately, MADV_FREE effectively too but with a delay depending on memory pressure.
OK so based on IRC chat I think the conclusion here is TL;DR yes we have to change this, you're right :)
To summarise for on-list:
* MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there.
* For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
* However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory.
So, while I implemented this based on a. the semantics as apparently expressed in the man page and b. existing PTE marker behaviour, it seems that it would actually be problematic to do so.
So yeah, I'll change that in v3!
On Mon, Oct 21, 2024 at 04:33:19PM +0100, Lorenzo Stoakes wrote: [snip]
- For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
[snip]
CORRECTION: (sorry multitasking atm now I have a non-melted CPU) UFFD WP is safe from MADV_FREE as it checks is_poisoned_swp_entry() which this patch updates to include guard pages, a change I will undo in v3. So disregard comment on UFFD WP here.
On 21.10.24 17:33, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote:
On 10/21/24 16:33, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing process to segfault.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Since I would have personally put MADV_FREE among "or the like" here, it's surprising to me that in fact it's tearing down the guard entries now. Is that intentional? It should be at least mentioned very explicitly. But I'd really argue against it, as MADV_FREE is to me a weaker form of MADV_DONTNEED - the existing pages are not zapped immediately but prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why shouldn't MADV_FREE too?
That is not, as I understand it, what MADV_FREE is, semantically. From the man pages:
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs.
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we don't expect to use it in the near future'.
I think the description gives a wrong impression. What I think matters is what happens (limited to anon private case as MADV_FREE doesn't support any other)
MADV_DONTNEED - pages discarded immediately, further access gives new zero-filled pages
MADV_FREE - pages prioritized for discarding, if that happens before next write, it gets zero-filled page on next access, but a write done soon enough can cancel the upcoming discard.
In that sense, MADV_FREE is a weaker form of DONTNEED, no?
Seems to me rather currently an artifact of MADV_FREE implementation - if it encounters hwpoison entries it will tear them down because why not, we have detected a hw memory error and are lucky the program wants to discard the pages and not access them, so best use the opportunity and get rid of the PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly could).
Right, but we explicitly do not tear them down in the case of MADV_DONTNEED which matches the description in the manpages that the user _might_ come back to the range, whereas MADV_FREE means they are truly done but just don't want the overhead of actually unmapping at this point.
But it's also defined what happens if the user comes back to the range after a MADV_FREE. I think the overhead saved happens in the case of actually coming back soon enough to prevent the discard. With MADV_DONTNEED it's immediate and unconditional.
Seems to me this is more so that MADV_FREE is a not-really-as-efficient version of what Rik wants to do with his MADV_LAZYFREE thing.
I think that further optimizes MADV_FREE, which is already more optimized than MADV_DONTNEED.
But to extend this to guard PTEs which are the result of an explicit userspace action feels wrong, unless the semantics is the same for MADV_DONTNEED. The semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave the same?
My understanding from the above is that MADV_FREE is a softer version of munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a 'revert state to when I first mapped this stuff because I'm done with it for now but might use it later'.
From the implementation I get the opposite understanding. Neither tears down the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately, MADV_FREE effectively too but with a delay depending on memory pressure.
OK so based on IRC chat I think the conclusion here is TL;DR yes we have to change this, you're right :)
To summarise for on-list:
MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there.
For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory.
Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or error? It sounds like a usage "error" to me (in contrast to munmap()).
On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: [snip]
To summarise for on-list:
MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there.
For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory.
Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or error? It sounds like a usage "error" to me (in contrast to munmap()).
It's ignored, no error. On MADV_DONTNEED we already left the guard pages in place; from v3 we will do the same for MADV_FREE.
I'm not sure I'd say it's an error per se, as somebody might have a use case where they want to zap over a range but keep guard pages, perhaps an allocator or something?
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
Also semantically you are achieving what the calls expect you are freeing the ranges since the guard page regions are unbacked so are already freed... so yeah I don't think an error really makes sense here.
We might also be limiting use cases by assuming they might _only_ be used for allocators and such.
-- Cheers,
David / dhildenb
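To make the allocator scenario above concrete, a hedged sketch - the MADV_GUARD_POISON value (102) is the one proposed by this series and is not in any released headers, and a kernel carrying the series is assumed. An arena with an interior guard page can be MADV_DONTNEED'd as a whole: the call succeeds, the data pages are discarded, and the guard marker stays put.

/* Sketch only: assumes MADV_GUARD_POISON == 102 as proposed in this series. */
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102
#endif

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        /* A small "arena": [ data | guard | data ]. */
        char *arena = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (madvise(arena + page, page, MADV_GUARD_POISON))
                return 1; /* kernel without this series */

        arena[0] = 'x';
        arena[2 * page] = 'y';

        /* Zap the whole arena, guard page included: no error is returned,
         * the data pages are discarded and refault as zero-filled, and the
         * guard marker is left in place. */
        assert(madvise(arena, 3 * page, MADV_DONTNEED) == 0);
        assert(arena[0] == 0 && arena[2 * page] == 0);

        /* Touching arena + page here would still raise SIGSEGV. */
        return 0;
}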
On 21.10.24 18:23, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: [snip]
To summarise for on-list:
MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there.
For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory.
Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or error? It sounds like a usage "error" to me (in contrast to munmap()).
It's ignored, no error. On MADV_DONTNEED we already left the guard pages in place; from v3 we will do the same for MADV_FREE.
I'm not sure I'd say it's an error per se, as somebody might have a use case where they want to zap over a range but keep guard pages, perhaps an allocator or something?
Hm, not sure I see use for that.
Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would process PROT_NONE. So current behavior is at least consistent with PROT_NONE handling (where something could be mapped, though).
No strong opinion.
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct. UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp entries.
Also semantically you are achieving what the calls expect you are freeing the ranges since the guard page regions are unbacked so are already freed... so yeah I don't think an error really makes sense here.
If you compare it to a VMA hole, it makes sense to fail. If we treat it like PROT_NONE, it makes sense to skip them.
We might also be limiting use cases by assuming they might _only_ be used for allocators and such.
I don't buy that as an argument, sorry :)
"Let's map the kernel writable into all user space because otherwise we might be limiting use cases"
:P
On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote:
On 21.10.24 18:23, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: [snip]
To summarise for on-list:
MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there.
For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory.
Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or error? It sounds like a usage "error" to me (in contrast to munmap()).
It's ignored, no error. On MADV_DONTNEED we already left the guard pages in place; from v3 we will do the same for MADV_FREE.
I'm not sure I'd say it's an error per se, as somebody might have a use case where they want to zap over a range but keep guard pages, perhaps an allocator or something?
Hm, not sure I see use for that.
Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would process PROT_NONE. So current behavior is at least consistent with PROT_NONE handling (where something could be mapped, though).
Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out the whole procedure_ then return an error, an error _indistinguishable from an error arising from any of the individual parts_.
Which is just, awful.
No strong opinion.
Well you used up your strong opinion on the naming ;)
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct.
Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I mean the MADV flags are a confusing mess generally, as per Vlasta's comments which to begin with I strongly disagreed with then, discussing further, realised that no this is just a bit insane and had driven _me_ insane.
UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp entries.
Also semantically you are achieving what the calls expect you are freeing the ranges since the guard page regions are unbacked so are already freed... so yeah I don't think an error really makes sense here.
If you compare it to a VMA hole, it makes sense to fail. If we treat it like PROT_NONE, it makes sense to skip them.
We might also be limiting use cases by assuming they might _only_ be used for allocators and such.
I don't buy that as an argument, sorry :)
"Let's map the kernel writable into all user space because otherwise we might be limiting use cases"
That's a great idea! Patch series incoming, 1st April 2025... :>)
:P
-- Cheers,
David / dhildenb
Overall I think just always leaving in place except on remedy err sorry sorry unpoison and munmap and not returning an error if encountered elsewhere (other than, of course, GUP) is the right way forward and most in line with user expectation and practical usage.
On 21.10.24 18:51, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote:
On 21.10.24 18:23, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: [snip]
To summarise for on-list:
MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there.
For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory.
Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or error? It sounds like a usage "error" to me (in contrast to munmap()).
It's ignored, no error. On MADV_DONTNEED we already left the guard pages in place; from v3 we will do the same for MADV_FREE.
I'm not sure I'd say it's an error per se, as somebody might have a use case where they want to zap over a range but keep guard pages, perhaps an allocator or something?
Hm, not sure I see use for that.
Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would process PROT_NONE. So current behavior is at least consistent with PROT_NONE handling (where something could be mapped, though).
Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out the whole procedure_ then return an error, an error _indistinguishable from an error arising from any of the individual parts_.
Which is just, awful.
Yes, absolutely. I don't know why we decided to continue. And why we return ENOMEM ...
No strong opinion.
Well you used up your strong opinion on the naming ;)
He, and I am out of energy for this year ;)
In retrospect, "install or remove a guard PTE" is just much better than anything else ...
So I should never have been misled to suggest poison/unpoison as a replacement for poison/remedy :P
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct.
Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
Huh?
madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) ... zap_pte_range()
} else if (is_hwpoison_entry(entry) ||
	   is_poisoned_swp_entry(entry)) {
	if (!should_zap_cows(details))
		continue;
...
Should just zap them.
What am I missing?
mean the MADV flags are a confusing mess generally, as per Vlasta's comments which to begin with I strongly disagreed with then, discussing further, realised that no this is just a bit insane and had driven _me_ insane.
UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp entries.
Also semantically you are achieving what the calls expect you are freeing the ranges since the guard page regions are unbacked so are already freed... so yeah I don't think an error really makes sense here.
If you compare it to a VMA hole, it makes sense to fail. If we treat it like PROT_NONE, it makes sense to skip them.
We might also be limiting use cases by assuming they might _only_ be used for allocators and such.
I don't buy that as an argument, sorry :)
"Let's map the kernel writable into all user space because otherwise we might be limiting use cases"
That's a great idea! Patch series incoming, 1st April 2025... :>)
:) Just flip the bit on x86 and we're done!
:P
-- Cheers,
David / dhildenb
Overall I think just always leaving in place except on remedy err sorry sorry unpoison and munmap and not returning an error if encountered elsewhere (other than, of course, GUP) is the right way forward and most in line with user expectation and practical usage.
Fine with me, make sure to document that it behaves like a PROT_NONE VMA, not like a memory hole, except when something would trigger a fault (GUP etc).
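A small sketch of the PROT_NONE-vs-hole distinction agreed here, again assuming the MADV_GUARD_POISON value (102) proposed by this series: madvise() across an unmapped hole fails with ENOMEM, while the same call across a mapped-but-guarded range succeeds.

/* Sketch: a guarded range behaves like PROT_NONE (still mapped), not like a
 * VMA hole. Assumes MADV_GUARD_POISON == 102 from this series. */
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102
#endif

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);

        /* Case 1: a real hole - unmap the middle page. */
        char *hole = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        munmap(hole + page, page);
        assert(madvise(hole, 3 * page, MADV_DONTNEED) == -1 &&
               errno == ENOMEM);

        /* Case 2: a guarded page - still mapped, merely unbacked. */
        char *guarded = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (madvise(guarded + page, page, MADV_GUARD_POISON))
                return 1; /* kernel without this series */
        assert(madvise(guarded, 3 * page, MADV_DONTNEED) == 0);

        return 0;
}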
On Mon, Oct 21, 2024 at 07:00:53PM +0200, David Hildenbrand wrote:
On 21.10.24 18:51, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote:
On 21.10.24 18:23, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: [snip]
To summarise for on-list:
MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there.
For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine.
However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory.
Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or error? It sounds like a usage "error" to me (in contrast to munmap()).
It's ignored, no error. On MADV_DONTNEED we already left the guard pages in place; from v3 we will do the same for MADV_FREE.
I'm not sure I'd say it's an error per se, as somebody might have a use case where they want to zap over a range but keep guard pages, perhaps an allocator or something?
Hm, not sure I see use for that.
Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would process PROT_NONE. So current behavior is at least consistent with PROT_NONE handling (where something could be mapped, though).
Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out the whole procedure_ then return an error, an error _indistinguishable from an error arising from any of the individual parts_.
Which is just, awful.
Yes, absolutely. I don't know why we decided to continue. And why we return ENOMEM ...
Anyway UAPI so no turning back now :)
No strong opinion.
Well you used up your strong opinion on the naming ;)
He, and I am out of energy for this year ;)
Haha understandable...
In retrospect, "install or remove a guard PTE" is just much better than anything else ...
So I should never have been misled to suggest poison/unpoison as a replacement for poison/remedy :P
You know the reason ;)
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct.
Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
Huh?
madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) ... zap_pte_range()
} else if (is_hwpoison_entry(entry) ||
	   is_poisoned_swp_entry(entry)) {
	if (!should_zap_cows(details))
		continue;
...
Should just zap them.
What am I missing?
Yeah ok it's me who's missing something here, I hadn't noticed details == NULL so should_zap_cows() is true, my mistake!
In any case we explicitly add code here to prevent guard pages from going. I will correct everything where I wrongly say otherwise, doh!
mean the MADV flags are a confusing mess generally, as per Vlasta's comments which to begin with I strongly disagreed with then, discussing further, realised that no this is just a bit insane and had driven _me_ insane.
UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp entries.
Also semantically you are achieving what the calls expect you are freeing the ranges since the guard page regions are unbacked so are already freed... so yeah I don't think an error really makes sense here.
If you compare it to a VMA hole, it makes sense to fail. If we treat it like PROT_NONE, it makes sense to skip them.
We might also be limiting use cases by assuming they might _only_ be used for allocators and such.
I don't buy that as an argument, sorry :)
"Let's map the kernel writable into all user space because otherwise we might be limiting use cases"
That's a great idea! Patch series incoming, 1st April 2025... :>)
:) Just flip the bit on x86 and we're done!
;)
:P
-- Cheers,
David / dhildenb
Overall I think just always leaving in place except on remedy err sorry sorry unpoison and munmap and not returning an error if encountered elsewhere (other than, of course, GUP) is the right way forward and most in line with user expectation and practical usage.
Fine with me, make sure to document that it behaves like a PROT_NONE VMA, not like a memory hole, except when something would trigger a fault (GUP etc).
Ack will make sure to document.
-- Cheers,
David / dhildenb
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct.
Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
Huh?
madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) ... zap_pte_range()
} else if (is_hwpoison_entry(entry) ||
	   is_poisoned_swp_entry(entry)) {
	if (!should_zap_cows(details))
		continue;
...
Should just zap them.
What am I missing?
Yeah ok it's me who's missing something here, I hadn't noticed details == NULL so should_zap_cows() is true, my mistake!
It's confusing code ... I once had a patch to call it "should_zap_anon_folios" etc, but it would only slightly make it clearer what is actually happening.
In any case we explicitly add code here to prevent guard pages from going. I will correct everything where I wrongly say otherwise, doh!
Right, that's also one of the reasons I don't think "poison" terminology is the best fit, because there are some subtle differences. At least the marker does not mention "poison" but PTE_MARKER_GUARD, which I think is a very good naming :)
On 10/21/24 19:14, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 07:00:53PM +0200, David Hildenbrand wrote:
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct.
Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
Huh?
madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) ... zap_pte_range()
} else if (is_hwpoison_entry(entry) ||
	   is_poisoned_swp_entry(entry)) {
	if (!should_zap_cows(details))
		continue;
...
Should just zap them.
What am I missing?
Yeah ok it's me who's missing something here, I hadn't noticed details == NULL so should_zap_cows() is true, my mistake!
Well, good to know it's consistent then. As I've explained I see why zapping actual hwpoison makes sense for MADV_DONTNEED/MADV_FREE. That it's done also for uffd poison is not completely clear, but maybe it was just easier to implement. But it doesn't mean we have to do the same for GUARD PTEs. Either behavior of zap/ignore/error could be valid, we just have to pick one and then live with it as it can't change :) Zapping guards on DONTNEED/FREE seems wrong to me, so it's between error (and potentially catching some misuse) and ignore (potentially more performant in case somebody wants to DONTNEED/FREE an area that contains scattered guards).
And the impossibility to meaningfully unwind on errors in the middle of the operation (unless we pre-scan for guards) isn't exactly nice, so maybe just ignore, i.e. the current approach?
In any case we explicitly add code here to prevent guard pages from going. I will correct everything where I wrongly say otherwise, doh!
mean the MADV flags are a confusing mess generally, as per Vlasta's comments which to begin with I strongly disagreed with then, discussing further, realised that no this is just a bit insane and had driven _me_ insane.
UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp entries.
Also semantically you are achieving what the calls expect you are freeing the ranges since the guard page regions are unbacked so are already freed... so yeah I don't think an error really makes sense here.
If you compare it to a VMA hole, it makes sense to fail. If we treat it like PROT_NONE, it makes sense to skip them.
We might also be limiting use cases by assuming they might _only_ be used for allocators and such.
I don't buy that as an argument, sorry :)
"Let's map the kernel writable into all user space because otherwise we might be limiting use cases"
That's a great idea! Patch series incoming, 1st April 2025... :>)
:) Just flip the bit on x86 and we're done!
;)
:P
-- Cheers,
David / dhildenb
Overall I think just always leaving in place except on remedy err sorry sorry unpoison and munmap and not returning an error if encountered elsewhere (other than, of course, GUP) is the right way forward and most in line with user expectation and practical usage.
Fine with me, make sure to document that it behaves like a PROT_NONE VMA, not like a memory hole, except when something would trigger a fault (GUP etc).
Ack will make sure to document.
-- Cheers,
David / dhildenb
On 21.10.24 19:26, Vlastimil Babka wrote:
On 10/21/24 19:14, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 07:00:53PM +0200, David Hildenbrand wrote:
Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour.
HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct.
Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
Huh?
madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) ... zap_pte_range()
} else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { if (!should_zap_cows(details)) continue; ...
Should just zap them.
What am I missing?
Yeah ok it's me who's missing something here, I hadn't noticed details == NULL so should_zap_cows() is true, my mistake!
Well, good to know it's consistent then. As I've explained I see why zapping actual hwpoison makes sense for MADV_DONTNEED/MADV_FREE. That it's done also for uffd poison is not completely clear, but maybe it was just easier to implement.
Note that in VM context "uffd poison" really just is "this was hwpoison on the source VM, so we mimic that on the destination VM, because the data *is* lost" -- so you want the exact same behavior.
For example, when a VM reboots you might just want to ZAP these hwpoison entries, and get fresh pages on next access.
So to me it makes sense that they are treated equally.
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise.
Currently users must establish PROT_NONE ranges to achieve this.
However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs.
In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges.
The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - poisoning PTEs such that accesses to them cause a fault followed by a SIGSEGV signal being raised.
This is achieved through the PTE marker mechanism, which a previous commit in this series extended to permit this to be done, installed via the generic page walking logic, also extended by a prior commit for this purpose.
These poison ranges are established with MADV_GUARD_POISON, and if the range in which they are installed contains any existing mappings, they will be zapped, i.e. the range is freed and the memory unmapped (thus mimicking the behaviour of MADV_DONTNEED in this respect).
Any existing poison entries will be left untouched. There is no nesting of poisoned pages.
Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather unexpected behaviour, but are cleared on process teardown or unmapping of memory ranges.
Ranges can have the poison property removed by MADV_GUARD_UNPOISON - 'remedying' the poisoning. The ranges over which this is applied, should they contain non-poison entries, will be untouched, only poison entries will be cleared.
We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive).
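For illustration, a minimal usage sketch of the interface described above. The advice values (102/103) are the ones proposed by this patch and are not in released libc headers; on a kernel without the series the first madvise() call simply fails.

/* Sketch: install a guard page, observe the fatal signal in a child process,
 * then unpoison the range and use the page normally. */
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102
#define MADV_GUARD_UNPOISON 103
#endif

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        char *buf = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* Guard the middle page; the VMA itself is left untouched. */
        if (madvise(buf + page, page, MADV_GUARD_POISON)) {
                perror("MADV_GUARD_POISON (kernel without this series?)");
                return 1;
        }

        /* Accessing the guarded page is fatal - probe it in a child. */
        pid_t pid = fork();
        if (pid == 0) {
                buf[page] = 1;  /* expected to die with SIGSEGV */
                _exit(0);
        }
        int status;
        waitpid(pid, &status, 0);
        assert(WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV);

        /* Remove the guard; the page is now an ordinary zero-filled page. */
        assert(madvise(buf + page, page, MADV_GUARD_UNPOISON) == 0);
        buf[page] = 1;

        return 0;
}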
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 include/uapi/asm-generic/mman-common.h |   3 +
 mm/madvise.c                           | 168 +++++++++++++++++++++++++
 mm/mprotect.c                          |   3 +-
 mm/mseal.c                             |   1 +
 8 files changed, 186 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..71e13f27742d 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_POISON 102		/* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103		/* revoke guard poisoning */
+
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 9c48d9a21aa0..1a2222322f77 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_POISON 102		/* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103		/* revoke guard poisoning */
+
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..380905522397 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -75,6 +75,9 @@
 #define MADV_HWPOISON	100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
+#define MADV_GUARD_POISON 102		/* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103		/* revoke guard poisoning */
+
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..e8d5affceb28 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_POISON 102		/* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103		/* revoke guard poisoning */
+
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..5dfd3d442de4 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_POISON 102		/* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103		/* revoke guard poisoning */
+
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/mm/madvise.c b/mm/madvise.c index e871a72a6c32..7b9a357b84d2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -60,6 +60,8 @@ static int madvise_need_mmap_write(int behavior) case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE: + case MADV_GUARD_POISON: + case MADV_GUARD_UNPOISON: return 0; default: /* be safe, default to 1. list exceptions explicitly */ @@ -1017,6 +1019,166 @@ static long madvise_remove(struct vm_area_struct *vma, return error; }
+static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
+{
+	vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB;
+
+	/*
+	 * A user could lock after poisoning but that's fine, as they'd not be
+	 * able to fault in. The issue arises when we try to zap existing locked
+	 * VMAs. We don't want to do that.
+	 */
+	if (!allow_locked)
+		disallowed |= VM_LOCKED;
+
+	if (!vma_is_anonymous(vma))
+		return false;
+
+	if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
+		return false;
+
+	return true;
+}
+
+static bool is_guard_pte_marker(pte_t ptent)
+{
+	return is_pte_marker(ptent) &&
+		is_guard_swp_entry(pte_to_swp_entry(ptent));
+}
+
+static int guard_poison_pud_entry(pud_t *pud, unsigned long addr, unsigned long next,
+				  struct mm_walk *walk)
+{
+	pud_t pudval = pudp_get(pud);
+
+	/* Do not split a huge pud - we do nothing with these so just ignore. */
+	if (pud_trans_huge(pudval) || pud_devmap(pudval))
+		walk->action = ACTION_CONTINUE;
+
+	return 0;
+}
+
+static int guard_poison_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
+				  struct mm_walk *walk)
+{
+	pmd_t pmdval = pmdp_get(pmd);
+
+	/* Do not split a huge pmd - we do nothing with these so just ignore. */
+	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+		walk->action = ACTION_CONTINUE;
+
+	return 0;
+}
+
+static int guard_poison_pte_entry(pte_t *pte, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
+{
+	pte_t pteval = ptep_get(pte);
+
+	/*
+	 * If not a guard marker, simply abort the operation. We return a value
+	 * > 0 indicating a non-error abort.
+	 */
+	return !is_guard_pte_marker(pteval);
+}
+
+static int guard_poison_install_pte(unsigned long addr, unsigned long next,
+				    pte_t *ptep, struct mm_walk *walk)
+{
+	/* Simply install a PTE marker, this causes segfault on access. */
+	*ptep = make_pte_marker(PTE_MARKER_GUARD);
+
+	return 0;
+}
+
+static const struct mm_walk_ops guard_poison_walk_ops = {
+	.pud_entry	= guard_poison_pud_entry,
+	.pmd_entry	= guard_poison_pmd_entry,
+	.pte_entry	= guard_poison_pte_entry,
+	.install_pte	= guard_poison_install_pte,
+	.walk_lock	= PGWALK_RDLOCK,
+};
+
+static long madvise_guard_poison(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	long err;
+
+	*prev = vma;
+	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
+		return -EINVAL;
+
+	/*
+	 * If we install poison markers, then the range is no longer
+	 * empty from a page table perspective and therefore it's
+	 * appropriate to have an anon_vma.
+	 *
+	 * This ensures that on fork, we copy page tables correctly.
+	 */
+	err = anon_vma_prepare(vma);
+	if (err)
+		return err;
+
+	/*
+	 * Optimistically try to install the guard poison pages first. If any
+	 * non-guard pages are encountered, give up and zap the range before
+	 * trying again.
+	 */
+	while (true) {
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_poison_walk_ops, NULL);
+		if (err <= 0)
+			return err;
+
+		/*
+		 * OK some of the range have non-guard pages mapped, zap
+		 * them. This leaves existing guard pages in place.
+		 */
+		zap_page_range_single(vma, start, end - start, NULL);
+
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		cond_resched();
+	}
+}
+
+static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
+				    unsigned long next, struct mm_walk *walk)
+{
+	pte_t ptent = ptep_get(pte);
+
+	if (is_guard_pte_marker(ptent)) {
+		/* Simply clear the PTE marker. */
+		pte_clear_not_present_full(walk->mm, addr, pte, false);
+		update_mmu_cache(walk->vma, addr, pte);
+	}
+
+	return 0;
+}
+
+static const struct mm_walk_ops guard_unpoison_walk_ops = {
+	.pte_entry	= guard_unpoison_pte_entry,
+	.walk_lock	= PGWALK_RDLOCK,
+};
+
+static long madvise_guard_unpoison(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	/*
+	 * We're ok with unpoisoning mlock()'d ranges, as this is a
+	 * non-destructive action.
+	 */
+	if (!is_valid_guard_vma(vma, /* allow_locked = */true))
+		return -EINVAL;
+
+	return walk_page_range(vma->vm_mm, start, end,
+			       &guard_unpoison_walk_ops, NULL);
+}
+
 /*
  * Apply an madvise behavior to a region of a vma. madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1098,6 +1260,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	case MADV_COLLAPSE:
 		return madvise_collapse(vma, prev, start, end);
+	case MADV_GUARD_POISON:
+		return madvise_guard_poison(vma, prev, start, end);
+	case MADV_GUARD_UNPOISON:
+		return madvise_guard_unpoison(vma, prev, start, end);
 	}

 	anon_name = anon_vma_name(vma);
@@ -1197,6 +1363,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_DODUMP:
 	case MADV_WIPEONFORK:
 	case MADV_KEEPONFORK:
+	case MADV_GUARD_POISON:
+	case MADV_GUARD_UNPOISON:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d06107d..d0e3ebfadef8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -236,7 +236,8 @@ static long change_pte_range(struct mmu_gather *tlb,
 		} else if (is_pte_marker_entry(entry)) {
 			/*
 			 * Ignore error swap entries unconditionally,
-			 * because any access should sigbus anyway.
+			 * because any access should sigbus/sigsegv
+			 * anyway.
 			 */
 			if (is_poisoned_swp_entry(entry))
 				continue;
diff --git a/mm/mseal.c b/mm/mseal.c
index ece977bd21e1..21bf5534bcf5 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior)
 	case MADV_REMOVE:
 	case MADV_DONTFORK:
 	case MADV_WIPEONFORK:
+	case MADV_GUARD_POISON:
 		return true;
 	}
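For illustration, here is a minimal userspace sketch of how the interface above is intended to be used. It assumes the new MADV_GUARD_POISON / MADV_GUARD_UNPOISON constants are available from updated uapi headers; it is not part of the patch itself.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	/* One anonymous VMA covering the whole range - no extra VMAs needed. */
	char *buf = mmap(NULL, 16 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* Turn the second page into a guard page - access now raises SIGSEGV. */
	if (madvise(buf + page, page, MADV_GUARD_POISON))
		perror("MADV_GUARD_POISON");

	/* Touching buf[page] here would fault fatally. */

	/* Remove the guard marker again; other pages are left untouched. */
	if (madvise(buf + page, page, MADV_GUARD_UNPOISON))
		perror("MADV_GUARD_UNPOISON");

	munmap(buf, 16 * page);
	return 0;
}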
On 20.10.24 18:20, Lorenzo Stoakes wrote:
Implement a new lightweight guard page feature, that is, regions of userland virtual memory that, when accessed, cause a fatal signal to arise.
Currently users must establish PROT_NONE ranges to achieve this.
However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs.
In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges.
The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - poisoning PTEs such that accesses to them cause a fault followed by a SIGSEGV signal being raised.
This is achieved through the PTE marker mechanism, which a previous commit in this series extended to permit this to be done, installed via the generic page walking logic, also extended by a prior commit for this purpose.
These poison ranges are established with MADV_GUARD_POISON, and if the range in which they are installed contains any existing mappings, those will be zapped, i.e. the range is freed and the memory unmapped (thus mimicking the behaviour of MADV_DONTNEED in this respect).
Any existing poison entries will be left untouched. There is no nesting of poisoned pages.
Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather unexpected behaviour, but are cleared on process teardown or unmapping of memory ranges.
Ranges can have the poison property removed by MADV_GUARD_UNPOISON - 'remedying' the poisoning. The ranges over which this is applied, should they contain non-poison entries, will be left untouched; only poison entries will be cleared.
We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive).
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 include/uapi/asm-generic/mman-common.h |   3 +
 mm/madvise.c                           | 168 +++++++++++++++++++++++++
 mm/mprotect.c                          |   3 +-
 mm/mseal.c                             |   1 +
 8 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..71e13f27742d 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,9 @@
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_POISON 102		/* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103		/* revoke guard poisoning */
Just to raise it here: MADV_GUARD_INSTALL / MADV_GUARD_REMOVE or sth. like that would have been even clearer, at least to me.
But no strong opinion, just if somebody else reading along was wondering about the same.
I'm hoping to find more time to have a closer look at this this week, but in general, the concept sounds reasonable to me.
On Mon, Oct 21, 2024 at 07:05:27PM +0200, David Hildenbrand wrote:
On 20.10.24 18:20, Lorenzo Stoakes wrote:
[snip]
Just to raise it here: MADV_GUARD_INSTALL / MADV_GUARD_REMOVE or sth. like that would have been even clearer, at least to me.
:)
It still feels like poisoning to me because we're explicitly putting something in the page tables to make a range have different fault behaviour like a HW poisoning, and 'installing' suggests backing or something like this, I think that's more confusing.
But no strong opinion, just if somebody else reading along was wondering about the same.
I'm hoping to find more time to have a closer look at this this week, but in general, the concept sounds reasonable to me.
Thanks!
On 21.10.24 19:15, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 07:05:27PM +0200, David Hildenbrand wrote:
On 20.10.24 18:20, Lorenzo Stoakes wrote:
[snip]
Just to raise it here: MADV_GUARD_INSTALL / MADV_GUARD_REMOVE or sth. like that would have been even clearer, at least to me.
:)
It still feels like poisoning to me because we're explicitly putting something in the page tables to make a range have different fault behaviour like a HW poisoning, and 'installing' suggests backing or something like this, I think that's more confusing.
I connect "poison" to "SIGBUS" and "corrupt memory state", not to "there is nothing and there must not be anything". Thus my thinking. But again, not the end of the world, just wanted to raise it ...
But no strong opinion, just if somebody else reading along was wondering about the same.
I'm hoping to find more time to have a closer look at this this week, but in general, the concept sounds reasonable to me.
Thanks!
Thank you for implementing this and up-streaming it!
On 10/21/24 10:23 AM, David Hildenbrand wrote:
On 21.10.24 19:15, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 07:05:27PM +0200, David Hildenbrand wrote:
On 20.10.24 18:20, Lorenzo Stoakes wrote:
[snip]
Just to raise it here: MADV_GUARD_INSTALL / MADV_GUARD_REMOVE or sth. like that would have been even clearer, at least to me.
Yes, I think so.
:)
It still feels like poisoning to me because we're explicitly putting something in the page tables to make a range have different fault behaviour like a HW poisoning, and 'installing' suggests backing or something like this, I think that's more confusing.
I connect "poison" to "SIGBUS" and "corrupt memory state", not to "there is nothing and there must not be anything". Thus my thinking. But again, not the end of the world, just wanted to raise it ...
"Poison" is used so far for fairly distinct things, and I'd very much like to avoid extending its meaning to guard pages. It makes the other things less unique, and it misses a naming and classification opportunity.
"Guard" and "guard page" are fairly unique names. That's valuable.
thanks,
On Mon, Oct 21, 2024 at 12:25:08PM -0700, John Hubbard wrote:
On 10/21/24 10:23 AM, David Hildenbrand wrote:
[snip]
Just to raise it here: MADV_GUARD_INSTALL / MADV_GUARD_REMOVE or sth. like that would have been even clearer, at least to me.
Yes, I think so.
:)
It still feels like poisoning to me because we're explicitly putting something in the page tables to make a range have different fault behaviour like a HW poisoning, and 'installing' suggests backing or something like this, I think that's more confusing.
I connect "poison" to "SIGBUS" and "corrupt memory state", not to "there is nothing and there must not be anything". Thus my thinking. But again, not the end of the world, just wanted to raise it ...
"Poison" is used so far for fairly distinct things, and I'd very much like to avoid extending its meaning to guard pages. It makes the other things less unique, and it misses a naming and classification opportunity.
"Guard" and "guard page" are fairly unique names. That's valuable.
thanks,
John Hubbard
Guys you're breaking my heart... Will you not leave me with even a remnant of a cultural reference?? [0]
On 21.10.24 21:39, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 12:25:08PM -0700, John Hubbard wrote:
On 10/21/24 10:23 AM, David Hildenbrand wrote:
[snip]
Just to raise it here: MADV_GUARD_INSTALL / MADV_GUARD_REMOVE or sth. like that would have been even clearer, at least to me.
Yes, I think so.
:)
It still feels like poisoning to me because we're explicitly putting something in the page tables to make a range have different fault behaviour like a HW poisoning, and 'installing' suggests backing or something like this, I think that's more confusing.
I connect "poison" to "SIGBUS" and "corrupt memory state", not to "there is nothing and there must not be anything". Thus my thinking. But again, not the end of the world, just wanted to raise it ...
"Poison" is used so far for fairly distinct things, and I'd very much like to avoid extending its meaning to guard pages. It makes the other things less unique, and it misses a naming and classification opportunity.
"Guard" and "guard page" are fairly unique names. That's valuable.
thanks,
John Hubbard
Guys you're breaking my heart... Will you not leave me with even a remnant of a cultural reference?? [0]
I'm sure you will recover from that loss :)
On 10/20/24 18:20, Lorenzo Stoakes wrote:
<snip>
+static long madvise_guard_poison(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	long err;
+
+	*prev = vma;
+	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
+		return -EINVAL;
+
+	/*
+	 * If we install poison markers, then the range is no longer
+	 * empty from a page table perspective and therefore it's
+	 * appropriate to have an anon_vma.
+	 *
+	 * This ensures that on fork, we copy page tables correctly.
+	 */
+	err = anon_vma_prepare(vma);
+	if (err)
+		return err;
+
+	/*
+	 * Optimistically try to install the guard poison pages first. If any
+	 * non-guard pages are encountered, give up and zap the range before
+	 * trying again.
+	 */
Should the page walker become powerful enough to handle this in one go? :) But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
+	while (true) {
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_poison_walk_ops, NULL);
+		if (err <= 0)
+			return err;
+
+		/*
+		 * OK some of the range have non-guard pages mapped, zap
+		 * them. This leaves existing guard pages in place.
+		 */
+		zap_page_range_single(vma, start, end - start, NULL);
... however the potentially endless loop doesn't seem great. Could a malicious program keep refaulting the range (ignoring any segfaults if it loses a race) with one thread while failing to make progress here with another thread? Is that ok because it would only punish itself?
I mean if we have to retry the guard page installation more than once, it means the program *is* racing faults with installing guard ptes in the same range, right? So it would be right to segfault it. But I guess when we detect it here, we have no way to send the signal to the right thread and it would be too late? So unless we can do the PTE zap+install marker atomically, maybe there's no better way and this is acceptable as a malicious program can harm only itself?
Maybe it would be just simpler to install the marker via zap_details rather than the pagewalk?
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		cond_resched();
+	}
+}
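To make the allocator debug-mode use case mentioned above concrete, here is a hedged sketch; debug_free_pages()/debug_reuse_pages() are hypothetical helpers, not code from any real allocator, and assume page-aligned arguments plus updated uapi headers providing the new MADV_* constants.

#include <stdio.h>
#include <sys/mman.h>

/* On free: turn the freed pages into guard pages so any use-after-free
 * faults fatally. Existing mappings in the range are zapped by the kernel
 * as part of MADV_GUARD_POISON, so no separate MADV_DONTNEED is needed. */
void debug_free_pages(void *ptr, size_t len)
{
	if (madvise(ptr, len, MADV_GUARD_POISON))
		perror("MADV_GUARD_POISON");
}

/* Before handing the pages out again: drop only the guard markers;
 * any non-guard entries in the range are left untouched. */
void debug_reuse_pages(void *ptr, size_t len)
{
	if (madvise(ptr, len, MADV_GUARD_UNPOISON))
		perror("MADV_GUARD_UNPOISON");
}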
On 21.10.24 22:11, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
<snip>
Should the page walker become powerful enough to handle this in one go? :) But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
Yes please, KIS. We can always implement support for that later if really required (leave behavior open when documenting).
On 10/21/24 22:17, David Hildenbrand wrote:
On 21.10.24 22:11, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
<snip>
Should the page walker become powerful enough to handle this in one go? :) But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
Yes please, KIS.
You mean "require userspace to zap first (MADV_DONTNEED)" ?
I'd normally agree with the KIS principle, but..
We can always implement support for that later if
it would either mean later we change behavior (installing guards on non-zapped PTEs would have to be an error now but maybe start working later, which is user observable change thus can break somebody)
really required (leave behavior open when documenting).
and leaving it open when documenting doesn't really mean anything for the "we don't break userspace" promise vs what the implementation actually does.
Or the changed behavior would need to come with a new MADVISE mode. Not appealing as it's a mess already.
So since it's uapi we should aim for the best from the start.
On Mon, Oct 21, 2024 at 10:25:06PM +0200, Vlastimil Babka wrote:
On 10/21/24 22:17, David Hildenbrand wrote:
On 21.10.24 22:11, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
<snip>
Should the page walker become powerful enough to handle this in one go? :) But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
Yes please, KIS.
You mean "require userspace to zap first (MADV_DONTNEED)" ?
What on earth are you talking about? This is crazy, we can detect if we need to zap with the page walker then just zap? Why would we do this?
The solution as is is perfectly simple... What is the justification for this on any level?
Again, if you think there's a _genuine_ security/DoS issue here you're going to really need to demonstrate it rather than hand wave?
I'd normally agree with the KIS principle, but..
We can always implement support for that later if
it would either mean later we change behavior (installing guards on non-zapped PTEs would have to be an error now but maybe start working later, which is user observable change thus can break somebody)
really required (leave behavior open when documenting).
and leaving it open when documenting doesn't really mean anything for the "we don't break userspace" promise vs what the implementation actually does.
Or the changed behavior would need to come with a new MADVISE mode. Not appealing as it's a mess already.
So since its uapi we should aim for the best from the start.
Best is 'call the madvise(), guard pages installed' which is what it is now.
On 21.10.24 22:25, Vlastimil Babka wrote:
On 10/21/24 22:17, David Hildenbrand wrote:
On 21.10.24 22:11, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
<snip>
Should the page walker become powerful enough to handle this in one go? :) But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
Yes please, KIS.
You mean "require userspace to zap first (MADV_DONTNEED)" ?
Yes, I see from Lorenzo's reply that there is apparently some history to this (maybe it's all nicely summarized in the cover letter / this patch, have to dig further).
Not sure yet what the problem is, I would have thought it's all protected by the PTL, and concurrent faults are user space doing something stupid and we'd detect it.
Have to do some more reading on this.
I'd normally agree with the KIS principle, but..
We can always implement support for that later if
it would either mean later we change behavior (installing guards on non-zapped PTEs would have to be an error now but maybe start working later, which is user observable change thus can break somebody)
really required (leave behavior open when documenting).
and leaving it open when documenting doesn't really mean anything for the "we don't break userspace" promise vs what the implementation actually does.
Not quite, I think. You could start returning -EEXIST or -EOPNOTSUPP and document that this can change in the future to succeed if there is something. User space can sense support.
Something that fails now but starts working at some point is not really breaking user space, unless someone really *wants* to fail if there is already something (e.g., concurrent fault -> bail out instead of hiding it).
Of course, a more elegant solution would be GUARD_INSTALL vs. GUARD_FORCE_INSTALL.
.. but again, there seems to be more history to this.
On Mon, Oct 21, 2024 at 10:37:39PM +0200, David Hildenbrand wrote:
On 21.10.24 22:25, Vlastimil Babka wrote:
On 10/21/24 22:17, David Hildenbrand wrote:
On 21.10.24 22:11, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
<snip>
Should the page walker become powerful enough to handle this in one go? :) But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
Yes please, KIS.
You mean "require userspace to zap first (MADV_DONTNEED)" ?
Yes, I see from Lorenzo's reply that there is apparently some history to this (maybe it's all nicely summarized in the cover letter / this patch, have to dig further).
Not sure yet what the problem is, I would have thought it's all protected by the PTL, and concurrent faults are user space doing something stupid and we'd detect it.
The looping mechanism is fine for dealing with concurrent faults. There's no actual _race_ due to PTL, it's just that a user could repeatedly populate stuff stupidly in a range that is meant to have poison markers put in.
It's not likely and would be kind of an abuse of the interface, and it'd really be a process just hurting itself.
In nearly all cases you won't zap at all. The whole point is it's optimistic. In 99.99% of others you zap once...
Have to do some more reading on this.
May I suggest a book on the history of the prodigy?
I'd normally agree with the KIS principle, but..
We can always implement support for that later if
it would either mean later we change behavior (installing guards on non-zapped PTEs would have to be an error now but maybe start working later, which is user observable change thus can break somebody)
really required (leave behavior open when documenting).
and leaving it open when documenting doesn't really mean anything for the "we don't break userspace" promise vs what the implementation actually does.
Not quite I think. You could start return -EEXIST or -EOPNOTSUPP and document that this can change in the future to succeed if there is something. User space can sense support.
Yeah I mean originally I had a -EAGAIN which was sort of equivalent of this but Jann pointed out you're just shifting work to userland who would loop and repeat.
I just don't see why we'd do this.
In fact I was looking at the series and thinking 'wow it's actually a really small delta' and being proud but... still not KIS enough apparently ;)
Something failing that at one point starts working is not really breaking user space, unless someone really *wants* to fail if there is already something (e.g., concurrent fault -> bail out instead of hiding it).
Of course, a more elegant solution would be GUARD_INSTALL vs. GUARD_FORCE_INSTALL.
.. but again, there seems to be more history to this.
I don't think there's really any value in that. There's just no sensible situation in which a user would care about this I don't think.
And if you're saying 'hey do MADV_DONTNEED if this fails and keep trying!' then why not just do that in the kernel?
Trying to explain to a user 'hey this is for installing guard pages but if there's a racing fault it'll fail and that could keep happening and then you'll have to zap and maybe in a loop' just... seems like a bloody awful interface?
I prefer 'here's an interface for installing and removing guard pages, enjoy!' :)
Yes, I see from Lorenzo's reply that there is apparently some history to this (maybe it's all nicely summarized in the cover letter / this patch, have to dig further).
Not sure yet what the problem is, I would have thought it's all protected by the PTL, and concurrent faults are user space doing something stupid and we'd detect it.
The looping mechanism is fine for dealing with concurrent faults. There's no actual _race_ due to PTL, it's just that a user could repeatedly populate stuff stupidly in a range that is meant to have poison markers put in.
It's not likely and would be kind of an abuse of the interface, and it'd really be a process just hurting itself.
In nearly all cases you won't zap at all. The whole point is it's optimistic. In 99.99% of others you zap once...
Exactly! And that's why I am questioning whether the kernel should care about that. See below.
Have to do some more reading on this.
May I suggest a book on the history of the prodigy?
:D
I'd normally agree with the KIS principle, but..
We can always implement support for that later if
it would either mean later we change behavior (installing guards on non-zapped PTEs would have to be an error now but maybe start working later, which is user observable change thus can break somebody)
really required (leave behavior open when documenting).
and leaving it open when documenting doesn't really mean anything for the "we don't break userspace" promise vs what the implementation actually does.
Not quite I think. You could start return -EEXIST or -EOPNOTSUPP and document that this can change in the future to succeed if there is something. User space can sense support.
Yeah I mean originally I had a -EAGAIN which was sort of equivalent of this but Jann pointed out you're just shifting work to userland who would loop and repeat.
I just don't see why we'd do this.
In fact I was looking at the series and thinking 'wow it's actually a really small delta' and being proud but... still not KIS enough apparently ;)
You know, I read a lot of kernel code ... and mfill_atomic_install_pte() is what popped in my head: if there is already something, let user space handle it, because it is unexpected.
The uffd interface is slightly better, as it gives you the number of processed PTEs back, which madvise() is not designed for.
But maybe this (returning how many we already processed) is not required due to the nature of guard pages (below).
Something failing that at one point starts working is not really breaking user space, unless someone really *wants* to fail if there is already something (e.g., concurrent fault -> bail out instead of hiding it).
Of course, a more elegant solution would be GUARD_INSTALL vs. GUARD_FORCE_INSTALL.
.. but again, there seems to be more history to this.
I don't think there's really any value in that. There's just no sensible situation in which a user would care about this I don't think.
Making sure nobody touches an area, and while doing that somebody already touched that area? I guess it could be worked around by mprotect(PROT_NONE), madvise(GUARD), mprotect(PROT_READ|PROT_WRITE) ... which is not particularly nice :)
And if you're saying 'hey do MADV_DONTNEED if this fails and keep trying!' then why not just do that in the kernel?
Heh, no!
If user space doesn't expect there to be something, it should *fail*. That's likely going to be the majority of use cases for guard pages (happy to be told otherwise). No retry.
And if user space expects there to be something it should zap ahead of time (which some allocators maybe already do to free up memory after free()) to then install the guard. No retry.
There is this case where user space might be unsure. There, it might make sense to retry exactly once.
Trying to explain to a user 'hey this is for installing guard pages but if there's a racing fault it'll fail and that could keep happening and then you'll have to zap and maybe in a loop' just... seems like a bloody awful interface?
Hope my example above made it clearer. This "retry forever until it works" use case doesn't quite make sense to me, but I might just be missing something important.
But again, I have to do more reading on the history of the current approach ... and it's fairly late here.
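To make the alternative being discussed concrete: under the assumption that MADV_GUARD_POISON failed with EEXIST when the range is already populated - which is NOT what this patch implements, the patch zaps and retries in the kernel - userspace that is unsure about the range might look like the following sketch. The error code and the install_guard() helper are assumptions for illustration only.

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

int install_guard(void *addr, size_t len)
{
	if (!madvise(addr, len, MADV_GUARD_POISON))
		return 0;
	if (errno != EEXIST)
		return -1;
	/* Something was populated: zap it ourselves, then retry exactly once. */
	if (madvise(addr, len, MADV_DONTNEED))
		return -1;
	return madvise(addr, len, MADV_GUARD_POISON);
}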
On Mon, Oct 21, 2024 at 11:20:03PM +0200, David Hildenbrand wrote:
Yes, I see from Lorenzo's reply that there is apparently some history to this (maybe it's all nicely summarized in the cover letter / this patch, have to dig further).
Not sure yet what the problem is, I would have thought it's all protected by the PTL, and concurrent faults are user space doing something stupid and we'd detect it.
The looping mechanism is fine for dealing with concurrent faults. There's no actual _race_ due to PTL, it's just that a user could repeatedly populate stuff stupidly in a range that is meant to have poison markers put in.
It's not likely and would be kind of an abuse of the interface, and it'd really be a process just hurting itself.
In nearly all cases you won't zap at all. The whole point is it's optimistic. In 99.99% of others you zap once...
Exactly! And that's why I am questioning whether the kernel should care about that. See below.
Have to do some more reading on this.
May I suggest a book on the history of the prodigy?
:D
I'd normally agree with the KIS principle, but..
We can always implement support for that later if
it would either mean later we change behavior (installing guards on non-zapped PTEs would have to be an error now but maybe start working later, which is user observable change thus can break somebody)
really required (leave behavior open when documenting).
and leaving it open when documenting doesn't really mean anything for the "we don't break userspace" promise vs what the implementation actually does.
Not quite I think. You could start return -EEXIST or -EOPNOTSUPP and document that this can change in the future to succeed if there is something. User space can sense support.
Yeah I mean originally I had a -EAGAIN which was sort of equivalent of this but Jann pointed out you're just shifting work to userland who would loop and repeat.
I just don't see why we'd do this.
In fact I was looking at the series and thinking 'wow it's actually a really small delta' and being proud but... still not KIS enough apparently ;)
You know, I read a lot of kernel code ... and mfill_atomic_install_pte() is what popped in my head: if there is already something, let user space handle it, because it is unexpected.
The uffd interface is slightly better, as it gives you the number of processed PTEs back, which madvise() is not designed for.
But maybe this (returning how many we already processed) is not required due to the nature of guard pages (below).
Something failing that at one point starts working is not really breaking user space, unless someone really *wants* to fail if there is already something (e.g., concurrent fault -> bail out instead of hiding it).
Of course, a more elegant solution would be GUARD_INSTALL vs. GUARD_FORCE_INSTALL.
.. but again, there seems to be more history to this.
I don't think there's really any value in that. There's just no sensible situation in which a user would care about this I don't think.
Making sure nobody touches an area, and while doing that somebody already touched that area? I guess it could be worked around by mprotect(PROT_NONE), madvise(GUARD), mprotect(PROT_READ|PROT_WRITE) ... which is not particularly nice :)
And if you're saying 'hey do MADV_DONTNEED if this fails and keep trying!' then why not just do that in the kernel?
Heh, no!
If user space doesn't expect there to be something, it should *fail*. That's likely going to be the majority of use cases for guard pages (happy to be told otherwise). No retry.
And if user space expects there to be something it should zap ahead of time (which some allocators maybe already do to free up memory after free()) to then install the guard. No retry.
There is this case where user space might be unsure. There, it might make sense to retry exactly once.
Trying to explain to a user 'hey this is for installing guard pages but if there's a racing fault it'll fail and that could keep happening and then you'll have to zap and maybe in a loop' just... seems like a bloody awful interface?
Hope my example above made it clearer. This "retry forever until it works" use case doesn't quite make sense to me, but I might just be missing something important.
It won't be forever in any case other than a process that is abusing itself.
Then it's a security question and... see below!
But again, I have to do more reading on the history of the current approach ... and it's fairly late here.
Yes late for me too and I've been working since 8am pretty solid so... :)
So we all agree it's very unlikely, we all agree that even if it happens it'll happen only once, so what is the problem with the loop exactly? Other than philosophical?
We do have other loops in the kernel like this... it'd really be the same as the process throttling itself and whether it loops in userland or kernel land is kind of immaterial really.
So to me this comes down fundamentally - is this a genuine security/abuse concern?
Since Jann suggested this, I find it unlikely. But I have pinged him for his input.
Otherwise I don't know what we're arguing about... and I think we're fine!
On 10/21/24 23:20, David Hildenbrand wrote:
I don't think there's really any value in that. There's just no sensible situation in which a user would care about this I don't think.
Making sure nobody touches an area, and while doing that somebody already touched that area? I guess it could be worked around by mprotect(PROT_NONE), madvise(GUARD), mprotect(PROT_READ|PROT_WRITE) ... which is not particularly nice :)
And if you're saying 'hey do MADV_DONTNEED if this fails and keep trying!' then why not just do that in the kernel?
Heh, no!
If user space doesn't expect there to be something, it should *fail*. That's likely going to be the majority of use cases for guard pages (happy to be told otherwise). No retry.
And if user space expects there to be something it should zap ahead of time (which some allocators maybe already do to free up memory after free()) to then install the guard. No retry.
There is this case where user space might be unsure. There, it might make sense to retry exactly once.
I've thought so too and the RFC was implemented like this, but Jann came up with a scenario where a THP can cause the range including our to-be-installed guard pte to be populated even if the userspace is not trying to access that exact address, see here:
https://lore.kernel.org/all/CAG48ez3vqbqyWb4bLdpqSUnhwqGo2OQetecNhEGPdCGDr94...
So unless we can *reliably* detect that userspace is really shooting itself in the foot and return a failure to install guard ptes *only* in that case (which would be useful), without horribly complicating everything to ensure that reliability and to avoid false positives due to races with THPs, it's probably better to just retry as this version does.
Trying to explain to a user 'hey this is for installing guard pages but if there's a racing fault it'll fail and that could keep happening and then you'll have to zap and maybe in a loop' just... seems like a bloody awful interface?
Hope my example above made it clearer. This "retry forever until it works" use case doesn't quite make sense to me, but I might just be missing something important.
But again, I have to do more reading on the history of the current approach ... and it's fairly late here.
Yeah see the RFC thread linked above.
On Mon, Oct 21, 2024 at 11:35:24PM +0200, Vlastimil Babka wrote:
On 10/21/24 23:20, David Hildenbrand wrote:
I don't think there's really any value in that. There's just no sensible situation in which a user would care about this I don't think.
Making sure nobody touches an area, and while doing that somebody already touched that area? I guess it could be worked around by mprotect(PROT_NONE), madvise(GUARD), mprotect(PROT_READ|PROT_WRITE) ... which is not particularly nice :)
And if you're saying 'hey do MADV_DONTNEED if this fails and keep trying!' then why not just do that in the kernel?
Heh, no!
If user space doesn't expect there to be something, it should *fail*. That's likely going to be the majority of use cases for guard pages (happy to be told otherwise). No retry.
And if user space expects there to be something it should zap ahead of time (which some allocators maybe already do to free up memory after free()) to then install the guard. No retry.
There is this case where user space might be unsure. There, it might make sense to retry exactly once.
I've thought so too and the RFC was implemented like this, but Jann came up with a scenario where a THP can cause the range including our to-be-installed guard pte to be populated even if the userspace is not trying to access that exact address, see here:
https://lore.kernel.org/all/CAG48ez3vqbqyWb4bLdpqSUnhwqGo2OQetecNhEGPdCGDr94...
So unless we can *reliably* detect that userspace is really shooting itself in the foot and return a failure to install guard ptes *only* in that case (which would be useful), without horribly complicating everything to ensure that reliability and to avoid false positives due to races with THPs, it's probably better to just retry as this version does.
It would be complicated, and I'd reallly like to avoid trying to detect this. It feels a bit whack-a-mole because maybe there's other scenarios we've not thought about that could be equally problematic?
Trying to explain to a user 'hey this is for installing guard pages but if there's a racing fault it'll fail and that could keep happening and then you'll have to zap and maybe in a loop' just... seems like a bloody awful interface?
Hope my example above made it clearer. This "retry forever until it works" use case doesn't quite make sense to me, but I might just be missing something important.
But again, I have to do more reading on the history of the current approach ... and it's fairly late here.
Yeah see the RFC thread linked above.
Right, but I don't think this is the only scenario that can happen, and I think, FWIW, yet again the fundamental point comes down to 'is it a problem?'
Because if looping like this isn't, then problem solved we can all high 5 and go home listening to the prodigy and full happiness.
If not then we can revisit.
And how could it be a problem? Surely only security or DoS potential. Hopefully Jann can give some positive feedback on that.
We could also, and I hate to say it, but... try to find some means of checking on reasonable forward progress in the loop if we had to or some other 'reasonable attempt'.
But let's see...
On 21.10.24 23:35, Vlastimil Babka wrote:
On 10/21/24 23:20, David Hildenbrand wrote:
I don't think there's really any value in that. There's just no sensible situation in which a user would care about this I don't think.
Making sure nobody touches an area, and while doing that somebody already touched that area? I guess it could be worked around by mprotect(PROT_NONE), madvise(GUARD), mprotect(PROT_READ|PROT_WRITE) ... which is not particularly nice :)
And if you're saying 'hey do MADV_DONTNEED if this fails and keep trying!' then why not just do that in the kernel?
Heh, no!
If user space doesn't expect there to be something, it should *fail*. That's likely going to be the majority of use cases for guard pages (happy to be told otherwise). No retry.
And if user space expects there to be something it should zap ahead of time (which some allocators maybe already do to free up memory after free()) to then install the guard. No retry.
There is this case where user space might be unsure. There, it might make sense to retry exactly once.
I've thought so too and the RFC was implemented like this, but Jann came up with a scenario where a THP can cause the range including our to-be-installed guard pte to be populated even if the userspace is not trying to access that exact address, see here:
https://lore.kernel.org/all/CAG48ez3vqbqyWb4bLdpqSUnhwqGo2OQetecNhEGPdCGDr94...
Ah, THP, I should have realized that myself. Yes indeed, in some cases we'll have to zap because something was already populated. Not sure how often it will happen in practice, will depend on the use case.
For use cases like "one static guard page every 2MiB", I would assume we install the guard pages early, before expecting any page faults in that area. Likely there are other ones where it might happen more frequently.
For uffd that does not apply, because khugepaged backs off with uffd enabled and the only way to resolve a fault is using uffd -- which places exactly what was requested by user space. So, no populated PTEs without actual page faults on the corresponding virtual addresses.
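As a concrete sketch of that sort of use case (hypothetical, not from the series): poison one page at the start of every 2 MiB stride over an arena, up front, before any faults are expected in the area. install_periodic_guards() is an invented helper and MADV_GUARD_POISON's value is taken from the uapi patch later in this series.

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102	/* value proposed by this series */
#endif

/* Install one guard page at the start of every 'stride' bytes of [base, base + len). */
static int install_periodic_guards(char *base, size_t len,
				   size_t stride, size_t page_size)
{
	size_t off;

	for (off = 0; off + page_size <= len; off += stride) {
		if (madvise(base + off, page_size, MADV_GUARD_POISON))
			return -1;	/* errno set by madvise() */
	}

	return 0;
}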
On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise.
Currently users must establish PROT_NONE ranges to achieve this.
However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs.
In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges.
The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - poisoning PTEs such that accesses to them cause a fault followed by a SIGSEGV signal being raised.
This is achieved through the PTE marker mechanism, which a previous commit in this series extended to permit this to be done, installed via the generic page walking logic, also extended by a prior commit for this purpose.
These poison ranges are established with MADV_GUARD_POISON, and if the range in which they are installed contains any existing mappings, these will be zapped, i.e. the range freed and memory unmapped (thus mimicking the behaviour of MADV_DONTNEED in this respect).
Any existing poison entries will be left untouched. There is no nesting of poisoned pages.
Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather unexpected behaviour, but are cleared on process teardown or unmapping of memory ranges.
Ranges can have the poison property removed by MADV_GUARD_UNPOISON - 'remedying' the poisoning. The ranges over which this is applied, should they contain non-poison entries, will be untouched, only poison entries will be cleared.
We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive).
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
<snip>
+static long madvise_guard_poison(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	long err;
+
+	*prev = vma;
+	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
+		return -EINVAL;
+
+	/*
+	 * If we install poison markers, then the range is no longer
+	 * empty from a page table perspective and therefore it's
+	 * appropriate to have an anon_vma.
+	 *
+	 * This ensures that on fork, we copy page tables correctly.
+	 */
+	err = anon_vma_prepare(vma);
+	if (err)
+		return err;
+
+	/*
+	 * Optimistically try to install the guard poison pages first. If any
+	 * non-guard pages are encountered, give up and zap the range before
+	 * trying again.
+	 */
Should the page walker become powerful enough to handle this in one go? :)
I can tell you've not read previous threads...
I've addressed this in discussion with Jann - we'd have to do a full fat huge comprehensive thing to do an in-place replace.
It'd either have to be fully duplicative of the multiple copies of the very lengthy code to do this sort of thing right (some in mm/madvise.c itself) or I'd have to go off and do a totally new pre-requisite series centralising that in a way that people probably wouldn't accept... I'm not sure the benefits outweigh the costs.
But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
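To make the trade-off concrete, a hypothetical sketch of such an allocator debug-mode free path; debug_free_pages() is an invented helper. With the kernel zapping populated entries itself, only the single madvise() call is needed per free; requiring userspace to zap first would add an MADV_DONTNEED call each time.

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102	/* value proposed by this series */
#endif

/*
 * Debug-mode free: turn the freed, page-aligned range into guard pages so
 * any use-after-free faults immediately.
 */
static void debug_free_pages(void *ptr, size_t len)
{
	/*
	 * Only required if the kernel refused to zap populated entries on
	 * our behalf:
	 *
	 *	madvise(ptr, len, MADV_DONTNEED);
	 */
	madvise(ptr, len, MADV_GUARD_POISON);
}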
This is optimistic, as the comment says: you very often won't need to do this, so we do a little extra work in the case that you need to zap, vs. the more likely case where you don't.
In the face of racing faults, which we can't reasonably prevent without having to take the mmap write lock _and_ VMA lock, which is an egregious requirement, this wouldn't really save us anything anyway.
+	while (true) {
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_poison_walk_ops, NULL);
+		if (err <= 0)
+			return err;
+
+		/*
+		 * OK some of the range have non-guard pages mapped, zap
+		 * them. This leaves existing guard pages in place.
+		 */
+		zap_page_range_single(vma, start, end - start, NULL);
... however the potentially endless loop doesn't seem great. Could a malicious program keep refaulting the range (ignoring any segfaults if it loses a race) with one thread while failing to make progress here with another thread? Is that ok because it would only punish itself?
Sigh. Again, I don't think you've read the previous series have you? Or even the changelog... I added this as Jann asked for it. Originally we'd -EAGAIN if we got raced. See the discussion over in v1 for details.
I did it that way specifically to avoid such things, but Jann didn't appear to think it was a problem.
I mean if we have to retry the guard page installation more than once, it means the program *is* racing faults with installing guard ptes in the same range, right? So it would be right to segfault it. But I guess when we detect it here, we have no way to send the signal to the right thread and it would be too late? So unless we can do the PTE zap+install marker atomically, maybe there's no better way and this is acceptable as a malicious program can harm only itself?
Yup you'd only be hurting yourself. I went over this with Jann, who didn't appear to have a problem with this approach from a security perspective, in fact he explicitly asked me to do this :)
Maybe it would be just simpler to install the marker via zap_details rather than the pagewalk?
Ah the inevitable 'please completely rework how you do everything' comment I was expecting at some point :)
Obviously I've considered this (and a number of other approaches); it would fundamentally change what zap is - right now, if it can't traverse a page table level, that's job done (its job is to remove PTEs, not create them).
We'd instead have to completely rework the logic to be able to _install_ page tables and then carefully check we don't break anything and only do it in the specific cases we need.
Or we could add a mode that says 'replace with a poison marker' rather than zap, but that has potential TLB concerns, splits it across two operations (installation and zapping), and then could you really be sure that there isn't a really really badly timed race that would mean you'd have to loop again?
Right now it's simple, elegant, small and we can only make ourselves wait. I don't think this is a huge problem.
I think I'll need an actual security/DoS-based justification to change this.
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		cond_resched();
+	}
+}
+
+static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
+				    unsigned long next, struct mm_walk *walk)
+{
+	pte_t ptent = ptep_get(pte);
+
+	if (is_guard_pte_marker(ptent)) {
+		/* Simply clear the PTE marker. */
+		pte_clear_not_present_full(walk->mm, addr, pte, false);
+		update_mmu_cache(walk->vma, addr, pte);
+	}
+
+	return 0;
+}
+
+static const struct mm_walk_ops guard_unpoison_walk_ops = {
+	.pte_entry		= guard_unpoison_pte_entry,
+	.walk_lock		= PGWALK_RDLOCK,
+};
+
+static long madvise_guard_unpoison(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	*prev = vma;
+
+	/*
+	 * We're ok with unpoisoning mlock()'d ranges, as this is a
+	 * non-destructive action.
+	 */
+	if (!is_valid_guard_vma(vma, /* allow_locked = */true))
+		return -EINVAL;
+
+	return walk_page_range(vma->vm_mm, start, end,
+			       &guard_unpoison_walk_ops, NULL);
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1098,6 +1260,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	case MADV_COLLAPSE:
 		return madvise_collapse(vma, prev, start, end);
+	case MADV_GUARD_POISON:
+		return madvise_guard_poison(vma, prev, start, end);
+	case MADV_GUARD_UNPOISON:
+		return madvise_guard_unpoison(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1197,6 +1363,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_DODUMP:
 	case MADV_WIPEONFORK:
 	case MADV_KEEPONFORK:
+	case MADV_GUARD_POISON:
+	case MADV_GUARD_UNPOISON:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d06107d..d0e3ebfadef8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -236,7 +236,8 @@ static long change_pte_range(struct mmu_gather *tlb,
 		} else if (is_pte_marker_entry(entry)) {
 			/*
 			 * Ignore error swap entries unconditionally,
-			 * because any access should sigbus anyway.
+			 * because any access should sigbus/sigsegv
+			 * anyway.
 			 */
 			if (is_poisoned_swp_entry(entry))
 				continue;
diff --git a/mm/mseal.c b/mm/mseal.c
index ece977bd21e1..21bf5534bcf5 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior)
 	case MADV_REMOVE:
 	case MADV_DONTFORK:
 	case MADV_WIPEONFORK:
+	case MADV_GUARD_POISON:
 		return true;
 	}
On 10/21/24 22:27, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise.
Currently users must establish PROT_NONE ranges to achieve this.
However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs.
In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges.
The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - poisoning PTEs such that accesses to them cause a fault followed by a SIGSEGV signal being raised.
This is achieved through the PTE marker mechanism, which a previous commit in this series extended to permit this to be done, installed via the generic page walking logic, also extended by a prior commit for this purpose.
These poison ranges are established with MADV_GUARD_POISON, and if the range in which they are installed contains any existing mappings, these will be zapped, i.e. the range freed and memory unmapped (thus mimicking the behaviour of MADV_DONTNEED in this respect).
Any existing poison entries will be left untouched. There is no nesting of poisoned pages.
Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather unexpected behaviour, but are cleared on process teardown or unmapping of memory ranges.
Ranges can have the poison property removed by MADV_GUARD_UNPOISON - 'remedying' the poisoning. The ranges over which this is applied, should they contain non-poison entries, will be untouched, only poison entries will be cleared.
We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive).
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
<snip>
+static long madvise_guard_poison(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
+{
- long err;
- *prev = vma;
- if (!is_valid_guard_vma(vma, /* allow_locked = */false))
return -EINVAL;
- /*
* If we install poison markers, then the range is no longer
* empty from a page table perspective and therefore it's
* appropriate to have an anon_vma.
*
* This ensures that on fork, we copy page tables correctly.
*/
- err = anon_vma_prepare(vma);
- if (err)
return err;
- /*
* Optimistically try to install the guard poison pages first. If any
* non-guard pages are encountered, give up and zap the range before
* trying again.
*/
Should the page walker become powerful enough to handle this in one go? :)
I can tell you've not read previous threads...
Whoops, you're right, I did read v1 but forgot about the RFC.
But we can assume people who'll only see the code after it's merged will not have read it either, so since a potentially endless loop could be suspicious, expanding the comment to explain how it's fine wouldn't hurt?
I've addressed this in discussion with Jann - we'd have to do a full fat huge comprehensive thing to do an in-place replace.
It'd either have to be fully duplicative of the multiple copies of the very lengthy code to do this sort of thing right (some in mm/madvise.c itself) or I'd have to go off and do a totally new pre-requisite series centralising that in a way that people probably wouldn't accept... I'm not sure the benefits outweigh the costs.
But sure, if it's too big a task to teach it to zap ptes with all the tlb flushing etc (I assume it's something page walkers don't do today), it makes sense to do it this way. Or we could require userspace to zap first (MADV_DONTNEED), but that would unnecessarily mean extra syscalls for the use case of an allocator debug mode that wants to turn freed memory to guards to catch use after free. So this seems like a good compromise...
This is optimistic, as the comment says: you very often won't need to do this, so we do a little extra work in the case that you need to zap, vs. the more likely case where you don't.
In the face of racing faults, which we can't reasonably prevent without having to take the mmap write lock _and_ VMA lock, which is an egregious requirement, this wouldn't really save us anything anyway.
OK.
- while (true) {
/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
err = walk_page_range_mm(vma->vm_mm, start, end,
&guard_poison_walk_ops, NULL);
if (err <= 0)
return err;
/*
* OK some of the range have non-guard pages mapped, zap
* them. This leaves existing guard pages in place.
*/
zap_page_range_single(vma, start, end - start, NULL);
... however the potentially endless loop doesn't seem great. Could a malicious program keep refaulting the range (ignoring any segfaults if it loses a race) with one thread while failing to make progress here with another thread? Is that ok because it would only punish itself?
Sigh. Again, I don't think you've read the previous series have you? Or even the changelog... I added this as Jann asked for it. Originally we'd -EAGAIN if we got raced. See the discussion over in v1 for details.
I did it that way specifically to avoid such things, but Jann didn't appear to think it was a problem.
If Jann is fine with this then it must be secure enough.
I mean if we have to retry the guard page installation more than once, it means the program *is* racing faults with installing guard ptes in the same range, right? So it would be right to segfault it. But I guess when we detect it here, we have no way to send the signal to the right thread and it would be too late? So unless we can do the PTE zap+install marker atomically, maybe there's no better way and this is acceptable as a malicious program can harm only itself?
Yup you'd only be hurting yourself. I went over this with Jann, who didn't appear to have a problem with this approach from a security perspective, in fact he explicitly asked me to do this :)
Maybe it would be just simpler to install the marker via zap_details rather than the pagewalk?
Ah the inevitable 'please completely rework how you do everything' comment I was expecting at some point :)
Job security :)
j/k
On Mon, Oct 21, 2024 at 10:46 PM Vlastimil Babka vbabka@suse.cz wrote:
On 10/21/24 22:27, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
- while (true) {
/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
err = walk_page_range_mm(vma->vm_mm, start, end,
&guard_poison_walk_ops, NULL);
if (err <= 0)
return err;
/*
* OK some of the range have non-guard pages mapped, zap
* them. This leaves existing guard pages in place.
*/
zap_page_range_single(vma, start, end - start, NULL);
... however the potentially endless loop doesn't seem great. Could a malicious program keep refaulting the range (ignoring any segfaults if it loses a race) with one thread while failing to make progress here with another thread? Is that ok because it would only punish itself?
Sigh. Again, I don't think you've read the previous series have you? Or even the changelog... I added this as Jann asked for it. Originally we'd -EAGAIN if we got raced. See the discussion over in v1 for details.
I did it that way specifically to avoid such things, but Jann didn't appear to think it was a problem.
If Jann is fine with this then it must be secure enough.
My thinking there was:
We can legitimately race with adjacent faults populating the area we're operating on with THP pages; as long as the zapping and poison-marker-setting are separate, *someone* will have to do the retry. Either we do it in the kernel, or we tell userspace to handle it, but having the kernel take care of it is preferable because it makes the stable UAPI less messy.
One easy way to do it in the kernel would be to return -ERESTARTNOINTR after the zap_page_range_single() instead of jumping back up, which in terms of locking and signal handling and such would be equivalent to looping in userspace (because really that's what -ERESTARTNOINTR does - it returns out to userspace and moves the instruction pointer back to restart the syscall). Though if we do that immediately, it might make MADV_POISON unnecessarily slow, so we should probably retry once before doing that. The other easy way is to just loop here.
The cond_resched() and pending fatal signal check mean that (except on CONFIG_PREEMPT_NONE) the only differences between the current implementation and looping in userspace are that we don't handle non-fatal signals in between iterations and that we keep hogging the mmap_lock in read mode. We do already have a bunch of codepaths that retry on concurrent page table changes, like when zap_pte_range() encounters a pte_offset_map_lock() failure; though I guess the difference is that the retry on those is just a couple instructions, which would be harder to race consistently, while here we redo walks across the entire range, which should be fairly easy to race repeatedly.
So I guess you have a point that this might be the easiest way to stall other tasks that are trying to take mmap_lock for an extended amount of time, I did not fully consider that... and then I guess you could use that to slow down usercopy fault handling (once the lock switches to handoff mode because of a stalled writer?) or slow down other processes trying to read /proc/$pid/cmdline?
You can already indefinitely hog the mmap_lock with FUSE, though that requires that you can mount a FUSE filesystem (which you wouldn't be able to do in reasonably sandboxed code) and that you can find something like a pin_user_pages() call that can't drop the mmap lock in between, and there aren't actually that many of those...
So I guess you have a point and the -ERESTARTNOINTR approach would be a little bit nicer, as long as it's easy to implement.
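For comparison, this is roughly the loop the RFC's -EAGAIN behaviour would have pushed onto callers (hypothetical sketch, not from the series); with -ERESTARTNOINTR the kernel restarts the syscall transparently, so userspace only ever makes what looks like a single call.

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102	/* value proposed by this series */
#endif

static int guard_poison_retrying(void *addr, size_t len)
{
	int ret;

	/* Keep retrying while racing faults repopulate the range. */
	do {
		ret = madvise(addr, len, MADV_GUARD_POISON);
	} while (ret == -1 && errno == EAGAIN);

	return ret;
}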
On Tue, Oct 22, 2024 at 09:08:53PM +0200, Jann Horn wrote:
On Mon, Oct 21, 2024 at 10:46 PM Vlastimil Babka vbabka@suse.cz wrote:
On 10/21/24 22:27, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
- while (true) {
/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
err = walk_page_range_mm(vma->vm_mm, start, end,
&guard_poison_walk_ops, NULL);
if (err <= 0)
return err;
/*
* OK some of the range have non-guard pages mapped, zap
* them. This leaves existing guard pages in place.
*/
zap_page_range_single(vma, start, end - start, NULL);
... however the potentially endless loop doesn't seem great. Could a malicious program keep refaulting the range (ignoring any segfaults if it loses a race) with one thread while failing to make progress here with another thread? Is that ok because it would only punish itself?
Sigh. Again, I don't think you've read the previous series have you? Or even the changelog... I added this as Jann asked for it. Originally we'd -EAGAIN if we got raced. See the discussion over in v1 for details.
I did it that way specifically to avoid such things, but Jann didn't appear to think it was a problem.
If Jann is fine with this then it must be secure enough.
My thinking there was:
We can legitimately race with adjacent faults populating the area we're operating on with THP pages; as long as the zapping and poison-marker-setting are separate, *someone* will have to do the retry. Either we do it in the kernel, or we tell userspace to handle it, but having the kernel take care of it is preferable because it makes the stable UAPI less messy.
One easy way to do it in the kernel would be to return -ERESTARTNOINTR after the zap_page_range_single() instead of jumping back up, which in terms of locking and signal handling and such would be equivalent to looping in userspace (because really that's what -ERESTARTNOINTR does
- it returns out to userspace and moves the instruction pointer back
to restart the syscall). Though if we do that immediately, it might make MADV_POISON unnecessarily slow, so we should probably retry once before doing that. The other easy way is to just loop here.
Yes we should definitely retry probably a few times to cover the rare situation of a THP race as you describe under non-abusive circumstances.
The cond_resched() and pending fatal signal check mean that (except on CONFIG_PREEMPT_NONE) the only differences between the current implementation and looping in userspace are that we don't handle non-fatal signals in between iterations and that we keep hogging the mmap_lock in read mode. We do already have a bunch of codepaths that retry on concurrent page table changes, like when zap_pte_range() encounters a pte_offset_map_lock() failure; though I guess the difference is that the retry on those is just a couple instructions, which would be harder to race consistently, while here we redo walks across the entire range, which should be fairly easy to race repeatedly.
So I guess you have a point that this might be the easiest way to stall other tasks that are trying to take mmap_lock for an extended amount of time, I did not fully consider that... and then I guess you could use that to slow down usercopy fault handling (once the lock switches to handoff mode because of a stalled writer?) or slow down other processes trying to read /proc/$pid/cmdline?
Hm does that need a write lock?
You can already indefinitely hog the mmap_lock with FUSE, though that requires that you can mount a FUSE filesystem (which you wouldn't be able to do in reasonably sandboxed code) and that you can find something like a pin_user_pages() call that can't drop the mmap lock in between, and there aren't actually that many of those...
So I guess you have a point and the -ERESTARTNOINTR approach would be a little bit nicer, as long as it's easy to implement.
I can go ahead and do it that way if nobody objects, with a few loops before we do it... which hopefully covers off all the concerns?
Thanks
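A rough sketch of the shape such a change might take (an assumption of where v2 could land, not the actual patch; MAX_MADVISE_GUARD_RETRIES is an invented name), replacing the while (true) loop in madvise_guard_poison():

	int i;

	/*
	 * Optimistically attempt the install a bounded number of times to
	 * cover benign THP races, then fall back to -ERESTARTNOINTR so the
	 * syscall is transparently restarted rather than looping with
	 * mmap_lock held for read.
	 */
	for (i = 0; i < MAX_MADVISE_GUARD_RETRIES; i++) {
		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
		err = walk_page_range_mm(vma->vm_mm, start, end,
					 &guard_poison_walk_ops, NULL);
		if (err <= 0)
			return err;

		/* Populated entries encountered - zap and try again. */
		zap_page_range_single(vma, start, end - start, NULL);

		if (fatal_signal_pending(current))
			return -EINTR;
		cond_resched();
	}

	return -ERESTARTNOINTR;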
On Tue, Oct 22, 2024 at 9:35 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Tue, Oct 22, 2024 at 09:08:53PM +0200, Jann Horn wrote:
On Mon, Oct 21, 2024 at 10:46 PM Vlastimil Babka vbabka@suse.cz wrote:
On 10/21/24 22:27, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
- while (true) {
/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
err = walk_page_range_mm(vma->vm_mm, start, end,
&guard_poison_walk_ops, NULL);
if (err <= 0)
return err;
/*
* OK some of the range have non-guard pages mapped, zap
* them. This leaves existing guard pages in place.
*/
zap_page_range_single(vma, start, end - start, NULL);
... however the potentially endless loop doesn't seem great. Could a malicious program keep refaulting the range (ignoring any segfaults if it loses a race) with one thread while failing to make progress here with another thread? Is that ok because it would only punish itself?
Sigh. Again, I don't think you've read the previous series have you? Or even the changelog... I added this as Jann asked for it. Originally we'd -EAGAIN if we got raced. See the discussion over in v1 for details.
I did it that way specifically to avoid such things, but Jann didn't appear to think it was a problem.
If Jann is fine with this then it must be secure enough.
My thinking there was:
We can legitimately race with adjacent faults populating the area we're operating on with THP pages; as long as the zapping and poison-marker-setting are separate, *someone* will have to do the retry. Either we do it in the kernel, or we tell userspace to handle it, but having the kernel take care of it is preferable because it makes the stable UAPI less messy.
One easy way to do it in the kernel would be to return -ERESTARTNOINTR after the zap_page_range_single() instead of jumping back up, which in terms of locking and signal handling and such would be equivalent to looping in userspace (because really that's what -ERESTARTNOINTR does
- it returns out to userspace and moves the instruction pointer back
to restart the syscall). Though if we do that immediately, it might make MADV_POISON unnecessarily slow, so we should probably retry once before doing that. The other easy way is to just loop here.
Yes we should definitely retry probably a few times to cover the rare situation of a THP race as you describe under non-abusive circumstances.
The cond_resched() and pending fatal signal check mean that (except on CONFIG_PREEMPT_NONE) the only differences between the current implementation and looping in userspace are that we don't handle non-fatal signals in between iterations and that we keep hogging the mmap_lock in read mode. We do already have a bunch of codepaths that retry on concurrent page table changes, like when zap_pte_range() encounters a pte_offset_map_lock() failure; though I guess the difference is that the retry on those is just a couple instructions, which would be harder to race consistently, while here we redo walks across the entire range, which should be fairly easy to race repeatedly.
So I guess you have a point that this might be the easiest way to stall other tasks that are trying to take mmap_lock for an extended amount of time, I did not fully consider that... and then I guess you could use that to slow down usercopy fault handling (once the lock switches to handoff mode because of a stalled writer?) or slow down other processes trying to read /proc/$pid/cmdline?
Hm does that need a write lock?
No, but if you have one reader that is hogging the rwsem, and then a writer is queued up on the rwsem afterwards, I think new readers will sometimes be queued up behind the writer. So even though the rwsem is only actually held by a reader, new readers can't immediately take the rwsem because the rwsem code thinks that would be unfair to a pending writer who just wants to make some quick change. I'm not super familiar with this code, but basically I think it works roughly like this:
If the rwsem code notices that a bunch of readers are preventing a writer from taking the lock, the rwsem code will start queuing new readers behind the queued writer. You can see in rwsem_read_trylock() that the trylock fastpath is skipped if anyone is waiting on the rwsem or the handoff flag is set, and in rwsem_down_read_slowpath() the "reader optimistic lock stealing" path is skipped if the lock is currently held by multiple readers or if the handoff bit is set.
The handoff bit can be set in rwsem_try_write_lock() by a writer "if it is an RT task or wait in the wait queue for too long". Basically I think it means something like "I think other users of the lock are hogging it more than they should, stop stealing the lock from me". And the RWSEM_WAIT_TIMEOUT for triggering handoff mode is pretty short, RWSEM_WAIT_TIMEOUT is defined to something like 4ms, so I think that's how long writers tolerate the lock being hogged by readers before they prevent new readers from stealing the lock.
On Tue, Oct 22, 2024 at 09:57:39PM +0200, Jann Horn wrote:
On Tue, Oct 22, 2024 at 9:35 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Tue, Oct 22, 2024 at 09:08:53PM +0200, Jann Horn wrote:
On Mon, Oct 21, 2024 at 10:46 PM Vlastimil Babka vbabka@suse.cz wrote:
On 10/21/24 22:27, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
+	while (true) {
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_poison_walk_ops, NULL);
+		if (err <= 0)
+			return err;
+
+		/*
+		 * OK some of the range have non-guard pages mapped, zap
+		 * them. This leaves existing guard pages in place.
+		 */
+		zap_page_range_single(vma, start, end - start, NULL);
... however the potentially endless loop doesn't seem great. Could a malicious program keep refaulting the range (ignoring any segfaults if it loses a race) with one thread while failing to make progress here with another thread? Is that ok because it would only punish itself?
Sigh. Again, I don't think you've read the previous series have you? Or even the changelog... I added this as Jann asked for it. Originally we'd -EAGAIN if we got raced. See the discussion over in v1 for details.
I did it that way specifically to avoid such things, but Jann didn't appear to think it was a problem.
If Jann is fine with this then it must be secure enough.
My thinking there was:
We can legitimately race with adjacent faults populating the area we're operating on with THP pages; as long as the zapping and poison-marker-setting are separate, *someone* will have to do the retry. Either we do it in the kernel, or we tell userspace to handle it, but having the kernel take care of it is preferable because it makes the stable UAPI less messy.
One easy way to do it in the kernel would be to return -ERESTARTNOINTR after the zap_page_range_single() instead of jumping back up, which in terms of locking and signal handling and such would be equivalent to looping in userspace (because really that's what -ERESTARTNOINTR does
- it returns out to userspace and moves the instruction pointer back
to restart the syscall). Though if we do that immediately, it might make MADV_POISON unnecessarily slow, so we should probably retry once before doing that. The other easy way is to just loop here.
Yes we should definitely retry probably a few times to cover the rare situation of a THP race as you describe under non-abusive circumstances.
The cond_resched() and pending fatal signal check mean that (except on CONFIG_PREEMPT_NONE) the only differences between the current implementation and looping in userspace are that we don't handle non-fatal signals in between iterations and that we keep hogging the mmap_lock in read mode. We do already have a bunch of codepaths that retry on concurrent page table changes, like when zap_pte_range() encounters a pte_offset_map_lock() failure; though I guess the difference is that the retry on those is just a couple instructions, which would be harder to race consistently, while here we redo walks across the entire range, which should be fairly easy to race repeatedly.
So I guess you have a point that this might be the easiest way to stall other tasks that are trying to take mmap_lock for an extended amount of time, I did not fully consider that... and then I guess you could use that to slow down usercopy fault handling (once the lock switches to handoff mode because of a stalled writer?) or slow down other processes trying to read /proc/$pid/cmdline?
Hm does that need a write lock?
No, but if you have one reader that is hogging the rwsem, and then a writer is queued up on the rwsem afterwards, I think new readers will sometimes be queued up behind the writer. So even though the rwsem is only actually held by a reader, new readers can't immediately take the rwsem because the rwsem code thinks that would be unfair to a pending writer who just wants to make some quick change. I'm not super familiar with this code, but basically I think it works roughly like this:
If the rwsem code notices that a bunch of readers are preventing a writer from taking the lock, the rwsem code will start queuing new readers behind the queued writer. You can see in rwsem_read_trylock() that the trylock fastpath is skipped if anyone is waiting on the rwsem or the handoff flag is set, and in rwsem_down_read_slowpath() the "reader optimistic lock stealing" path is skipped if the lock is currently held by multiple readers or if the handoff bit is set.
The handoff bit can be set in rwsem_try_write_lock() by a writer "if it is an RT task or wait in the wait queue for too long". Basically I think it means something like "I think other users of the lock are hogging it more than they should, stop stealing the lock from me". And the RWSEM_WAIT_TIMEOUT for triggering handoff mode is pretty short, RWSEM_WAIT_TIMEOUT is defined to something like 4ms, so I think that's how long writers tolerate the lock being hogged by readers before they prevent new readers from stealing the lock.
Ack, makes sense! -ERESTARTNOINTR should help resolve this, so unless anybody has any objection to that I'll go ahead and do a respin taking that approach (+ all other fixes) for v2.
Thanks for your input!
Import the new MADV_GUARD_POISON/UNPOISON madvise flags.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/include/uapi/asm-generic/mman-common.h | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..5dfd3d442de4 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,9 @@
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_POISON 102		/* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103		/* revoke guard poisoning */
+
 /* compatibility flags */
 #define MAP_FILE	0
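A minimal example of consuming these flags from userspace (hypothetical; alloc_with_tail_guard() is an invented helper): map a buffer with one trailing guard page so overruns fault immediately.

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102
#define MADV_GUARD_UNPOISON 103
#endif

static char *alloc_with_tail_guard(size_t npages)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, (npages + 1) * psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* Accessing the final page now raises a fatal signal. */
	if (madvise(p + npages * psz, psz, MADV_GUARD_POISON)) {
		munmap(p, (npages + 1) * psz);
		return NULL;
	}

	return p;
}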
Utilise the kselftest harness to implement tests for the guard page implementation.
We start by implementing basic tests asserting that guard pages can be established (poisoned), cleared (remedied) and that touching poisoned pages results in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine how different operations on regions containing poison markers behave, to ensure correct behaviour:
* Operations over multiple VMAs operate as expected.
* Invoking MADV_GUARD_POISON / MADV_GUARD_UNPOISON via process_madvise() in
  batches works correctly.
* Ensuring that munmap() correctly tears down poison markers.
* Using mprotect() to adjust protection bits does not in any way override or
  cause issues with poison markers.
* Ensuring that splitting and merging VMAs around poison markers causes no
  issue - i.e. that a marker which 'belongs' to one VMA can function just as
  well 'belonging' to another.
* Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
* Ensuring that mlock()'ing a range containing poison markers does not cause
  issues.
* Ensuring that mremap() can move a poisoned range and retain poison markers.
* Ensuring that mremap() can expand a poisoned range and retain poison markers
  (perhaps moving the range).
* Ensuring that mremap() can shrink a poisoned range and retain poison markers.
* Ensuring that forking a process correctly retains poison markers.
* Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
* Ensuring that lazyfree simply clears poison markers.
* Ensuring that userfaultfd can co-exist with guard pages.
* Ensuring that madvise(..., MADV_POPULATE_READ) and
  madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
* Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not
  remove poison markers.
If any test is unable to be run due to lack of permissions, that test is skipped.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/.gitignore    |    1 +
 tools/testing/selftests/mm/Makefile      |    1 +
 tools/testing/selftests/mm/guard-pages.c | 1228 ++++++++++++++++++++++
 3 files changed, 1230 insertions(+)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages
ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..f67d2700d44a --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1228 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <asm-generic/mman.h> /* Force the import of the tools version. */ +#include <assert.h> +#include <errno.h> +#include <fcntl.h> +#include <linux/userfaultfd.h> +#include <setjmp.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h> + +/* + * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1: + * + * "If the signal occurs other than as the result of calling the abort or raise + * function, the behavior is undefined if the signal handler refers to any + * object with static storage duration other than by assigning a value to an + * object declared as volatile sig_atomic_t" + */ +static volatile sig_atomic_t signal_jump_set; +static sigjmp_buf signal_jmp_buf; + +/* + * Ignore the checkpatch warning, we must read from x but don't want to do + * anything with it in order to trigger a read page fault. We therefore must use + * volatile to stop the compiler from optimising this away. + */ +#define FORCE_READ(x) (*(volatile typeof(x) *)x) + +static int userfaultfd(int flags) +{ + return syscall(SYS_userfaultfd, flags); +} + +static void handle_fatal(int c) +{ + if (!signal_jump_set) + return; + + siglongjmp(signal_jmp_buf, c); +} + +static int pidfd_open(pid_t pid, unsigned int flags) +{ + return syscall(SYS_pidfd_open, pid, flags); +} + +/* + * Enable our signal catcher and try to read/write the specified buffer. The + * return value indicates whether the read/write succeeds without a fatal + * signal. + */ +static bool try_access_buf(char *ptr, bool write) +{ + bool failed; + + /* Tell signal handler to jump back here on fatal signal. */ + signal_jump_set = true; + /* If a fatal signal arose, we will jump back here and failed is set. */ + failed = sigsetjmp(signal_jmp_buf, 0) != 0; + + if (!failed) { + if (write) + *ptr = 'x'; + else + FORCE_READ(ptr); + } + + signal_jump_set = false; + return !failed; +} + +/* Try and read from a buffer, return true if no fatal signal. */ +static bool try_read_buf(char *ptr) +{ + return try_access_buf(ptr, false); +} + +/* Try and write to a buffer, return true if no fatal signal. */ +static bool try_write_buf(char *ptr) +{ + return try_access_buf(ptr, true); +} + +/* + * Try and BOTH read from AND write to a buffer, return true if BOTH operations + * succeed. 
+ */ +static bool try_read_write_buf(char *ptr) +{ + return try_read_buf(ptr) && try_write_buf(ptr); +} + +FIXTURE(guard_pages) +{ + unsigned long page_size; +}; + +FIXTURE_SETUP(guard_pages) +{ + struct sigaction act = { + .sa_handler = &handle_fatal, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + if (sigaction(SIGSEGV, &act, NULL)) + ksft_exit_fail_perror("sigaction"); + + self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); +}; + +FIXTURE_TEARDOWN(guard_pages) +{ + struct sigaction act = { + .sa_handler = SIG_DFL, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + sigaction(SIGSEGV, &act, NULL); +} + +TEST_F(guard_pages, basic) +{ + const unsigned long NUM_PAGES = 10; + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Trivially assert we can touch the first page. */ + ASSERT_TRUE(try_read_write_buf(ptr)); + + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + + /* Establish that 1st page SIGSEGV's. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + + /* Ensure we can touch everything else.*/ + for (i = 1; i < NUM_PAGES; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Establish a guard page at the end of the mapping. */ + ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size, + MADV_GUARD_POISON), 0); + + /* Check that both guard pages result in SIGSEGV. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size])); + + /* Unpoison the first. */ + ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_UNPOISON)); + + /* Make sure we can touch it. */ + ASSERT_TRUE(try_read_write_buf(ptr)); + + /* Unpoison the last. */ + ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size, + MADV_GUARD_UNPOISON)); + + /* Make sure we can touch it. */ + ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size])); + + /* + * Test setting a _range_ of pages, namely the first 3. The first of + * these be faulted in, so this also tests that we can poison backed + * pages. + */ + ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure they are all poisoned. */ + for (i = 0; i < 3; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Make sure the rest are not. */ + for (i = 3; i < NUM_PAGES; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Unpoison them. */ + ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0); + + /* Now make sure we can touch everything. */ + for (i = 0; i < NUM_PAGES; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Now unpoison everything, make sure we don't remove existing entries */ + ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0); + + for (i = 0; i < NUM_PAGES * page_size; i += page_size) { + char chr = ptr[i]; + + ASSERT_EQ(chr, 'x'); + } + + ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0); +} + +/* Assert that operations applied across multiple VMAs work as expected. */ +TEST_F(guard_pages, multi_vma) +{ + const unsigned long page_size = self->page_size; + char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3; + int i; + + /* Reserve a 100 page region over which we can install VMAs. 
*/ + ptr_region = mmap(NULL, 100 * page_size, PROT_NONE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_region, MAP_FAILED); + + /* Place a VMA of 10 pages size at the start of the region. */ + ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr1, MAP_FAILED); + + /* Place a VMA of 5 pages size 50 pages into the region. */ + ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Place a VMA of 20 pages size at the end of the region. */ + ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Unmap gaps. */ + ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0); + ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0); + + /* + * We end up with VMAs like this: + * + * 0 10 .. 50 55 .. 80 100 + * [---] [---] [---] + */ + + /* Now poison the whole range and make sure all VMAs are poisoned. */ + + /* + * madvise() is certifiable and lets you perform operations over gaps, + * everything works, but it indicates an error and errno is set to + * -ENOMEM. Also if anything runs out of memory it is set to + * -ENOMEM. You are meant to guess which is which. + */ + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), -1); + ASSERT_EQ(errno, ENOMEM); + + for (i = 0; i < 10; i++) { + char *curr = &ptr1[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + for (i = 0; i < 5; i++) { + char *curr = &ptr2[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + for (i = 0; i < 20; i++) { + char *curr = &ptr3[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Now unpoison the range and assert the opposite. */ + + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), -1); + ASSERT_EQ(errno, ENOMEM); + + for (i = 0; i < 10; i++) { + char *curr = &ptr1[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + for (i = 0; i < 5; i++) { + char *curr = &ptr2[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + for (i = 0; i < 20; i++) { + char *curr = &ptr3[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Now map incompatible VMAs in the gaps. */ + ptr = mmap(&ptr_region[10 * page_size], 40 * page_size, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr = mmap(&ptr_region[55 * page_size], 25 * page_size, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * We end up with VMAs like this: + * + * 0 10 .. 50 55 .. 80 100 + * [---][xxxx][---][xxxx][---] + * + * Where 'x' signifies VMAs that cannot be merged with those adjacent to + * them. + */ + + /* Multiple VMAs adjacent to one another should result in no error. */ + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), 0); + for (i = 0; i < 100; i++) { + char *curr = &ptr_region[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), 0); + for (i = 0; i < 100; i++) { + char *curr = &ptr_region[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Cleanup. 
*/ + ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0); +} + +/* + * Assert that batched operations performed using process_madvise() work as + * expected. + */ +TEST_F(guard_pages, process_madvise) +{ + const unsigned long page_size = self->page_size; + pid_t pid = getpid(); + int pidfd = pidfd_open(pid, 0); + char *ptr_region, *ptr1, *ptr2, *ptr3; + ssize_t count; + struct iovec vec[6]; + + ASSERT_NE(pidfd, -1); + + /* Reserve region to map over. */ + ptr_region = mmap(NULL, 100 * page_size, PROT_NONE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_region, MAP_FAILED); + + /* 10 pages offset 1 page into reserve region. */ + ptr1 = mmap(&ptr_region[page_size], 10 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr1, MAP_FAILED); + /* We want poison markers at start/end of each VMA. */ + vec[0].iov_base = ptr1; + vec[0].iov_len = page_size; + vec[1].iov_base = &ptr1[9 * page_size]; + vec[1].iov_len = page_size; + + /* 5 pages offset 50 pages into reserve region. */ + ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + vec[2].iov_base = ptr2; + vec[2].iov_len = page_size; + vec[3].iov_base = &ptr2[4 * page_size]; + vec[3].iov_len = page_size; + + /* 20 pages offset 79 pages into reserve region. */ + ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + vec[4].iov_base = ptr3; + vec[4].iov_len = page_size; + vec[5].iov_base = &ptr3[19 * page_size]; + vec[5].iov_len = page_size; + + /* Free surrounding VMAs. */ + ASSERT_EQ(munmap(ptr_region, page_size), 0); + ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0); + ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0); + ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0); + + /* Now poison in one step. */ + count = process_madvise(pidfd, vec, 6, MADV_GUARD_POISON, 0); + + /* OK we don't have permission to do this, skip. */ + if (count == -1 && errno == EPERM) + ksft_exit_skip("No process_madvise() permissions, try running as root.\n"); + + /* Returns the number of bytes advised. */ + ASSERT_EQ(count, 6 * page_size); + + /* Now make sure the poisoning was applied. */ + + ASSERT_FALSE(try_read_write_buf(ptr1)); + ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size])); + + ASSERT_FALSE(try_read_write_buf(ptr2)); + ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size])); + + ASSERT_FALSE(try_read_write_buf(ptr3)); + ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size])); + + /* Now do the same with unpoison... */ + count = process_madvise(pidfd, vec, 6, MADV_GUARD_UNPOISON, 0); + + /* ...and everything should now succeed. */ + + ASSERT_TRUE(try_read_write_buf(ptr1)); + ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size])); + + ASSERT_TRUE(try_read_write_buf(ptr2)); + ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size])); + + ASSERT_TRUE(try_read_write_buf(ptr3)); + ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size])); + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr1, 10 * page_size), 0); + ASSERT_EQ(munmap(ptr2, 5 * page_size), 0); + ASSERT_EQ(munmap(ptr3, 20 * page_size), 0); + close(pidfd); +} + +/* Assert that unmapping ranges does not leave poison behind. 
*/ +TEST_F(guard_pages, munmap) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new1, *ptr_new2; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison first and last pages. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Assert that they are poisoned. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size])); + + /* Unmap them. */ + ASSERT_EQ(munmap(ptr, page_size), 0); + ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0); + + /* Map over them.*/ + ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new1, MAP_FAILED); + ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new2, MAP_FAILED); + + /* Assert that they are now not poisoned. */ + ASSERT_TRUE(try_read_write_buf(ptr_new1)); + ASSERT_TRUE(try_read_write_buf(ptr_new2)); + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Assert that mprotect() operations have no bearing on guard poison markers. */ +TEST_F(guard_pages, mprotect) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison the middle of the range. */ + ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size, + MADV_GUARD_POISON), 0); + + /* Assert that it is indeed poisoned. */ + ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size])); + ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size])); + + /* Now make these pages read-only. */ + ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0); + + /* Make sure the range is still poisoned. */ + ASSERT_FALSE(try_read_buf(&ptr[5 * page_size])); + ASSERT_FALSE(try_read_buf(&ptr[6 * page_size])); + + /* Make sure we can poison again without issue.*/ + ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size, + MADV_GUARD_POISON), 0); + + /* Make sure the range is, yet again, still poisoned. */ + ASSERT_FALSE(try_read_buf(&ptr[5 * page_size])); + ASSERT_FALSE(try_read_buf(&ptr[6 * page_size])); + + /* Now unpoison the whole range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + /* Make sure the whole range is readable. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_buf(curr)); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Split and merge VMAs and make sure guard pages still behave. */ +TEST_F(guard_pages, split_merge) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison the whole range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure the whole range is poisoned. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Now unmap some pages in the range so we split. 
*/ + ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0); + ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0); + ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0); + + /* Make sure the remaining ranges are poisoned post-split. */ + for (i = 0; i < 2; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + for (i = 2; i < 5; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + for (i = 6; i < 8; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + for (i = 9; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Now map them again - the unmap will have cleared the poison. */ + ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + + /* Now make sure poisoning is as expected. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + bool result = try_read_write_buf(curr); + bool expect_true = i == 2 || i == 5 || i == 8; + + ASSERT_TRUE(expect_true ? result : !result); + } + + /* Now poison everything again. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure the whole range is poisoned. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Now split the range into three. */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0); + + /* Make sure the whole range is poisoned for read. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_buf(curr)); + } + + /* Now reset protection bits so we merge the whole thing. */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, + PROT_READ | PROT_WRITE), 0); + + /* Make sure the whole range is still poisoned. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Split range into 3 again... */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0); + + /* ...and unpoison the whole range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + /* Make sure the whole range is remedied for read. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_buf(curr)); + } + + /* Merge them again. */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, + PROT_READ | PROT_WRITE), 0); + + /* Now ensure the merged range is remedied for read/write. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Assert that MADV_DONTNEED does not remove guard poison markers. 
*/ +TEST_F(guard_pages, dontneed) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Back the whole range. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + *curr = 'y'; + } + + /* Poison every other page. */ + for (i = 0; i < 10; i += 2) { + char *curr = &ptr[i * page_size]; + int res = madvise(curr, page_size, MADV_GUARD_POISON); + + ASSERT_EQ(res, 0); + } + + /* Indicate that we don't need any of the range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0); + + /* Check to ensure poison markers are still in place. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + bool result = try_read_buf(curr); + + if (i % 2 == 0) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + /* Make sure we really did get reset to zero page. */ + ASSERT_EQ(*curr, '\0'); + } + + /* Now write... */ + result = try_write_buf(&ptr[i * page_size]); + + /* ...and make sure same result. */ + ASSERT_TRUE(i % 2 != 0 ? result : !result); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Assert that mlock()'ed pages work correctly with poison markers. */ +TEST_F(guard_pages, mlock) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Populate. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + *curr = 'y'; + } + + /* Lock. */ + ASSERT_EQ(mlock(ptr, 10 * page_size), 0); + + /* Now try to poison, should fail with EINVAL. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), -1); + ASSERT_EQ(errno, EINVAL); + + /* OK unlock. */ + ASSERT_EQ(munlock(ptr, 10 * page_size), 0); + + /* Poison first half of range, should now succeed. */ + ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure poison works. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + bool result = try_read_write_buf(curr); + + if (i < 5) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + ASSERT_EQ(*curr, 'x'); + } + } + + /* + * Now lock the latter part of the range. We can't lock the poisoned + * pages, as this would result in the pages being populated and the + * poisoning would cause this to error out. + */ + ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0); + + /* + * Now unpoison, we do not permit mlock()'d ranges to be remedied as it is + * a non-destructive operation. + */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + /* Now check that everything is remedied. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* + * Assert that moving, extending and shrinking memory via mremap() retains + * poison markers where possible. + * + * - Moving a mapping alone should retain markers as they are. + */ +TEST_F(guard_pages, mremap_move) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new; + + /* Map 5 pages. */ + ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Place poison markers at both ends of the 5 page span. 
*/ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Make sure the poison is in effect. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Map a new region we will move this range into. Doing this ensures + * that we have reserved a range to map into. + */ + ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE, + -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + + ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new); + + /* Make sure the poison is retained. */ + ASSERT_FALSE(try_read_write_buf(ptr_new)); + ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size])); + + /* + * Clean up - we only need reference the new pointer as we overwrote the + * PROT_NONE range and moved the existing one. + */ + munmap(ptr_new, 5 * page_size); +} + +/* + * Assert that moving, extending and shrinking memory via mremap() retains + * poison markers where possible. + * + * - Expanding should retain, only now in different position. The user will have + * to unpoison manually to fix up (they'd have to do the same if it were a + * PROT_NONE mapping) + */ +TEST_F(guard_pages, mremap_expand) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new; + + /* Map 10 pages... */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + /* ...But unmap the last 5 so we can ensure we can expand into them. */ + ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0); + + /* Place poison markers at both ends of the 5 page span. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Make sure the poison is in effect. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Now expand to 10 pages. */ + ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Make sure the poison is retained in its original positions. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Reserve a region which we can move to and expand into. */ + ptr_new = mmap(NULL, 20 * page_size, PROT_NONE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + + /* Now move and expand into it. */ + ptr = mremap(ptr, 10 * page_size, 20 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new); + ASSERT_EQ(ptr, ptr_new); + + /* Again, make sure the poison is retained in its original positions. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* + * A real user would have to unpoison, but would reasonably expect all + * characteristics of the mapping to be retained, including poison + * markers. + */ + + /* Cleanup. */ + munmap(ptr, 20 * page_size); +} +/* + * Assert that moving, extending and shrinking memory via mremap() retains + * poison markers where possible. + * + * - Shrinking will result in markers that are shrunk over being removed. Again, + * if the user were using a PROT_NONE mapping they'd have to manually fix this + * up also so this is OK. + */ +TEST_F(guard_pages, mremap_shrink) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + /* Map 5 pages. 
*/ + ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Place poison markers at both ends of the 5 page span. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Make sure the poison is in effect. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Now shrink to 3 pages. */ + ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE); + ASSERT_NE(ptr, MAP_FAILED); + + /* We expect the poison marker at the start to be retained... */ + ASSERT_FALSE(try_read_write_buf(ptr)); + + /* ...But remaining pages will not have poison markers. */ + for (i = 1; i < 3; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* + * As with expansion, a real user would have to unpoison and fixup. But + * you'd have to do similar manual things with PROT_NONE mappings too. + */ + + /* + * If we expand back to the original size, the end marker will, of + * course, no longer be present. + */ + ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Again, we expect the poison marker at the start to be retained... */ + ASSERT_FALSE(try_read_write_buf(ptr)); + + /* ...But remaining pages will not have poison markers. */ + for (i = 1; i < 5; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Cleanup. */ + munmap(ptr, 5 * page_size); +} + +/* + * Assert that forking a process with VMAs that do not have VM_WIPEONFORK set + * retain guard pages. + */ +TEST_F(guard_pages, fork) +{ + const unsigned long page_size = self->page_size; + char *ptr; + pid_t pid; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison the first 5 pages. */ + ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0); + + pid = fork(); + ASSERT_NE(pid, -1); + if (!pid) { + /* This is the child process now. */ + + /* Assert that the poisoning is in effect. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + bool result = try_read_write_buf(curr); + + ASSERT_TRUE(i >= 5 ? result : !result); + } + + /* Now unpoison the range.*/ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + exit(0); + } + + /* Parent process. */ + + /* Parent simply waits on child. */ + waitpid(pid, NULL, 0); + + /* Child unpoison does not impact parent page table state. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + bool result = try_read_write_buf(curr); + + ASSERT_TRUE(i >= 5 ? result : !result); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* + * Assert that forking a process with VMAs that do have VM_WIPEONFORK set + * behave as expected. + */ +TEST_F(guard_pages, fork_wipeonfork) +{ + const unsigned long page_size = self->page_size; + char *ptr; + pid_t pid; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Mark wipe on fork. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0); + + /* Poison the first 5 pages. */ + ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0); + + pid = fork(); + ASSERT_NE(pid, -1); + if (!pid) { + /* This is the child process now. 
*/ + + /* Poison will have been wiped. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + exit(0); + } + + /* Parent process. */ + + waitpid(pid, NULL, 0); + + /* Poison should be in effect.*/ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + bool result = try_read_write_buf(curr); + + ASSERT_TRUE(i >= 5 ? result : !result); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that MADV_FREE frees poison entries as expected. */ +TEST_F(guard_pages, lazyfree) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Ensure poisoned. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Lazyfree range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0); + + /* This should simply clear the poison markers. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_write_buf(curr)); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */ +TEST_F(guard_pages, populate) +{ + const unsigned long page_size = self->page_size; + char *ptr; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Populate read should error out... */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1); + ASSERT_EQ(errno, EFAULT); + + /* ...as should populate write. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1); + ASSERT_EQ(errno, EFAULT); + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove poison markers. */ +TEST_F(guard_pages, cold_pageout) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Ensured poisoned. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Now mark cold. This should have no impact on poison markers. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0); + + /* Should remain poisoned. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* OK, now page out. This should equally, have no effect on markers. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0); + + /* Should remain poisoned. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that guard pages do not break userfaultd. 
*/ +TEST_F(guard_pages, uffd) +{ + const unsigned long page_size = self->page_size; + int uffd; + char *ptr; + int i; + struct uffdio_api api = { + .api = UFFD_API, + .features = 0, + }; + struct uffdio_register reg; + struct uffdio_range range; + + /* Set up uffd. */ + uffd = userfaultfd(0); + if (uffd == -1 && errno == EPERM) + ksft_exit_skip("No userfaultfd permissions, try running as root.\n"); + ASSERT_NE(uffd, -1); + + ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0); + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Register the range with uffd. */ + range.start = (unsigned long)ptr; + range.len = 10 * page_size; + reg.range = range; + reg.mode = UFFDIO_REGISTER_MODE_MISSING; + ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, ®), 0); + + /* Poison the range. This should not trigger the uffd. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* The poisoning should behave as usual with no uffd intervention. */ + for (i = 0; i < 10; i++) { + char *curr = &ptr[i * page_size]; + + ASSERT_FALSE(try_read_write_buf(curr)); + } + + /* Cleanup. */ + ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0); + close(uffd); + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +TEST_HARNESS_MAIN
On 10/20/24 10:20, Lorenzo Stoakes wrote:
Utilise the kselftest harness to implement tests for the guard page implementation.
We start by implementing basic tests asserting that guard pages can be established (poisoned), cleared (remedied) and that touching poisoned pages results in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine how different operations behave on regions containing poison markers, to ensure correct behaviour:
- Operations over multiple VMAs operate as expected.
- Invoking MADV_GUARD_POISON / MADV_GUARD_UNPOISON via process_madvise() in batches works correctly.
- Ensuring that munmap() correctly tears down poison markers.
- Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers.
- Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another.
- Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
- Ensuring that mlock()'ing a range containing poison markers does not cause issues.
- Ensuring that mremap() can move a poisoned range and retain poison markers.
- Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range).
- Ensuring that mremap() can shrink a poisoned range and retain poison markers.
- Ensuring that forking a process correctly retains poison markers.
- Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
- Ensuring that lazyfree simply clears poison markers.
- Ensuring that userfaultfd can co-exist with guard pages.
- Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
- Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
If any test is unable to be run due to lack of permissions, that test is skipped.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
 tools/testing/selftests/mm/.gitignore    |    1 +
 tools/testing/selftests/mm/Makefile      |    1 +
 tools/testing/selftests/mm/guard-pages.c | 1228 ++++++++++++++++++++++
 3 files changed, 1230 insertions(+)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 689bbd520296..8f01f4da1c0d 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -54,3 +54,4 @@ droppable
 hugetlb_dio
 pkey_sighandler_tests_32
 pkey_sighandler_tests_64
+guard-pages
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 02e1204971b0..15c734d6cfec 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
 TEST_GEN_FILES += hugetlb_madv_vs_map
 TEST_GEN_FILES += hugetlb_dio
 TEST_GEN_FILES += droppable
+TEST_GEN_FILES += guard-pages

 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
new file mode 100644
index 000000000000..f67d2700d44a
--- /dev/null
+++ b/tools/testing/selftests/mm/guard-pages.c
@@ -0,0 +1,1228 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE
+#include "../kselftest_harness.h"
+#include <asm-generic/mman.h> /* Force the import of the tools version. */
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/userfaultfd.h>
+#include <setjmp.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/uio.h>
+#include <unistd.h>
+/*
+ * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1:
+ * "If the signal occurs other than as the result of calling the abort or raise
+ *  function, the behavior is undefined if the signal handler refers to any
+ *  object with static storage duration other than by assigning a value to an
+ *  object declared as volatile sig_atomic_t"
+ */
+static volatile sig_atomic_t signal_jump_set;
+static sigjmp_buf signal_jmp_buf;
+/*
+ * Ignore the checkpatch warning, we must read from x but don't want to do
+ * anything with it in order to trigger a read page fault. We therefore must use
+ * volatile to stop the compiler from optimising this away.
+ */
+#define FORCE_READ(x) (*(volatile typeof(x) *)x)
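(For context: the try_read_buf()/try_write_buf() helpers used throughout the tests build on the declarations above. Roughly - this is a sketch rather than the verbatim patch code, and it assumes a SIGSEGV/SIGBUS handler installed via sigaction() with SA_NODEFER that calls something like the handle_fatal() helper shown here - they work as follows:)

static void handle_fatal(int c)
{
	if (!signal_jump_set)
		return;

	signal_jump_set = false;
	siglongjmp(signal_jmp_buf, c);
}

static bool try_read_buf(char *ptr)
{
	bool failed;

	/* Tell the signal handler to jump back here on a fatal signal. */
	signal_jump_set = true;
	/* If a fatal signal arose, we land back here with failed == true. */
	failed = sigsetjmp(signal_jmp_buf, 0) != 0;

	if (!failed)
		FORCE_READ(ptr);

	signal_jump_set = false;
	return !failed;
}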
Thank you.
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
thanks, -- Shuah
On Mon, Oct 21, 2024 at 03:31:59PM -0600, Shuah Khan wrote:
Thank you.
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Thanks! :)
thanks, -- Shuah
* Lorenzo Stoakes:
Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'.
After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs.
We have a glibc (so not Android) dynamic linker bug that asks us to remove PROT_NONE mappings in mapped shared objects:
Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size https://sourceware.org/bugzilla/show_bug.cgi?id=31076
It's slightly different from a guard page because our main goal is to avoid other mappings ending up in those gaps, which has been shown to cause odd application behavior in cases where it happens. If I understand the series correctly, the kernel would not automatically attribute those PROT_NONE gaps to the previous or subsequent mapping. We would have to extend one of the surrounding mappings and apply MADV_GUARD_POISON to that over-mapped part. That doesn't seem too onerous.
Could the ELF loader in the kernel do the same thing for the main executable and the program loader?
On Sun, Oct 20, 2024 at 07:37:54PM +0200, Florian Weimer wrote:
We have a glibc (so not Android) dynamic linker bug that asks us to remove PROT_NONE mappings in mapped shared objects:
Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size https://sourceware.org/bugzilla/show_bug.cgi?id=31076
It's slightly different from a guard page because our main goal is to avoid other mappings ending up in those gaps, which has been shown to cause odd application behavior in cases where it happens. If I understand the series correctly, the kernel would not automatically attribute those PROT_NONE gaps to the previous or subsequent mapping. We would have to extend one of the surrounding mappings and apply MADV_GUARD_POISON to that over-mapped part. That doesn't seem too onerous.
Could the ELF loader in the kernel do the same thing for the main executable and the program loader?
Currently this implementation is only available for private anonymous memory. We may look at extending it to shmem and file-backed memory in the future but there are a whole host of things to overcome to make that work so it's one step at a time! :)
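(To illustrate the private anonymous case that is supported, here is a minimal hypothetical sketch - not code from the series itself, and needing <sys/mman.h>, <unistd.h>, <err.h> plus the MADV_GUARD_POISON definition this series introduces - placing a guard page at the base of an anonymous region, e.g. for an allocator or thread stack:)

	size_t page_size = sysconf(_SC_PAGESIZE);
	size_t len = 64 * page_size;
	char *region;

	region = mmap(NULL, len, PROT_READ | PROT_WRITE,
		      MAP_ANON | MAP_PRIVATE, -1, 0);
	if (region == MAP_FAILED)
		err(1, "mmap");

	/* One VMA spans the whole range; the guard lives only in the page tables. */
	if (madvise(region, page_size, MADV_GUARD_POISON))
		err(1, "madvise(MADV_GUARD_POISON)");

	/* Any access to region[0 .. page_size) now raises a fatal SIGSEGV. */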
Hi Florian, Lorenzo,
This looks great!
What I am VERY interested in is if poisoned pages cause SIGSEGV even when the access happens in the kernel. Namely, the syscall still returns EFAULT, but also SIGSEGV is queued on return to user-space.
Catching bad accesses in system calls is currently the weak spot for all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). It's almost possible with userfaultfd, but catching faults in the kernel requires admin capability, so not really an option for generic bug detection tools (+inconvenience of userfaultfd setup/handler). Intercepting all EFAULT from syscalls is not generally possible (w/o ptrace, usually not an option as well), and EFAULT does not always mean a bug.
Triggering SIGSEGV even in syscalls would be not just a performance optimization, but a new useful capability that would allow it to catch more bugs.
Thanks
On 23.10.24 08:24, Dmitry Vyukov wrote:
Hi Florian, Lorenzo,
This looks great!
What I am VERY interested in is if poisoned pages cause SIGSEGV even when the access happens in the kernel. Namely, the syscall still returns EFAULT, but also SIGSEGV is queued on return to user-space.
Catching bad accesses in system calls is currently the weak spot for all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). It's almost possible with userfaultfd, but catching faults in the kernel requires admin capability, so not really an option for generic bug detection tools (+inconvinience of userfaultfd setup/handler). Intercepting all EFAULT from syscalls is not generally possible (w/o ptrace, usually not an option as well), and EFAULT does not always mean a bug.
Triggering SIGSEGV even in syscalls would be not just a performance optimization, but a new useful capability that would allow it to catch more bugs.
Right, we discussed that offline also as a possible extension to the userfaultfd SIGBUS mode.
I did not look into that yet, but I was wondering if there could be cases where a different process could trigger that SIGSEGV, and how to (and whether to) handle that.
For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I think with userfaultfd() we will currently return -EFAULT, because we call get_user_page_vma_remote() that is not prepared for dropping the mmap lock. Possibly that is the right thing to do, but not sure :)
These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be able to distinguish them and perform different handling.
+cc Linus as I reference a commit of his below...
On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote:
On 23.10.24 08:24, Dmitry Vyukov wrote:
Hi Florian, Lorenzo,
This looks great!
Thanks!
What I am VERY interested in is if poisoned pages cause SIGSEGV even when the access happens in the kernel. Namely, the syscall still returns EFAULT, but also SIGSEGV is queued on return to user-space.
Yeah we don't in any way.
I think adding something like this would be a bit of its own project.
The fault handler for this is in handle_pte_marker() in mm/memory.c, where we do the following:
	/* Hitting a guard page is always a fatal condition. */
	if (marker & PTE_MARKER_GUARD)
		return VM_FAULT_SIGSEGV;
So basically we pass this back to whoever invoked the fault. For uaccess we end up in arch-specific code that eventually checks exception tables etc. and for x86-64 that's kernelmode_fixup_or_oops().
There used to be a sig_on_uaccess_err in the x86-specific thread_struct that let you propagate it but Linus pulled it out in commit 02b670c1f88e ("x86/mm: Remove broken vsyscall emulation code from the page fault code") where it was presumably used for vsyscall.
Of course we could just get something much higher up the stack to send the signal, but we'd need to be careful we weren't breaking anything doing it...
I address GUP below.
Catching bad accesses in system calls is currently the weak spot for all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). It's almost possible with userfaultfd, but catching faults in the kernel requires admin capability, so not really an option for generic bug detection tools (+inconvinience of userfaultfd setup/handler). Intercepting all EFAULT from syscalls is not generally possible (w/o ptrace, usually not an option as well), and EFAULT does not always mean a bug.
Triggering SIGSEGV even in syscalls would be not just a performance optimization, but a new useful capability that would allow it to catch more bugs.
Right, we discussed that offline also as a possible extension to the userfaultfd SIGBUS mode.
I did not look into that yet, but I was wonder if there could be cases where a different process could trigger that SIGSEGV, and how to (and if to) handle that.
For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I think with userfaultfd() we will currently return -EFAULT, because we call get_user_page_vma_remote() that is not prepared for dropping the mmap lock. Possibly that is the right thing to do, but not sure :)
These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be able to distinguish them and perform different handling.
So all GUP will return -EFAULT when hitting guard pages unless we change something.
In GUP we handle this in faultin_page():
	if (ret & VM_FAULT_ERROR) {
		int err = vm_fault_to_errno(ret, flags);

		if (err)
			return err;
		BUG();
	}
And vm_fault_to_errno() is:
static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}
Again, I think if we wanted special handling here we'd probably need to propagate that fault from higher up; for one thing, we'd definitely need to not do so if it's remote, but I worry about other cases.
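(To make the current semantics concrete from the userspace side - a minimal sketch assuming only the MADV_GUARD_POISON interface from this series - a syscall that touches a guarded buffer simply fails:)

	size_t page_size = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			 MAP_ANON | MAP_PRIVATE, -1, 0);

	madvise(buf, page_size, MADV_GUARD_POISON);

	/*
	 * The kernel-side copy into the guarded buffer faults, the fault is
	 * fixed up via the exception tables and the syscall fails.
	 */
	ssize_t n = read(fd, buf, page_size);
	/* n == -1 and errno == EFAULT here; no signal is delivered. */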
-- Cheers,
David / dhildenb
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think an 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundamentally change how user access page faults are handled within the kernel :)
On Wed, 23 Oct 2024 at 10:12, Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
+cc Linus as reference a commit of his below...
On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote:
On 23.10.24 08:24, Dmitry Vyukov wrote:
Hi Florian, Lorenzo,
This looks great!
Thanks!
What I am VERY interested in is if poisoned pages cause SIGSEGV even when the access happens in the kernel. Namely, the syscall still returns EFAULT, but also SIGSEGV is queued on return to user-space.
Yeah we don't in any way.
I think adding something like this would be a bit of its own project.
I can totally understand this.
The fault handler for this is in handle_pte_marker() in mm/memory.c, where we do the following:
	/* Hitting a guard page is always a fatal condition. */
	if (marker & PTE_MARKER_GUARD)
		return VM_FAULT_SIGSEGV;
So basically we pass this back to whoever invoked the fault. For uaccess we end up in arch-specific code that eventually checks exception tables etc. and for x86-64 that's kernelmode_fixup_or_oops().
There used to be a sig_on_uaccess_err in the x86-specific thread_struct that let you propagate it but Linus pulled it out in commit 02b670c1f88e ("x86/mm: Remove broken vsyscall emulation code from the page fault code") where it was presumably used for vsyscall.
Of course we could just get something much higher up the stack to send the signal, but we'd need to be careful we weren't breaking anything doing it...
Can setting TIF_NOTIFY_RESUME and then doing the rest when returning to userspace help here?
I address GUP below.
Catching bad accesses in system calls is currently the weak spot for all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). It's almost possible with userfaultfd, but catching faults in the kernel requires admin capability, so not really an option for generic bug detection tools (+inconvinience of userfaultfd setup/handler). Intercepting all EFAULT from syscalls is not generally possible (w/o ptrace, usually not an option as well), and EFAULT does not always mean a bug.
Triggering SIGSEGV even in syscalls would be not just a performance optimization, but a new useful capability that would allow it to catch more bugs.
Right, we discussed that offline also as a possible extension to the userfaultfd SIGBUS mode.
I did not look into that yet, but I was wonder if there could be cases where a different process could trigger that SIGSEGV, and how to (and if to) handle that.
For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I think with userfaultfd() we will currently return -EFAULT, because we call get_user_page_vma_remote() that is not prepared for dropping the mmap lock. Possibly that is the right thing to do, but not sure :)
That's a good corner case. I guess also process_vm_readv/writev. Not triggering the signal in these cases looks like the right thing to do.
These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be able to distinguish them and perform different handling.
So all GUP will return -EFAULT when hitting guard pages unless we change something.
In GUP we handle this in faultin_page():
	if (ret & VM_FAULT_ERROR) {
		int err = vm_fault_to_errno(ret, flags);

		if (err)
			return err;
		BUG();
	}
And vm_fault_to_errno() is:
static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}
Again, I think if we wanted special handling here we'd need to probably propagate that fault from higher up, but yes we'd need to for one definitely not do so if it's remote but I worry about other cases.
-- Cheers,
David / dhildenb
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Would adding signal delivery only to guard PTEs help the ASAN etc. usecase enough? Wouldn't it instead be possible to add some prctl by which the whole ASANized process opts in to having all existing segfaults delivered as signals instead of -EFAULT?
On 23.10.24 11:06, Vlastimil Babka wrote:
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Would adding signal delivery to guard PTEs only help enough the ASAN etc usecase? Wouldn't it be instead possible to add some prctl to opt-in the whole ASANized process to deliver all existing segfaults as signals instead of -EFAULT ?
Not sure if it is an "instead", you might have to deliver the signal in addition to letting the syscall fail (not that I would be an expert on signal delivery :D ).
prctl sounds better, or some way to configure the behavior on VMA ranges; otherwise we would need yet another marker, which is not the end of the world but would make it slightly more confusing.
On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
On 23.10.24 11:06, Vlastimil Babka wrote:
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Would adding signal delivery to guard PTEs only help enough the ASAN etc usecase? Wouldn't it be instead possible to add some prctl to opt-in the whole ASANized process to deliver all existing segfaults as signals instead of -EFAULT ?
Not sure if it is an "instead", you might have to deliver the signal in addition to letting the syscall fail (not that I would be an expert on signal delivery :D ).
prctl sounds better, or some way to configure the behavior on VMA ranges; otherwise we would need yet another marker, which is not the end of the world but would make it slightly more confusing.
Yeah prctl() sounds sensible, and since we are explicitly adding a marker for guard pages here we can do this as a follow up too without breaking any userland expectations, i.e. 'new feature to make guard pages signal' is not going to contradict the default behaviour.
So all makes sense to me, but I do think best as a follow up! :)
-- Cheers,
David / dhildenb
On 23.10.24 11:18, Lorenzo Stoakes wrote:
On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
On 23.10.24 11:06, Vlastimil Babka wrote:
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Would adding signal delivery to guard PTEs only help enough the ASAN etc usecase? Wouldn't it be instead possible to add some prctl to opt-in the whole ASANized process to deliver all existing segfaults as signals instead of -EFAULT ?
Not sure if it is an "instead", you might have to deliver the signal in addition to letting the syscall fail (not that I would be an expert on signal delivery :D ).
prctl sounds better, or some way to configure the behavior on VMA ranges; otherwise we would need yet another marker, which is not the end of the world but would make it slightly more confusing.
Yeah prctl() sounds sensible, and since we are explicitly adding a marker for guard pages here we can do this as a follow up too without breaking any userland expectations, i.e. 'new feature to make guard pages signal' is not going to contradict the default behaviour.
So all makes sense to me, but I do think best as a follow up! :)
Yeah, fully agreed. And my gut feeling is that it might not be that easy ... :)
In the end, what we want is *some* notification that a guard PTE was accessed. The notification does not necessarily have to be completely synchronous (although that would be ideal), and it does not have to be a signal.
Maybe having a different way to obtain that information from user space would work.
On Wed, 23 Oct 2024 at 11:29, David Hildenbrand david@redhat.com wrote:
On 23.10.24 11:18, Lorenzo Stoakes wrote:
On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
On 23.10.24 11:06, Vlastimil Babka wrote:
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Would adding signal delivery to guard PTEs only help enough the ASAN etc usecase? Wouldn't it be instead possible to add some prctl to opt-in the whole ASANized process to deliver all existing segfaults as signals instead of -EFAULT ?
Not sure if it is an "instead", you might have to deliver the signal in addition to letting the syscall fail (not that I would be an expert on signal delivery :D ).
prctl sounds better, or some way to configure the behavior on VMA ranges; otherwise we would need yet another marker, which is not the end of the world but would make it slightly more confusing.
Yeah prctl() sounds sensible, and since we are explicitly adding a marker for guard pages here we can do this as a follow up too without breaking any userland expectations, i.e. 'new feature to make guard pages signal' is not going to contradict the default behaviour.
So all makes sense to me, but I do think best as a follow up! :)
Yeah, fully agreed. And my gut feeling is that it might not be that easy ... :)
In the end, what we want is *some* notification that a guard PTE was accessed. Likely the notification must not necessarily completely synchronous (although it would be ideal) and it must not be a signal.
Maybe having a different way to obtain that information from user space would work.
For bug detection tools (like GWP-ASan [1]) it's essential to have useful stack traces. As such, having this signal be synchronous would be more useful. I don't see how one could get a useful stack trace (or other information like what's stashed away in ucontext like CPU registers) if this were asynchronous.
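(For illustration: such a tool typically collects its report directly in a synchronous handler, roughly as in the sketch below, where report_bad_access() stands in for whatever the tool does with the information - a hypothetical name, not an API of any of the tools mentioned. Needs <signal.h> and <ucontext.h>.)

static void segv_handler(int sig, siginfo_t *info, void *ucontext_raw)
{
	ucontext_t *uc = ucontext_raw;

	/*
	 * info->si_addr is the faulting address and uc holds the register
	 * state at the fault, so a meaningful stack trace can be captured
	 * here - neither is available if a syscall merely returns -EFAULT.
	 */
	report_bad_access(info->si_addr, uc);	/* hypothetical reporting hook */
}

static void install_handler(void)
{
	struct sigaction sa = {
		.sa_sigaction = segv_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGSEGV, &sa, NULL);
}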
On 23.10.24 13:31, Marco Elver wrote:
On Wed, 23 Oct 2024 at 11:29, David Hildenbrand david@redhat.com wrote:
On 23.10.24 11:18, Lorenzo Stoakes wrote:
On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
On 23.10.24 11:06, Vlastimil Babka wrote:
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Would adding signal delivery to guard PTEs only help enough the ASAN etc usecase? Wouldn't it be instead possible to add some prctl to opt-in the whole ASANized process to deliver all existing segfaults as signals instead of -EFAULT ?
Not sure if it is an "instead", you might have to deliver the signal in addition to letting the syscall fail (not that I would be an expert on signal delivery :D ).
prctl sounds better, or some way to configure the behavior on VMA ranges; otherwise we would need yet another marker, which is not the end of the world but would make it slightly more confusing.
Yeah prctl() sounds sensible, and since we are explicitly adding a marker for guard pages here we can do this as a follow up too without breaking any userland expectations, i.e. 'new feature to make guard pages signal' is not going to contradict the default behaviour.
So all makes sense to me, but I do think best as a follow up! :)
Yeah, fully agreed. And my gut feeling is that it might not be that easy ... :)
In the end, what we want is *some* notification that a guard PTE was accessed. Likely the notification must not necessarily completely synchronous (although it would be ideal) and it must not be a signal.
Maybe having a different way to obtain that information from user space would work.
For bug detection tools (like GWP-ASan [1]) it's essential to have useful stack traces. As such, having this signal be synchronous would be more useful. I don't see how one could get a useful stack trace (or other information like what's stashed away in ucontext like CPU registers) if this were asynchronous.
Yes, I know. But it would be better than not getting *any* notification other than some syscalls simply failing with -EFAULT, and not having any idea which address was even accessed.
Maybe the signal injection is easier than I think, but I somehow doubt it ...
On Wed, Oct 23, 2024 at 01:36:10PM +0200, David Hildenbrand wrote:
On 23.10.24 13:31, Marco Elver wrote:
On Wed, 23 Oct 2024 at 11:29, David Hildenbrand david@redhat.com wrote:
On 23.10.24 11:18, Lorenzo Stoakes wrote:
On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
On 23.10.24 11:06, Vlastimil Babka wrote:
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Yeah I'm afraid I don't think this series is a place where I can fundamentally change how something so sensitive works in the kernel.
It's especially sensitive because this is a uAPI change - a wrong decision here could result in guard pages being broken out of the gate, and I really don't want to risk that.
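To keep the behaviour under discussion concrete: with this series as proposed, a direct userspace access to a poisoned range raises a fatal signal, while a kernel access on the process's behalf fails with -EFAULT and no signal is delivered. A rough sketch of that asymmetry follows; the MADV_GUARD_POISON value is an assumption taken from the series' uapi changes and is not in released kernel headers, so on an unpatched kernel the madvise() call simply fails.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102	/* assumption: value from this series' uapi patch */
#endif

int main(void)
{
	size_t len = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int fd = open("/dev/zero", O_RDONLY);

	if (p == MAP_FAILED || fd < 0)
		return 1;

	if (madvise(p, len, MADV_GUARD_POISON)) {
		perror("madvise");	/* e.g. kernel without this series */
		return 1;
	}

	/*
	 * Kernel access on our behalf: read() copying into the guarded
	 * range fails with -EFAULT; no signal is delivered.
	 */
	if (read(fd, p, 1) < 0)
		printf("read into guarded range: %s\n", strerror(errno));

	/*
	 * Direct userspace access: this would instead raise a fatal
	 * SIGSEGV - the asymmetry being discussed above.
	 */
	/* char c = *(volatile char *)p; */

	return 0;
}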
On Wed, 23 Oct 2024 at 11:06, Vlastimil Babka vbabka@suse.cz wrote:
On 10/23/24 10:56, Dmitry Vyukov wrote:
Overall, while I sympathise with this, it feels dangerous and a pretty major change: there'll be something somewhere that will break because it expects faults to be swallowed that we would no longer swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think an 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundamentally change how user access page faults are handled within the kernel :)
Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
Would adding signal delivery just for guard PTEs help the ASAN etc. use case enough? Wouldn't it instead be possible to add some prctl to opt the whole ASANized process in to delivering all existing segfaults as signals instead of -EFAULT?
ASAN per se does not need this (it does not use page protection). However, if you mean bug detection tools in general, then, yes, that's what I had in mind. There are also things like stack guard pages in libc that would benefit from that as well.
But I observed that some libraries intentionally use EFAULT to probe for memory readability, i.e. use some cheap syscall to probe memory before reading it. So changing behavior globally may not work.
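For reference, the probing idiom referred to here generally looks like the sketch below: hand the address to a cheap syscall and treat -EFAULT as "not readable", so that no fault is ever taken directly in userspace. Which syscall is used varies between libraries (write() to a pipe, mincore(), process_vm_readv(), ...); a pipe write is just one common choice.

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Sketch of the EFAULT-probing idiom: let the kernel perform the access
 * and report -EFAULT, rather than touching the memory directly (which,
 * with a guard page, would be fatal to the process).
 */
static bool is_readable(const void *addr)
{
	int fds[2];
	ssize_t n;
	bool readable;

	if (pipe(fds))
		return false;

	/* write() copies one byte from addr into the pipe. */
	n = write(fds[1], addr, 1);
	readable = !(n < 0 && errno == EFAULT);

	close(fds[0]);
	close(fds[1]);
	return readable;
}

int main(void)
{
	int x = 42;

	/* Readable stack memory probes as such; a guard page would not. */
	return is_readable(&x) ? 0 : 1;
}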