Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process.
The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'.
After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning.
Any poisoned region of memory is also able to 'unpoisoned', that is, to have its poison markers removed.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this poisoning.
Poisoning can be performed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the poisoned page table mappings.
There is no concept of 'nested' poisoning, multiple attempts to poison a range will, after the first poisoning, have no effect.
Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned memory, so a user can safely unpoison a range of memory and clear only poison page table mappings leaving the rest intact.
The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it.
We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse).
We ensure that zapping performed by, for instance, MADV_DONTNEED, does not remove guard poison markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics).
It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the poisoning if they wish. We simply implement it in such a way as to provide the least surprising behaviour.
An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of poisoned ranges.
Suggested-by: Vlastimil Babka vbabka@suze.cz Suggested-by: Jann Horn jannh@google.com Suggested-by: David Hildenbrand david@redhat.com
v1 * Un-RFC'd as appears no major objections to approach but rather debate on implementation. * Fixed issue with arches which need mmu_context.h and tlbfush.h. header imports in pagewalker logic to be able to use update_mmu_cache() as reported by the kernel test bot. * Added comments in page walker logic to clarify who can use ops->install_pte and why as well as adding a check_ops_valid() helper function, as suggested by Christoph. * Pass false in full parameter in pte_clear_not_present_full() as suggested by Jann. * Stopped erroneously requiring a write lock for the poison operation as suggested by Jann and Suren. * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be consistent with how this is used elsewhere in the kernel as suggested by Jann. * Avoid returning -EAGAIN if we are raced on page faults, just keep looping and duck out if a fatal signal is pending or a conditional reschedule is needed, as suggested by Jann. * Avoid needlessly splitting huge PUDs and PMDs by specifying ACTION_CONTINUE, as suggested by Jann.
RFC https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (4): mm: pagewalk: add the ability to install PTEs mm: add PTE_MARKER_GUARD PTE marker mm: madvise: implement lightweight guard page mechanism selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/linux/mm_inline.h | 2 +- include/linux/pagewalk.h | 18 +- include/linux/swapops.h | 26 +- include/uapi/asm-generic/mman-common.h | 3 + mm/hugetlb.c | 3 + mm/internal.h | 6 + mm/madvise.c | 168 ++++ mm/memory.c | 18 +- mm/mprotect.c | 3 +- mm/mseal.c | 1 + mm/pagewalk.c | 200 ++-- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 18 files changed, 1564 insertions(+), 66 deletions(-) create mode 100644 tools/testing/selftests/mm/guard-pages.c
-- 2.46.2
The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks.
This is useful for traversing existing page table entries, but precludes the ability to establish new ones.
Existing mechanism for performing a walk which also installs page table entries if necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere.
Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal.
If a THP huge page is encountered, we make use of existing logic to split it. Then once we reach the PTE level, we invoke the install_pte() callback which provides a PTE entry to install. We do not support hugetlb at this stage.
If this function returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges.
If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range.
If install_pte() is not defined, then there is no functional difference to this patch, so all existing logic will work precisely as it did before.
As we only permit the installation of PTEs where a mapping does not already exist there is no need for TLB management, however we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs.
We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h.
Reviewed-by: Jann Horn jannh@google.com Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- include/linux/pagewalk.h | 18 +++- mm/internal.h | 6 ++ mm/pagewalk.c | 200 ++++++++++++++++++++++++++++----------- 3 files changed, 165 insertions(+), 59 deletions(-)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index f5eb5a32aeed..9700a29f8afb 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -25,12 +25,15 @@ enum page_walk_lock { * this handler is required to be able to handle * pmd_trans_huge() pmds. They may simply choose to * split_huge_page() instead of handling it explicitly. - * @pte_entry: if set, called for each PTE (lowest-level) entry, - * including empty ones + * @pte_entry: if set, called for each PTE (lowest-level) entry + * including empty ones, except if @install_pte is set. + * If @install_pte is set, @pte_entry is called only for + * existing PTEs. * @pte_hole: if set, called for each hole at all levels, * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. * Any folded depths (where PTRS_PER_P?D is equal to 1) - * are skipped. + * are skipped. If @install_pte is specified, this will + * not trigger for any populated ranges. * @hugetlb_entry: if set, called for each hugetlb entry. This hook * function is called with the vma lock held, in order to * protect against a concurrent freeing of the pte_t* or @@ -51,6 +54,13 @@ enum page_walk_lock { * @pre_vma: if set, called before starting walk on a non-null vma. * @post_vma: if set, called after a walk on a non-null vma, provided * that @pre_vma and the vma walk succeeded. + * @install_pte: if set, missing page table entries are installed and + * thus all levels are always walked in the specified + * range. This callback is then invoked at the PTE level + * (having split any THP pages prior), providing the PTE to + * install. If allocations fail, the walk is aborted. This + * operation is only available for userland memory. Not + * usable for hugetlb ranges. * * p?d_entry callbacks are called even if those levels are folded on a * particular architecture/configuration. @@ -76,6 +86,8 @@ struct mm_walk_ops { int (*pre_vma)(unsigned long start, unsigned long end, struct mm_walk *walk); void (*post_vma)(struct mm_walk *walk); + int (*install_pte)(unsigned long addr, unsigned long next, + pte_t *ptep, struct mm_walk *walk); enum page_walk_lock walk_lock; };
diff --git a/mm/internal.h b/mm/internal.h index 508f7802dd2b..fb1fb0c984e4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -12,6 +12,7 @@ #include <linux/mm.h> #include <linux/mm_inline.h> #include <linux/pagemap.h> +#include <linux/pagewalk.h> #include <linux/rmap.h> #include <linux/swap.h> #include <linux/swapops.h> @@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page) } #endif /* CONFIG_UNACCEPTED_MEMORY */
+/* pagewalk.c */ +int walk_page_range_mm(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + void *private); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 5f9f01532e67..261cd5f2de38 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -3,9 +3,14 @@ #include <linux/highmem.h> #include <linux/sched.h> #include <linux/hugetlb.h> +#include <linux/mmu_context.h> #include <linux/swap.h> #include <linux/swapops.h>
+#include <asm/tlbflush.h> + +#include "internal.h" + /* * We want to know the real level where a entry is located ignoring any * folding of levels which may be happening. For example if p4d is folded then @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr, int err = 0;
for (;;) { - err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); - if (err) - break; + if (ops->install_pte && pte_none(ptep_get(pte))) { + pte_t new_pte; + + err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte, + walk); + if (err) + break; + + set_pte_at(walk->mm, addr, pte, new_pte); + /* Non-present before, so for arches that need it. */ + if (!WARN_ON_ONCE(walk->no_vma)) + update_mmu_cache(walk->vma, addr, pte); + } else { + err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); + if (err) + break; + } if (addr >= end - PAGE_SIZE) break; addr += PAGE_SIZE; @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, again: next = pmd_addr_end(addr, end); if (pmd_none(*pmd)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __pte_alloc(walk->mm, pmd); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; }
walk->action = ACTION_SUBTREE; @@ -116,7 +138,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, */ if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) || walk->action == ACTION_CONTINUE || - !(ops->pte_entry)) + !(ops->pte_entry || ops->install_pte)) continue;
if (walk->vma) @@ -148,11 +170,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, again: next = pud_addr_end(addr, end); if (pud_none(*pud)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __pmd_alloc(walk->mm, pud, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; }
walk->action = ACTION_SUBTREE; @@ -167,7 +192,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) || walk->action == ACTION_CONTINUE || - !(ops->pmd_entry || ops->pte_entry)) + !(ops->pmd_entry || ops->pte_entry || ops->install_pte)) continue;
if (walk->vma) @@ -196,18 +221,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, do { next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(p4d)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __pud_alloc(walk->mm, p4d, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; } if (ops->p4d_entry) { err = ops->p4d_entry(p4d, addr, next, walk); if (err) break; } - if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (ops->pud_entry || ops->pmd_entry || ops->pte_entry || + ops->install_pte) err = walk_pud_range(p4d, addr, next, walk); if (err) break; @@ -231,18 +260,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) { - if (ops->pte_hole) + if (ops->install_pte) + err = __p4d_alloc(walk->mm, pgd, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, 0, walk); if (err) break; - continue; + if (!ops->install_pte) + continue; } if (ops->pgd_entry) { err = ops->pgd_entry(pgd, addr, next, walk); if (err) break; } - if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || + ops->pte_entry || ops->install_pte) err = walk_p4d_range(pgd, addr, next, walk); if (err) break; @@ -334,6 +367,11 @@ static int __walk_page_range(unsigned long start, unsigned long end, int err = 0; struct vm_area_struct *vma = walk->vma; const struct mm_walk_ops *ops = walk->ops; + bool is_hugetlb = is_vm_hugetlb_page(vma); + + /* We do not support hugetlb PTE installation. */ + if (ops->install_pte && is_hugetlb) + return -EINVAL;
if (ops->pre_vma) { err = ops->pre_vma(start, end, walk); @@ -341,7 +379,7 @@ static int __walk_page_range(unsigned long start, unsigned long end, return err; }
- if (is_vm_hugetlb_page(vma)) { + if (is_hugetlb) { if (ops->hugetlb_entry) err = walk_hugetlb_range(start, end, walk); } else @@ -380,47 +418,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma, #endif }
-/** - * walk_page_range - walk page table with caller specific callbacks - * @mm: mm_struct representing the target process of page table walk - * @start: start address of the virtual address range - * @end: end address of the virtual address range - * @ops: operation to call during the walk - * @private: private data for callbacks' usage - * - * Recursively walk the page table tree of the process represented by @mm - * within the virtual address range [@start, @end). During walking, we can do - * some caller-specific works for each entry, by setting up pmd_entry(), - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these - * callbacks, the associated entries/pages are just ignored. - * The return values of these callbacks are commonly defined like below: - * - * - 0 : succeeded to handle the current entry, and if you don't reach the - * end address yet, continue to walk. - * - >0 : succeeded to handle the current entry, and return to the caller - * with caller specific value. - * - <0 : failed to handle the current entry, and return to the caller - * with error code. - * - * Before starting to walk page table, some callers want to check whether - * they really want to walk over the current vma, typically by checking - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this - * purpose. - * - * If operations need to be staged before and committed after a vma is walked, - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(), - * since it is intended to handle commit-type operations, can't return any - * errors. - * - * struct mm_walk keeps current values of some common data like vma and pmd, - * which are useful for the access from callbacks. If you want to pass some - * caller-specific data to callbacks, @private should be helpful. +/* + * See the comment for walk_page_range(), this performs the heavy lifting of the + * operation, only sets no restrictions on how the walk proceeds. * - * Locking: - * Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock, - * because these function traverse vma list and/or access to vma's data. + * We usually restrict the ability to install PTEs, but this functionality is + * available to internal memory management code and provided in mm/internal.h. */ -int walk_page_range(struct mm_struct *mm, unsigned long start, +int walk_page_range_mm(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private) { @@ -479,6 +484,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, return err; }
+/* + * Determine if the walk operations specified are permitted to be used for a + * page table walk. + * + * This check is performed on all functions which are parameterised by walk + * operations and exposed in include/linux/pagewalk.h. + * + * Internal memory management code can use the walk_page_range_mm() function to + * be able to use all page walking operations. + */ +static bool check_ops_valid(const struct mm_walk_ops *ops) +{ + /* + * The installation of PTEs is solely under the control of memory + * management logic and subject to many subtle locking, security and + * cache considerations so we cannot permit other users to do so, and + * certainly not for exported symbols. + */ + if (ops->install_pte) + return false; + + return true; +} + +/** + * walk_page_range - walk page table with caller specific callbacks + * @mm: mm_struct representing the target process of page table walk + * @start: start address of the virtual address range + * @end: end address of the virtual address range + * @ops: operation to call during the walk + * @private: private data for callbacks' usage + * + * Recursively walk the page table tree of the process represented by @mm + * within the virtual address range [@start, @end). During walking, we can do + * some caller-specific works for each entry, by setting up pmd_entry(), + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these + * callbacks, the associated entries/pages are just ignored. + * The return values of these callbacks are commonly defined like below: + * + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. + * + * Before starting to walk page table, some callers want to check whether + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this + * purpose. + * + * If operations need to be staged before and committed after a vma is walked, + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(), + * since it is intended to handle commit-type operations, can't return any + * errors. + * + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @private should be helpful. + * + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock, + * because these function traverse vma list and/or access to vma's data. + */ +int walk_page_range(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + void *private) +{ + if (!check_ops_valid(ops)) + return -EINVAL; + + return walk_page_range_mm(mm, start, end, ops, private); +} + /** * walk_page_range_novma - walk a range of pagetables not backed by a vma * @mm: mm_struct representing the target process of page table walk @@ -494,7 +573,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, * walking the kernel pages tables or page tables for firmware. * * Note: Be careful to walk the kernel pages tables, the caller may be need to - * take other effective approache (mmap lock may be insufficient) to prevent + * take other effective approaches (mmap lock may be insufficient) to prevent * the intermediate kernel page tables belonging to the specified address range * from being freed (e.g. memory hot-remove). */ @@ -513,6 +592,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
if (start >= end || !walk.mm) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL;
/* * 1) For walking the user virtual address space: @@ -556,6 +637,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start, return -EINVAL; if (start < vma->vm_start || end > vma->vm_end) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock); process_vma_walk_lock(vma, ops->walk_lock); @@ -574,6 +657,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
if (!walk.mm) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock); process_vma_walk_lock(vma, ops->walk_lock); @@ -623,6 +708,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, unsigned long start_addr, end_addr; int err = 0;
+ if (!check_ops_valid(ops)) + return -EINVAL; + lockdep_assert_held(&mapping->i_mmap_rwsem); vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, first_index + nr - 1) {
Add a new PTE marker that results in any access causing the accessing process to segfault.
This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.
This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.
Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect.
We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or the like.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- include/linux/mm_inline.h | 2 +- include/linux/swapops.h | 26 ++++++++++++++++++++++++-- mm/hugetlb.c | 3 +++ mm/memory.c | 18 +++++++++++++++--- 4 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 6f801c7b36e2..0d97a14bf051 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -531,7 +531,7 @@ static inline pte_marker copy_pte_marker( { pte_marker srcm = pte_marker_get(entry); /* Always copy error entries. */ - pte_marker dstm = srcm & PTE_MARKER_POISONED; + pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD);
/* Only copy PTE markers if UFFD register matches. */ if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma)) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index cb468e418ea1..4d0606df0791 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -426,9 +426,15 @@ typedef unsigned long pte_marker; * "Poisoned" here is meant in the very general sense of "future accesses are * invalid", instead of referring very specifically to hardware memory errors. * This marker is meant to represent any of various different causes of this. + * + * Note that, when encountered by the faulting logic, PTEs with this marker will + * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error + * logic. */ #define PTE_MARKER_POISONED BIT(1) -#define PTE_MARKER_MASK (BIT(2) - 1) +/* Indicates that, on fault, this PTE will case a SIGSEGV signal to be sent. */ +#define PTE_MARKER_GUARD BIT(2) +#define PTE_MARKER_MASK (BIT(3) - 1)
static inline swp_entry_t make_pte_marker_entry(pte_marker marker) { @@ -461,9 +467,25 @@ static inline swp_entry_t make_poisoned_swp_entry(void) }
static inline int is_poisoned_swp_entry(swp_entry_t entry) +{ + /* + * We treat guard pages as poisoned too as these have the same semantics + * as poisoned ranges, only with different fault handling. + */ + return is_pte_marker_entry(entry) && + (pte_marker_get(entry) & + (PTE_MARKER_POISONED | PTE_MARKER_GUARD)); +} + +static inline swp_entry_t make_guard_swp_entry(void) +{ + return make_pte_marker_entry(PTE_MARKER_GUARD); +} + +static inline int is_guard_swp_entry(swp_entry_t entry) { return is_pte_marker_entry(entry) && - (pte_marker_get(entry) & PTE_MARKER_POISONED); + (pte_marker_get(entry) & PTE_MARKER_GUARD); }
/* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 906294ac85dc..50e3f6ed73ac 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex; + } else if (marker & PTE_MARKER_GUARD) { + ret = VM_FAULT_SIGSEGV; + goto out_mutex; } }
diff --git a/mm/memory.c b/mm/memory.c index 0f614523b9f4..551455cd453f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details, return !folio_test_anon(folio); }
-static inline bool zap_drop_file_uffd_wp(struct zap_details *details) +static inline bool zap_drop_markers(struct zap_details *details) { if (!details) return false; @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (vma_is_anonymous(vma)) return;
- if (zap_drop_file_uffd_wp(details)) + if (zap_drop_markers(details)) return;
for (;;) { @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * drop the marker if explicitly requested. */ if (!vma_is_anonymous(vma) && - !zap_drop_file_uffd_wp(details)) + !zap_drop_markers(details)) + continue; + } else if (is_guard_swp_entry(entry)) { + /* + * Ordinary zapping should not remove guard PTE + * markers. Only do so if we should remove PTE markers + * in general. + */ + if (!zap_drop_markers(details)) continue; } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { @@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (marker & PTE_MARKER_POISONED) return VM_FAULT_HWPOISON;
+ /* Hitting a guard page is always a fatal condition. */ + if (marker & PTE_MARKER_GUARD) + return VM_FAULT_SIGSEGV; + if (pte_marker_entry_uffd_wp(entry)) return pte_marker_handle_uffd_wp(vmf);
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise.
Currently users must establish PROT_NONE ranges to achieve this.
However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs.
In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges.
The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - poisoning PTEs such that accesses to them cause a fault followed by a SIGSGEV signal being raised.
This is achieved through the PTE marker mechanism, which a previous commit in this series extended to permit this to be done, installed via the generic page walking logic, also extended by a prior commit for this purpose.
These poison ranges are established with MADV_GUARD_POISON, and if the range in which they are installed contain any existing mappings, they will be zapped, i.e. free the range and unmap memory (thus mimicking the behaviour of MADV_DONTNEED in this respect).
Any existing poison entries will be left untouched. There is no nesting of poisoned pages.
Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather unexpected behaviour, but are cleared on process teardown or unmapping of memory ranges.
Ranges can have the poison property removed by MADV_GUARD_UNPOISON - 'remedying' the poisoning. The ranges over which this is applied, should they contain non-poison entries, will be untouched, only poison entries will be cleared.
We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive).
Suggested-by: Vlastimil Babka vbabka@suze.cz Suggested-by: Jann Horn jannh@google.com Suggested-by: David Hildenbrand david@redhat.com Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/uapi/asm-generic/mman-common.h | 3 + mm/madvise.c | 168 +++++++++++++++++++++++++ mm/mprotect.c | 3 +- mm/mseal.c | 1 + 8 files changed, 186 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index 763929e814e9..71e13f27742d 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -78,6 +78,9 @@
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index 9c48d9a21aa0..1a2222322f77 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -105,6 +105,9 @@
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 68c44f99bc93..380905522397 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -75,6 +75,9 @@ #define MADV_HWPOISON 100 /* poison a page for testing */ #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */
+#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index 1ff0c858544f..e8d5affceb28 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -113,6 +113,9 @@
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..5dfd3d442de4 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -79,6 +79,9 @@
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0
diff --git a/mm/madvise.c b/mm/madvise.c index e871a72a6c32..7b9a357b84d2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -60,6 +60,8 @@ static int madvise_need_mmap_write(int behavior) case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE: + case MADV_GUARD_POISON: + case MADV_GUARD_UNPOISON: return 0; default: /* be safe, default to 1. list exceptions explicitly */ @@ -1017,6 +1019,166 @@ static long madvise_remove(struct vm_area_struct *vma, return error; }
+static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked) +{ + vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB; + + /* + * A user could lock after poisoning but that's fine, as they'd not be + * able to fault in. The issue arises when we try to zap existing locked + * VMAs. We don't want to do that. + */ + if (!allow_locked) + disallowed |= VM_LOCKED; + + if (!vma_is_anonymous(vma)) + return false; + + if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE) + return false; + + return true; +} + +static bool is_guard_pte_marker(pte_t ptent) +{ + return is_pte_marker(ptent) && + is_guard_swp_entry(pte_to_swp_entry(ptent)); +} + +static int guard_poison_pud_entry(pud_t *pud, unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + pud_t pudval = pudp_get(pud); + + /* Do not split a huge pud - we do nothing with these so just ignore. */ + if (pud_trans_huge(pudval) || pud_devmap(pudval)) + walk->action = ACTION_CONTINUE; + + return 0; +} + +static int guard_poison_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + pmd_t pmdval = pmdp_get(pmd); + + /* Do not split a huge pmd - we do nothing with these so just ignore. */ + if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) + walk->action = ACTION_CONTINUE; + + return 0; +} + +static int guard_poison_pte_entry(pte_t *pte, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + pte_t pteval = ptep_get(pte); + + /* + * If not a guard marker, simply abort the operation. We return a value + * > 0 indicating a non-error abort. + */ + return !is_guard_pte_marker(pteval); +} + +static int guard_poison_install_pte(unsigned long addr, unsigned long next, + pte_t *ptep, struct mm_walk *walk) +{ + /* Simply install a PTE marker, this causes segfault on access. */ + *ptep = make_pte_marker(PTE_MARKER_GUARD); + + return 0; +} + +static const struct mm_walk_ops guard_poison_walk_ops = { + .pud_entry = guard_poison_pud_entry, + .pmd_entry = guard_poison_pmd_entry, + .pte_entry = guard_poison_pte_entry, + .install_pte = guard_poison_install_pte, + .walk_lock = PGWALK_RDLOCK, +}; + +static long madvise_guard_poison(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end) +{ + long err; + + *prev = vma; + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) + return -EINVAL; + + /* + * If we install poison markers, then the range is no longer + * empty from a page table perspective and therefore it's + * appropriate to have an anon_vma. + * + * This ensures that on fork, we copy page tables correctly. + */ + err = anon_vma_prepare(vma); + if (err) + return err; + + /* + * Optimistically try to install the guard poison pages first. If any + * non-guard pages are encountered, give up and zap the range before + * trying again. + */ + while (true) { + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ + err = walk_page_range_mm(vma->vm_mm, start, end, + &guard_poison_walk_ops, NULL); + if (err <= 0) + return err; + + /* + * OK some of the range have non-guard pages mapped, zap + * them. This leaves existing guard pages in place. + */ + zap_page_range_single(vma, start, end - start, NULL); + + if (fatal_signal_pending(current)) + return -EINTR; + cond_resched(); + } +} + +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + pte_t ptent = ptep_get(pte); + + if (is_guard_pte_marker(ptent)) { + /* Simply clear the PTE marker. */ + pte_clear_not_present_full(walk->mm, addr, pte, false); + update_mmu_cache(walk->vma, addr, pte); + } + + return 0; +} + +static const struct mm_walk_ops guard_unpoison_walk_ops = { + .pte_entry = guard_unpoison_pte_entry, + .walk_lock = PGWALK_RDLOCK, +}; + +static long madvise_guard_unpoison(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end) +{ + *prev = vma; + /* + * We're ok with unpoisoning mlock()'d ranges, as this is a + * non-destructive action. + */ + if (!is_valid_guard_vma(vma, /* allow_locked = */true)) + return -EINVAL; + + return walk_page_range(vma->vm_mm, start, end, + &guard_unpoison_walk_ops, NULL); +} + /* * Apply an madvise behavior to a region of a vma. madvise_update_vma * will handle splitting a vm area into separate areas, each area with its own @@ -1098,6 +1260,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, break; case MADV_COLLAPSE: return madvise_collapse(vma, prev, start, end); + case MADV_GUARD_POISON: + return madvise_guard_poison(vma, prev, start, end); + case MADV_GUARD_UNPOISON: + return madvise_guard_unpoison(vma, prev, start, end); }
anon_name = anon_vma_name(vma); @@ -1197,6 +1363,8 @@ madvise_behavior_valid(int behavior) case MADV_DODUMP: case MADV_WIPEONFORK: case MADV_KEEPONFORK: + case MADV_GUARD_POISON: + case MADV_GUARD_UNPOISON: #ifdef CONFIG_MEMORY_FAILURE case MADV_SOFT_OFFLINE: case MADV_HWPOISON: diff --git a/mm/mprotect.c b/mm/mprotect.c index 0c5d6d06107d..d0e3ebfadef8 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -236,7 +236,8 @@ static long change_pte_range(struct mmu_gather *tlb, } else if (is_pte_marker_entry(entry)) { /* * Ignore error swap entries unconditionally, - * because any access should sigbus anyway. + * because any access should sigbus/sigsegv + * anyway. */ if (is_poisoned_swp_entry(entry)) continue; diff --git a/mm/mseal.c b/mm/mseal.c index ece977bd21e1..21bf5534bcf5 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior) case MADV_REMOVE: case MADV_DONTFORK: case MADV_WIPEONFORK: + case MADV_GUARD_POISON: return true; }
Utilise the kselftest harmness to implement tests for the guard page implementation.
We start by implement basic tests asserting that guard pages can be established (poisoned), cleared (remedied) and that touching poisoned pages result in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine different operations on regions containing poison markers behave to ensure correct behaviour:
* Operations over multiple VMAs operate as expected. * Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in batches works correctly. * Ensuring that munmap() correctly tears down poison markers. * Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers. * Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another. * Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers. * Ensuring that mlock()'ing a range containing poison markers does not cause issues. * Ensuring that mremap() can move a poisoned range and retain poison markers. * Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range). * Ensuring that mremap() can shrink a poisoned range and retain poison markers. * Ensuring that forking a process correctly retains poison markers. * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely. * Ensuring that lazyfree simply clears poison markers. * Ensuring that userfaultfd can co-exist with guard pages. * Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers. * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 3 files changed, 1170 insertions(+) create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages
ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..2ab0ff3ba5a0 --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <assert.h> +#include <fcntl.h> +#include <setjmp.h> +#include <errno.h> +#include <linux/userfaultfd.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h> + +/* These may not yet be available in the uAPI so define if not. */ + +#ifndef MADV_GUARD_POISON +#define MADV_GUARD_POISON 102 +#endif + +#ifndef MADV_GUARD_UNPOISON +#define MADV_GUARD_UNPOISON 103 +#endif + +volatile bool signal_jump_set; +sigjmp_buf signal_jmp_buf; + +static int userfaultfd(int flags) +{ + return syscall(SYS_userfaultfd, flags); +} + +static void handle_fatal(int c) +{ + if (!signal_jump_set) + return; + + siglongjmp(signal_jmp_buf, c); +} + +static int pidfd_open(pid_t pid, unsigned int flags) +{ + return syscall(SYS_pidfd_open, pid, flags); +} + +/* + * Enable our signal catcher and try to read/write the specified buffer. The + * return value indicates whether the read/write succeeds without a fatal + * signal. + */ +static bool try_access_buf(char *ptr, bool write) +{ + bool failed; + + /* Tell signal handler to jump back here on fatal signal. */ + signal_jump_set = true; + /* If a fatal signal arose, we will jump back here and failed is set. */ + failed = sigsetjmp(signal_jmp_buf, 0) != 0; + + if (!failed) { + if (write) { + *ptr = 'x'; + } else { + const volatile char *chr = ptr; + + /* Force read. */ + (void)*chr; + } + } + + signal_jump_set = false; + return !failed; +} + +/* Try and read from a buffer, return true if no fatal signal. */ +static bool try_read_buf(char *ptr) +{ + return try_access_buf(ptr, false); +} + +/* Try and write to a buffer, return true if no fatal signal. */ +static bool try_write_buf(char *ptr) +{ + return try_access_buf(ptr, true); +} + +/* + * Try and BOTH read from AND write to a buffer, return true if BOTH operations + * succeed. + */ +static bool try_read_write_buf(char *ptr) +{ + return try_read_buf(ptr) && try_write_buf(ptr); +} + +FIXTURE(guard_pages) +{ + unsigned long page_size; +}; + +FIXTURE_SETUP(guard_pages) +{ + struct sigaction act = { + .sa_handler = &handle_fatal, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + if (sigaction(SIGSEGV, &act, NULL)) { + perror("sigaction"); + ksft_exit_fail(); + } + + self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); +}; + +FIXTURE_TEARDOWN(guard_pages) +{ + struct sigaction act = { + .sa_handler = SIG_DFL, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + sigaction(SIGSEGV, &act, NULL); +} + +TEST_F(guard_pages, basic) +{ + const unsigned long NUM_PAGES = 10; + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Trivially assert we can touch the first page. */ + ASSERT_TRUE(try_read_write_buf(ptr)); + + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + + /* Establish that 1st page SIGSEGV's. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + + /* Ensure we can touch everything else.*/ + for (i = 1; i < NUM_PAGES; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Establish a guard page at the end of the mapping. */ + ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size, + MADV_GUARD_POISON), 0); + + /* Check that both guard pages result in SIGSEGV. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size])); + + /* Unpoison the first. */ + ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_UNPOISON)); + + /* Make sure we can touch it. */ + ASSERT_TRUE(try_read_write_buf(ptr)); + + /* Unpoison the last. */ + ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size, + MADV_GUARD_UNPOISON)); + + /* Make sure we can touch it. */ + ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size])); + + /* + * Test setting a _range_ of pages, namely the first 3. The first of + * these be faulted in, so this also tests that we can poison backed + * pages. + */ + ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure they are all poisoned. */ + for (i = 0; i < 3; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Make sure the rest are not. */ + for (i = 3; i < NUM_PAGES; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Unpoison them. */ + ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0); + + /* Now make sure we can touch everything. */ + for (i = 0; i < NUM_PAGES; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Now unpoison everything, make sure we don't remove existing entries */ + ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0); + + for (i = 0; i < NUM_PAGES * page_size; i += page_size) { + ASSERT_EQ(ptr[i], 'x'); + } + + ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0); +} + +/* Assert that operations applied across multiple VMAs work as expected. */ +TEST_F(guard_pages, multi_vma) +{ + const unsigned long page_size = self->page_size; + char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3; + int i; + + /* Reserve a 100 page region over which we can install VMAs. */ + ptr_region = mmap(NULL, 100 * page_size, PROT_NONE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_region, MAP_FAILED); + + /* Place a VMA of 10 pages size at the start of the region. */ + ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr1, MAP_FAILED); + + /* Place a VMA of 5 pages size 50 pages into the region. */ + ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Place a VMA of 20 pages size at the end of the region. */ + ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Unmap gaps. */ + ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0); + ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0); + + /* + * We end up with VMAs like this: + * + * 0 10 .. 50 55 .. 80 100 + * [---] [---] [---] + */ + + /* Now poison the whole range and make sure all VMAs are poisoned. */ + + /* + * madvise() is certifiable and lets you perform operations over gaps, + * everything works, but it indicates an error and errno is set to + * -ENOMEM. Also if anything runs out of memory it is set to + * -ENOMEM. You are meant to guess which is which. + */ + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), -1); + ASSERT_EQ(errno, ENOMEM); + + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size])); + } + + for (i = 0; i < 5; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr2[i * page_size])); + } + + for (i = 0; i < 20; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr3[i * page_size])); + } + + /* Now unpoison the range and assert the opposite. */ + + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), -1); + ASSERT_EQ(errno, ENOMEM); + + for (i = 0; i < 10; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr1[i * page_size])); + } + + for (i = 0; i < 5; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr2[i * page_size])); + } + + for (i = 0; i < 20; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr3[i * page_size])); + } + + /* Now map incompatible VMAs in the gaps. */ + ptr = mmap(&ptr_region[10 * page_size], 40 * page_size, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr = mmap(&ptr_region[55 * page_size], 25 * page_size, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * We end up with VMAs like this: + * + * 0 10 .. 50 55 .. 80 100 + * [---][xxxx][---][xxxx][---] + * + * Where 'x' signifies VMAs that cannot be merged with those adjacent to + * them. + */ + + /* Multiple VMAs adjacent to one another should result in no error. */ + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), 0); + for (i = 0; i < 100; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr_region[i * page_size])); + } + ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), 0); + for (i = 0; i < 100; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr_region[i * page_size])); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0); +} + +/* + * Assert that batched operations performed using process_madvise() work as + * expected. + */ +TEST_F(guard_pages, process_madvise) +{ + const unsigned long page_size = self->page_size; + pid_t pid = getpid(); + int pidfd = pidfd_open(pid, 0); + char *ptr_region, *ptr1, *ptr2, *ptr3; + ssize_t count; + struct iovec vec[6]; + + ASSERT_NE(pidfd, -1); + + /* Reserve region to map over. */ + ptr_region = mmap(NULL, 100 * page_size, PROT_NONE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_region, MAP_FAILED); + + /* 10 pages offset 1 page into reserve region. */ + ptr1 = mmap(&ptr_region[page_size], 10 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr1, MAP_FAILED); + /* We want poison markers at start/end of each VMA. */ + vec[0].iov_base = ptr1; + vec[0].iov_len = page_size; + vec[1].iov_base = &ptr1[9 * page_size]; + vec[1].iov_len = page_size; + + /* 5 pages offset 50 pages into reserve region. */ + ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + vec[2].iov_base = ptr2; + vec[2].iov_len = page_size; + vec[3].iov_base = &ptr2[4 * page_size]; + vec[3].iov_len = page_size; + + /* 20 pages offset 79 pages into reserve region. */ + ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size, + PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + vec[4].iov_base = ptr3; + vec[4].iov_len = page_size; + vec[5].iov_base = &ptr3[19 * page_size]; + vec[5].iov_len = page_size; + + /* Free surrounding VMAs. */ + ASSERT_EQ(munmap(ptr_region, page_size), 0); + ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0); + ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0); + ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0); + + /* Now poison in one step. */ + count = process_madvise(pidfd, vec, 6, MADV_GUARD_POISON, 0); + + /* OK we don't have permission to do this, skip. */ + if (count == -1 && errno == EPERM) + ksft_exit_skip("No process_madvise() permissions\n"); + + /* Returns the number of bytes advised. */ + ASSERT_EQ(count, 6 * page_size); + + /* Now make sure the poisoning was applied. */ + + ASSERT_FALSE(try_read_write_buf(ptr1)); + ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size])); + + ASSERT_FALSE(try_read_write_buf(ptr2)); + ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size])); + + ASSERT_FALSE(try_read_write_buf(ptr3)); + ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size])); + + /* Now do the same with unpoison... */ + count = process_madvise(pidfd, vec, 6, MADV_GUARD_UNPOISON, 0); + + /* ...and everything should now succeed. */ + + ASSERT_TRUE(try_read_write_buf(ptr1)); + ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size])); + + ASSERT_TRUE(try_read_write_buf(ptr2)); + ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size])); + + ASSERT_TRUE(try_read_write_buf(ptr3)); + ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size])); + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr1, 10 * page_size), 0); + ASSERT_EQ(munmap(ptr2, 5 * page_size), 0); + ASSERT_EQ(munmap(ptr3, 20 * page_size), 0); + close(pidfd); +} + +/* Assert that unmapping ranges does not leave poison behind. */ +TEST_F(guard_pages, munmap) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new1, *ptr_new2; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison first and last pages. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Assert that they are poisoned. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size])); + + /* Unmap them. */ + ASSERT_EQ(munmap(ptr, page_size), 0); + ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0); + + /* Map over them.*/ + ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new1, MAP_FAILED); + ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new2, MAP_FAILED); + + /* Assert that they are now not poisoned. */ + ASSERT_TRUE(try_read_write_buf(ptr_new1)); + ASSERT_TRUE(try_read_write_buf(ptr_new2)); + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Assert that mprotect() operations have no bearing on guard poison markers. */ +TEST_F(guard_pages, mprotect) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison the middle of the range. */ + ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size, + MADV_GUARD_POISON), 0); + + /* Assert that it is indeed poisoned. */ + ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size])); + ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size])); + + /* Now make these pages read-only. */ + ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0); + + /* Make sure the range is still poisoned. */ + ASSERT_FALSE(try_read_buf(&ptr[5 * page_size])); + ASSERT_FALSE(try_read_buf(&ptr[6 * page_size])); + + /* Make sure we can poison again without issue.*/ + ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size, + MADV_GUARD_POISON), 0); + + /* Make sure the range is, yet again, still poisoned. */ + ASSERT_FALSE(try_read_buf(&ptr[5 * page_size])); + ASSERT_FALSE(try_read_buf(&ptr[6 * page_size])); + + /* Now unpoison the whole range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + /* Make sure the whole range is readable. */ + for (i = 0; i < 10; i++) { + ASSERT_TRUE(try_read_buf(&ptr[i * page_size])); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Split and merge VMAs and make sure guard pages still behave. */ +TEST_F(guard_pages, split_merge) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison the whole range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure the whole range is poisoned. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Now unmap some pages in the range so we split. */ + ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0); + ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0); + ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0); + + /* Make sure the remaining ranges are poisoned post-split. */ + for (i = 0; i < 2; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + for (i = 2; i < 5; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + for (i = 6; i < 8; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + for (i = 9; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Now map them again - the unmap will have cleared the poison. */ + ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + + /* Now make sure poisoning is as expected. */ + for (i = 0; i < 10; i++) { + bool result = try_read_write_buf(&ptr[i * page_size]); + + if (i == 2 || i == 5 || i == 8) { + ASSERT_TRUE(result); + } else { + ASSERT_FALSE(result); + } + } + + /* Now poison everything again. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure the whole range is poisoned. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Now split the range into three. */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0); + + /* Make sure the whole range is poisoned for read. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_buf(&ptr[i * page_size])); + } + + /* Now reset protection bits so we merge the whole thing. */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, + PROT_READ | PROT_WRITE), 0); + + /* Make sure the whole range is still poisoned. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Split range into 3 again... */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0); + + /* ...and unpoison the whole range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + /* Make sure the whole range is remedied for read. */ + for (i = 0; i < 10; i++) { + ASSERT_TRUE(try_read_buf(&ptr[i * page_size])); + } + + /* Merge them again. */ + ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0); + ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, + PROT_READ | PROT_WRITE), 0); + + /* Now ensure the merged range is remedied for read/write. */ + for (i = 0; i < 10; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Assert that MADV_DONTNEED does not remove guard poison markers. */ +TEST_F(guard_pages, dontneed) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Back the whole range. */ + for (i = 0; i < 10; i++) { + ptr[i * page_size] = 'y'; + } + + /* Poison every other page. */ + for (i = 0; i < 10; i += 2) { + ASSERT_EQ(madvise(&ptr[i * page_size], + page_size, MADV_GUARD_POISON), 0); + } + + /* Indicate that we don't need any of the range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0); + + /* Check to ensure poison markers are still in place. */ + for (i = 0; i < 10; i++) { + bool result = try_read_buf(&ptr[i * page_size]); + + if (i % 2 == 0) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + /* Make sure we really did get reset to zero page. */ + ASSERT_EQ(ptr[i * page_size], '\0'); + } + + /* Now write... */ + result = try_write_buf(&ptr[i * page_size]); + + /* ...and make sure same result. */ + if (i % 2 == 0) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + } + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Assert that mlock()'ed pages work correctly with poison markers. */ +TEST_F(guard_pages, mlock) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Populate. */ + for (i = 0; i < 10; i++) { + ptr[i * page_size] = 'y'; + } + + /* Lock. */ + ASSERT_EQ(mlock(ptr, 10 * page_size), 0); + + /* Now try to poison, should fail with EINVAL. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), -1); + ASSERT_EQ(errno, EINVAL); + + /* OK unlock. */ + ASSERT_EQ(munlock(ptr, 10 * page_size), 0); + + /* Poison first half of range, should now succeed. */ + ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0); + + /* Make sure poison works. */ + for (i = 0; i < 10; i++) { + bool result = try_read_write_buf(&ptr[i * page_size]); + + if (i < 5) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + ASSERT_EQ(ptr[i * page_size], 'x'); + } + } + + /* + * Now lock the latter part of the range. We can't lock the poisoned + * pages, as this would result in the pages being populated and the + * poisoning would cause this to error out. + */ + ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0); + + /* + * Now unpoison, we do not permit mlock()'d ranges to be remedied as it is + * a non-destructive operation. + */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + /* Now check that everything is remedied. */ + for (i = 0; i < 10; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* + * Assert that moving, extending and shrinking memory via mremap() retains + * poison markers where possible. + * + * - Moving a mapping alone should retain markers as they are. + */ +TEST_F(guard_pages, mremap_move) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new; + + /* Map 5 pages. */ + ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Place poison markers at both ends of the 5 page span. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Make sure the poison is in effect. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Map a new region we will move this range into. Doing this ensures + * that we have reserved a range to map into. + */ + ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE, + -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + + ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new); + + /* Make sure the poison is retained. */ + ASSERT_FALSE(try_read_write_buf(ptr_new)); + ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size])); + + /* + * Clean up - we only need reference the new pointer as we overwrote the + * PROT_NONE range and moved the existing one. + */ + munmap(ptr_new, 5 * page_size); +} + +/* + * Assert that moving, extending and shrinking memory via mremap() retains + * poison markers where possible. + * + * - Expanding should retain, only now in different position. The user will have + * to unpoison manually to fix up (they'd have to do the same if it were a + * PROT_NONE mapping) + */ +TEST_F(guard_pages, mremap_expand) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr_new; + + /* Map 10 pages... */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + /* ...But unmap the last 5 so we can ensure we can expand into them. */ + ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0); + + /* Place poison markers at both ends of the 5 page span. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Make sure the poison is in effect. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Now expand to 10 pages. */ + ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Make sure the poison is retained in its original positions. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Reserve a region which we can move to and expand into. */ + ptr_new = mmap(NULL, 20 * page_size, PROT_NONE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr_new, MAP_FAILED); + + /* Now move and expand into it. */ + ptr = mremap(ptr, 10 * page_size, 20 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new); + ASSERT_EQ(ptr, ptr_new); + + /* Again, make sure the poison is retained in its original + * positions. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* + * A real user would have to unpoison, but would reasonably expect all + * characteristics of the mapping to be retained, including poison + * markers. + */ + + /* Cleanup. */ + munmap(ptr, 20 * page_size); +} +/* + * Assert that moving, extending and shrinking memory via mremap() retains + * poison markers where possible. + * + * - Shrinking will result in markers that are shrunk over being removed. Again, + * if the user were using a PROT_NONE mapping they'd have to manually fix this + * up also so this is OK. + */ +TEST_F(guard_pages, mremap_shrink) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + /* Map 5 pages. */ + ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Place poison markers at both ends of the 5 page span. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0); + ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0); + + /* Make sure the poison is in effect. */ + ASSERT_FALSE(try_read_write_buf(ptr)); + ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size])); + + /* Now shrink to 3 pages. */ + ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE); + ASSERT_NE(ptr, MAP_FAILED); + + /* We expect the poison marker at the start to be retained... */ + ASSERT_FALSE(try_read_write_buf(ptr)); + + /* ...But remaining pages will not have poison markers. */ + for (i = 1; i < 3; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size])); + } + + /* + * As with expansion, a real user would have to unpoison and fixup. But + * you'd have to do similar manual things with PROT_NONE mappings too. + */ + + /* + * If we expand back to the original size, the end marker will, of + * course, no longer be present. + */ + ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Again, we expect the poison marker at the start to be retained... */ + ASSERT_FALSE(try_read_write_buf(ptr)); + + /* ...But remaining pages will not have poison markers. */ + for (i = 1; i < 5; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size])); + } + + /* Cleanup. */ + munmap(ptr, 5 * page_size); +} + +/* + * Assert that forking a process with VMAs that do not have VM_WIPEONFORK set + * retain guard pages. + */ +TEST_F(guard_pages, fork) +{ + const unsigned long page_size = self->page_size; + char *ptr; + pid_t pid; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison the first 5 pages. */ + ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0); + + pid = fork(); + ASSERT_NE(pid, -1); + if (!pid) { + /* This is the child process now. */ + + /* Assert that the poisoning is in effect. */ + for (i = 0; i < 10; i++) { + bool result = try_read_write_buf(&ptr[i * page_size]); + + if (i < 5) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + } + } + + /* Now unpoison the range.*/ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0); + + exit(0); + } + + /* Parent process. */ + + /* Parent simply waits on child. */ + waitpid(pid, NULL, 0); + + /* Child unpoison does not impact parent page table state. */ + for (i = 0; i < 10; i++) { + bool result = try_read_write_buf(&ptr[i * page_size]); + + if (i < 5) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + } + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* + * Assert that forking a process with VMAs that do have VM_WIPEONFORK set + * behave as expected. + */ +TEST_F(guard_pages, fork_wipeonfork) +{ + const unsigned long page_size = self->page_size; + char *ptr; + pid_t pid; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Mark wipe on fork. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0); + + /* Poison the first 5 pages. */ + ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0); + + pid = fork(); + ASSERT_NE(pid, -1); + if (!pid) { + /* This is the child process now. */ + + /* Poison will have been wiped. */ + for (i = 0; i < 10; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size])); + } + + exit(0); + } + + /* Parent process. */ + + waitpid(pid, NULL, 0); + + /* Poison should be in effect.*/ + for (i = 0; i < 10; i++) { + bool result = try_read_write_buf(&ptr[i * page_size]); + + if (i < 5) { + ASSERT_FALSE(result); + } else { + ASSERT_TRUE(result); + } + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that MADV_FREE frees poison entries as expected. */ +TEST_F(guard_pages, lazyfree) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Ensure poisoned. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Lazyfree range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0); + + /* This should simply clear the poison markers. */ + for (i = 0; i < 10; i++) { + ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */ +TEST_F(guard_pages, populate) +{ + const unsigned long page_size = self->page_size; + char *ptr; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Populate read should error out... */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1); + ASSERT_EQ(errno, EFAULT); + + /* ...as should populate write. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1); + ASSERT_EQ(errno, EFAULT); + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove poison markers. */ +TEST_F(guard_pages, cold_pageout) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Poison range. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* Ensured poisoned. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Now mark cold. This should have no impact on poison markers. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0); + + /* Should remain poisoned. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* OK, now page out. This should equally, have no effect on markers. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0); + + /* Should remain poisoned. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Cleanup. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* Ensure that guard pages do not break userfaultd. */ +TEST_F(guard_pages, uffd) +{ + const unsigned long page_size = self->page_size; + int uffd; + char *ptr; + int i; + struct uffdio_api api = { + .api = UFFD_API, + .features = 0, + }; + struct uffdio_register reg; + struct uffdio_range range; + + /* Set up uffd. */ + uffd = userfaultfd(0); + if (uffd == -1 && errno == EPERM) + ksft_exit_skip("No uffd permissions\n"); + ASSERT_NE(uffd, -1); + + ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0); + + /* Map 10 pages. */ + ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Register the range with uffd. */ + range.start = (unsigned long)ptr; + range.len = 10 * page_size; + reg.range = range; + reg.mode = UFFDIO_REGISTER_MODE_MISSING; + ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, ®), 0); + + /* Poison the range. This should not trigger the uffd. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0); + + /* The poisoning should behave as usual with no uffd intervention. */ + for (i = 0; i < 10; i++) { + ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size])); + } + + /* Cleanup. */ + ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0); + close(uffd); + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +TEST_HARNESS_MAIN
On 10/17/24 14:42, Lorenzo Stoakes wrote:
Utilise the kselftest harmness to implement tests for the guard page
Splleing NIT - harmness -> harness
implementation.
We start by implement basic tests asserting that guard pages can be
implmenting? By the way checkpatch will catch spelling stuuf. Please see comments about warnings below.
established (poisoned), cleared (remedied) and that touching poisoned pages result in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine different operations on regions containing poison markers behave to ensure correct behaviour:
- Operations over multiple VMAs operate as expected.
- Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in batches works correctly.
- Ensuring that munmap() correctly tears down poison markers.
- Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers.
- Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another.
- Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
- Ensuring that mlock()'ing a range containing poison markers does not cause issues.
- Ensuring that mremap() can move a poisoned range and retain poison markers.
- Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range).
- Ensuring that mremap() can shrink a poisoned range and retain poison markers.
- Ensuring that forking a process correctly retains poison markers.
- Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
- Ensuring that lazyfree simply clears poison markers.
- Ensuring that userfaultfd can co-exist with guard pages.
- Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
- Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
Good summary of test. Does the test require root access? If so does it check and skip appropriately?
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 3 files changed, 1170 insertions(+) create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..2ab0ff3ba5a0 --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <assert.h> +#include <fcntl.h> +#include <setjmp.h> +#include <errno.h> +#include <linux/userfaultfd.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h>
+/* These may not yet be available in the uAPI so define if not. */
+#ifndef MADV_GUARD_POISON +#define MADV_GUARD_POISON 102 +#endif
+#ifndef MADV_GUARD_UNPOISON +#define MADV_GUARD_UNPOISON 103 +#endif
+volatile bool signal_jump_set;
Can you add a comment about why volatile is needed. By the way did you happen to run checkpatck on this. There are several instances where single statement blocks with braces {}
I noticed a few and ran checkpatch on your patch. There are 45 warnings regarding codeing style.
Please run checkpatch and clean them up so we can avoid followup checkpatch cleanup patches.
+sigjmp_buf signal_jmp_buf;
+static int userfaultfd(int flags) +{
- return syscall(SYS_userfaultfd, flags);
+}
+static void handle_fatal(int c) +{
- if (!signal_jump_set)
return;
- siglongjmp(signal_jmp_buf, c);
+}
+static int pidfd_open(pid_t pid, unsigned int flags) +{
- return syscall(SYS_pidfd_open, pid, flags);
+}
+/*
- Enable our signal catcher and try to read/write the specified buffer. The
- return value indicates whether the read/write succeeds without a fatal
- signal.
- */
+static bool try_access_buf(char *ptr, bool write) +{
- bool failed;
- /* Tell signal handler to jump back here on fatal signal. */
- signal_jump_set = true;
- /* If a fatal signal arose, we will jump back here and failed is set. */
- failed = sigsetjmp(signal_jmp_buf, 0) != 0;
- if (!failed) {
if (write) {
*ptr = 'x';
} else {
const volatile char *chr = ptr;
/* Force read. */
(void)*chr;
}
- }
- signal_jump_set = false;
- return !failed;
+}
+/* Try and read from a buffer, return true if no fatal signal. */ +static bool try_read_buf(char *ptr) +{
- return try_access_buf(ptr, false);
+}
+/* Try and write to a buffer, return true if no fatal signal. */ +static bool try_write_buf(char *ptr) +{
- return try_access_buf(ptr, true);
+}
+/*
- Try and BOTH read from AND write to a buffer, return true if BOTH operations
- succeed.
- */
+static bool try_read_write_buf(char *ptr) +{
- return try_read_buf(ptr) && try_write_buf(ptr);
+}
+FIXTURE(guard_pages) +{
- unsigned long page_size;
+};
+FIXTURE_SETUP(guard_pages) +{
- struct sigaction act = {
.sa_handler = &handle_fatal,
.sa_flags = SA_NODEFER,
- };
- sigemptyset(&act.sa_mask);
- if (sigaction(SIGSEGV, &act, NULL)) {
perror("sigaction");
ksft_exit_fail();
Replase the above two with ksft_exit_fail_perror() There is no need for perror("sigaction"); followed by ksft_exit_fail();
- }
- self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
+};
+FIXTURE_TEARDOWN(guard_pages) +{
- struct sigaction act = {
.sa_handler = SIG_DFL,
.sa_flags = SA_NODEFER,
- };
- sigemptyset(&act.sa_mask);
- sigaction(SIGSEGV, &act, NULL);
+}
+TEST_F(guard_pages, basic) +{
- const unsigned long NUM_PAGES = 10;
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANON, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Trivially assert we can touch the first page. */
- ASSERT_TRUE(try_read_write_buf(ptr));
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- /* Establish that 1st page SIGSEGV's. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- /* Ensure we can touch everything else.*/
- for (i = 1; i < NUM_PAGES; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Establish a guard page at the end of the mapping. */
- ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
MADV_GUARD_POISON), 0);
- /* Check that both guard pages result in SIGSEGV. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
- /* Unpoison the first. */
- ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_UNPOISON));
- /* Make sure we can touch it. */
- ASSERT_TRUE(try_read_write_buf(ptr));
- /* Unpoison the last. */
- ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
MADV_GUARD_UNPOISON));
- /* Make sure we can touch it. */
- ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
- /*
* Test setting a _range_ of pages, namely the first 3. The first of
* these be faulted in, so this also tests that we can poison backed
* pages.
*/
- ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure they are all poisoned. */
- for (i = 0; i < 3; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
This one here and
- /* Make sure the rest are not. */
- for (i = 3; i < NUM_PAGES; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Unpoison them. */
- ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
- /* Now make sure we can touch everything. */
- for (i = 0; i < NUM_PAGES; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now unpoison everything, make sure we don't remove existing entries */
- ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
- for (i = 0; i < NUM_PAGES * page_size; i += page_size) {
ASSERT_EQ(ptr[i], 'x');
- }
- ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0);
+}
+/* Assert that operations applied across multiple VMAs work as expected. */ +TEST_F(guard_pages, multi_vma) +{
- const unsigned long page_size = self->page_size;
- char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
- int i;
- /* Reserve a 100 page region over which we can install VMAs. */
- ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_region, MAP_FAILED);
- /* Place a VMA of 10 pages size at the start of the region. */
- ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr1, MAP_FAILED);
- /* Place a VMA of 5 pages size 50 pages into the region. */
- ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr2, MAP_FAILED);
- /* Place a VMA of 20 pages size at the end of the region. */
- ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr3, MAP_FAILED);
- /* Unmap gaps. */
- ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0);
- /*
* We end up with VMAs like this:
*
* 0 10 .. 50 55 .. 80 100
* [---] [---] [---]
*/
- /* Now poison the whole range and make sure all VMAs are poisoned. */
- /*
* madvise() is certifiable and lets you perform operations over gaps,
* everything works, but it indicates an error and errno is set to
* -ENOMEM. Also if anything runs out of memory it is set to
* -ENOMEM. You are meant to guess which is which.
*/
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), -1);
- ASSERT_EQ(errno, ENOMEM);
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
- }
- for (i = 0; i < 5; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr2[i * page_size]));
- }
- for (i = 0; i < 20; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr3[i * page_size]));
- }
- /* Now unpoison the range and assert the opposite. */
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), -1);
- ASSERT_EQ(errno, ENOMEM);
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr1[i * page_size]));
- }
- for (i = 0; i < 5; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr2[i * page_size]));
- }
- for (i = 0; i < 20; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr3[i * page_size]));
- }
- /* Now map incompatible VMAs in the gaps. */
- ptr = mmap(&ptr_region[10 * page_size], 40 * page_size,
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- ptr = mmap(&ptr_region[55 * page_size], 25 * page_size,
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /*
* We end up with VMAs like this:
*
* 0 10 .. 50 55 .. 80 100
* [---][xxxx][---][xxxx][---]
*
* Where 'x' signifies VMAs that cannot be merged with those adjacent to
* them.
*/
- /* Multiple VMAs adjacent to one another should result in no error. */
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), 0);
- for (i = 0; i < 100; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr_region[i * page_size]));
- }
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), 0);
- for (i = 0; i < 100; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr_region[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0);
+}
+/*
- Assert that batched operations performed using process_madvise() work as
- expected.
- */
+TEST_F(guard_pages, process_madvise) +{
- const unsigned long page_size = self->page_size;
- pid_t pid = getpid();
- int pidfd = pidfd_open(pid, 0);
- char *ptr_region, *ptr1, *ptr2, *ptr3;
- ssize_t count;
- struct iovec vec[6];
- ASSERT_NE(pidfd, -1);
- /* Reserve region to map over. */
- ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_region, MAP_FAILED);
- /* 10 pages offset 1 page into reserve region. */
- ptr1 = mmap(&ptr_region[page_size], 10 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr1, MAP_FAILED);
- /* We want poison markers at start/end of each VMA. */
- vec[0].iov_base = ptr1;
- vec[0].iov_len = page_size;
- vec[1].iov_base = &ptr1[9 * page_size];
- vec[1].iov_len = page_size;
- /* 5 pages offset 50 pages into reserve region. */
- ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr2, MAP_FAILED);
- vec[2].iov_base = ptr2;
- vec[2].iov_len = page_size;
- vec[3].iov_base = &ptr2[4 * page_size];
- vec[3].iov_len = page_size;
- /* 20 pages offset 79 pages into reserve region. */
- ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr3, MAP_FAILED);
- vec[4].iov_base = ptr3;
- vec[4].iov_len = page_size;
- vec[5].iov_base = &ptr3[19 * page_size];
- vec[5].iov_len = page_size;
- /* Free surrounding VMAs. */
- ASSERT_EQ(munmap(ptr_region, page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
- /* Now poison in one step. */
- count = process_madvise(pidfd, vec, 6, MADV_GUARD_POISON, 0);
- /* OK we don't have permission to do this, skip. */
- if (count == -1 && errno == EPERM)
ksft_exit_skip("No process_madvise() permissions\n");
Can you make this a bit more informative? What would user do if they see this message? Are they supposed to run the test as root?
- /* Returns the number of bytes advised. */
- ASSERT_EQ(count, 6 * page_size);
- /* Now make sure the poisoning was applied. */
- ASSERT_FALSE(try_read_write_buf(ptr1));
- ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size]));
- ASSERT_FALSE(try_read_write_buf(ptr2));
- ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size]));
- ASSERT_FALSE(try_read_write_buf(ptr3));
- ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
- /* Now do the same with unpoison... */
- count = process_madvise(pidfd, vec, 6, MADV_GUARD_UNPOISON, 0);
- /* ...and everything should now succeed. */
- ASSERT_TRUE(try_read_write_buf(ptr1));
- ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size]));
- ASSERT_TRUE(try_read_write_buf(ptr2));
- ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size]));
- ASSERT_TRUE(try_read_write_buf(ptr3));
- ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size]));
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr1, 10 * page_size), 0);
- ASSERT_EQ(munmap(ptr2, 5 * page_size), 0);
- ASSERT_EQ(munmap(ptr3, 20 * page_size), 0);
- close(pidfd);
+}
+/* Assert that unmapping ranges does not leave poison behind. */ +TEST_F(guard_pages, munmap) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new1, *ptr_new2;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison first and last pages. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Assert that they are poisoned. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size]));
- /* Unmap them. */
- ASSERT_EQ(munmap(ptr, page_size), 0);
- ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
- /* Map over them.*/
- ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new1, MAP_FAILED);
- ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new2, MAP_FAILED);
- /* Assert that they are now not poisoned. */
- ASSERT_TRUE(try_read_write_buf(ptr_new1));
- ASSERT_TRUE(try_read_write_buf(ptr_new2));
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Assert that mprotect() operations have no bearing on guard poison markers. */ +TEST_F(guard_pages, mprotect) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison the middle of the range. */
- ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
MADV_GUARD_POISON), 0);
- /* Assert that it is indeed poisoned. */
- ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size]));
- ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size]));
- /* Now make these pages read-only. */
- ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0);
- /* Make sure the range is still poisoned. */
- ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
- ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
- /* Make sure we can poison again without issue.*/
- ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
MADV_GUARD_POISON), 0);
- /* Make sure the range is, yet again, still poisoned. */
- ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
- ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
- /* Now unpoison the whole range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
- /* Make sure the whole range is readable. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Split and merge VMAs and make sure guard pages still behave. */ +TEST_F(guard_pages, split_merge) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison the whole range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure the whole range is poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now unmap some pages in the range so we split. */
- ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0);
- ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0);
- ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0);
- /* Make sure the remaining ranges are poisoned post-split. */
- for (i = 0; i < 2; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- for (i = 2; i < 5; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- for (i = 6; i < 8; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- for (i = 9; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now map them again - the unmap will have cleared the poison. */
- ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- /* Now make sure poisoning is as expected. */
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i == 2 || i == 5 || i == 8) {
ASSERT_TRUE(result);
} else {
ASSERT_FALSE(result);
}
- }
- /* Now poison everything again. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure the whole range is poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now split the range into three. */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
- /* Make sure the whole range is poisoned for read. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_buf(&ptr[i * page_size]));
- }
- /* Now reset protection bits so we merge the whole thing. */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
PROT_READ | PROT_WRITE), 0);
- /* Make sure the whole range is still poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Split range into 3 again... */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
- /* ...and unpoison the whole range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
- /* Make sure the whole range is remedied for read. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
- }
- /* Merge them again. */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
PROT_READ | PROT_WRITE), 0);
- /* Now ensure the merged range is remedied for read/write. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Assert that MADV_DONTNEED does not remove guard poison markers. */ +TEST_F(guard_pages, dontneed) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Back the whole range. */
- for (i = 0; i < 10; i++) {
ptr[i * page_size] = 'y';
- }
- /* Poison every other page. */
- for (i = 0; i < 10; i += 2) {
ASSERT_EQ(madvise(&ptr[i * page_size],
page_size, MADV_GUARD_POISON), 0);
- }
- /* Indicate that we don't need any of the range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0);
- /* Check to ensure poison markers are still in place. */
- for (i = 0; i < 10; i++) {
bool result = try_read_buf(&ptr[i * page_size]);
if (i % 2 == 0) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
/* Make sure we really did get reset to zero page. */
ASSERT_EQ(ptr[i * page_size], '\0');
}
/* Now write... */
result = try_write_buf(&ptr[i * page_size]);
/* ...and make sure same result. */
if (i % 2 == 0) {
You don't need { here
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
Same here.
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Assert that mlock()'ed pages work correctly with poison markers. */ +TEST_F(guard_pages, mlock) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Populate. */
- for (i = 0; i < 10; i++) {
ptr[i * page_size] = 'y';
- }
- /* Lock. */
- ASSERT_EQ(mlock(ptr, 10 * page_size), 0);
- /* Now try to poison, should fail with EINVAL. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), -1);
- ASSERT_EQ(errno, EINVAL);
- /* OK unlock. */
- ASSERT_EQ(munlock(ptr, 10 * page_size), 0);
- /* Poison first half of range, should now succeed. */
- ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure poison works. */
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
ASSERT_EQ(ptr[i * page_size], 'x');
}
- }
- /*
* Now lock the latter part of the range. We can't lock the poisoned
* pages, as this would result in the pages being populated and the
* poisoning would cause this to error out.
*/
- ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0);
- /*
* Now unpoison, we do not permit mlock()'d ranges to be remedied as it is
* a non-destructive operation.
*/
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
- /* Now check that everything is remedied. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/*
- Assert that moving, extending and shrinking memory via mremap() retains
- poison markers where possible.
- Moving a mapping alone should retain markers as they are.
- */
+TEST_F(guard_pages, mremap_move) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new;
- /* Map 5 pages. */
- ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Place poison markers at both ends of the 5 page span. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Make sure the poison is in effect. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Map a new region we will move this range into. Doing this ensures
* that we have reserved a range to map into.
*/
- ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE,
-1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size,
MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new);
- /* Make sure the poison is retained. */
- ASSERT_FALSE(try_read_write_buf(ptr_new));
- ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size]));
- /*
* Clean up - we only need reference the new pointer as we overwrote the
* PROT_NONE range and moved the existing one.
*/
- munmap(ptr_new, 5 * page_size);
+}
+/*
- Assert that moving, extending and shrinking memory via mremap() retains
- poison markers where possible.
- Expanding should retain, only now in different position. The user will have
- to unpoison manually to fix up (they'd have to do the same if it were a
- PROT_NONE mapping)
- */
+TEST_F(guard_pages, mremap_expand) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new;
- /* Map 10 pages... */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* ...But unmap the last 5 so we can ensure we can expand into them. */
- ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
- /* Place poison markers at both ends of the 5 page span. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Make sure the poison is in effect. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Now expand to 10 pages. */
- ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Make sure the poison is retained in its original positions. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Reserve a region which we can move to and expand into. */
- ptr_new = mmap(NULL, 20 * page_size, PROT_NONE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- /* Now move and expand into it. */
- ptr = mremap(ptr, 10 * page_size, 20 * page_size,
MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new);
- ASSERT_EQ(ptr, ptr_new);
- /* Again, make sure the poison is retained in its original
* positions. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /*
* A real user would have to unpoison, but would reasonably expect all
* characteristics of the mapping to be retained, including poison
* markers.
*/
- /* Cleanup. */
- munmap(ptr, 20 * page_size);
+} +/*
- Assert that moving, extending and shrinking memory via mremap() retains
- poison markers where possible.
- Shrinking will result in markers that are shrunk over being removed. Again,
- if the user were using a PROT_NONE mapping they'd have to manually fix this
- up also so this is OK.
- */
+TEST_F(guard_pages, mremap_shrink) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- /* Map 5 pages. */
- ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Place poison markers at both ends of the 5 page span. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Make sure the poison is in effect. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Now shrink to 3 pages. */
- ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE);
- ASSERT_NE(ptr, MAP_FAILED);
- /* We expect the poison marker at the start to be retained... */
- ASSERT_FALSE(try_read_write_buf(ptr));
- /* ...But remaining pages will not have poison markers. */
- for (i = 1; i < 3; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
- }
- /*
* As with expansion, a real user would have to unpoison and fixup. But
* you'd have to do similar manual things with PROT_NONE mappings too.
*/
- /*
* If we expand back to the original size, the end marker will, of
* course, no longer be present.
*/
- ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Again, we expect the poison marker at the start to be retained... */
- ASSERT_FALSE(try_read_write_buf(ptr));
- /* ...But remaining pages will not have poison markers. */
- for (i = 1; i < 5; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
- }
- /* Cleanup. */
- munmap(ptr, 5 * page_size);
+}
+/*
- Assert that forking a process with VMAs that do not have VM_WIPEONFORK set
- retain guard pages.
- */
+TEST_F(guard_pages, fork) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- pid_t pid;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison the first 5 pages. */
- ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
- pid = fork();
- ASSERT_NE(pid, -1);
- if (!pid) {
/* This is the child process now. */
/* Assert that the poisoning is in effect. */
for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
}
/* Now unpoison the range.*/
ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
exit(0);
- }
- /* Parent process. */
- /* Parent simply waits on child. */
- waitpid(pid, NULL, 0);
- /* Child unpoison does not impact parent page table state. */
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/*
- Assert that forking a process with VMAs that do have VM_WIPEONFORK set
- behave as expected.
- */
+TEST_F(guard_pages, fork_wipeonfork) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- pid_t pid;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Mark wipe on fork. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0);
- /* Poison the first 5 pages. */
- ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
- pid = fork();
- ASSERT_NE(pid, -1);
- if (!pid) {
/* This is the child process now. */
/* Poison will have been wiped. */
for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
}
exit(0);
- }
- /* Parent process. */
- waitpid(pid, NULL, 0);
- /* Poison should be in effect.*/
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that MADV_FREE frees poison entries as expected. */ +TEST_F(guard_pages, lazyfree) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Ensure poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Lazyfree range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0);
- /* This should simply clear the poison markers. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */ +TEST_F(guard_pages, populate) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Populate read should error out... */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1);
- ASSERT_EQ(errno, EFAULT);
- /* ...as should populate write. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1);
- ASSERT_EQ(errno, EFAULT);
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove poison markers. */ +TEST_F(guard_pages, cold_pageout) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Ensured poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now mark cold. This should have no impact on poison markers. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0);
- /* Should remain poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* OK, now page out. This should equally, have no effect on markers. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
- /* Should remain poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that guard pages do not break userfaultd. */ +TEST_F(guard_pages, uffd) +{
- const unsigned long page_size = self->page_size;
- int uffd;
- char *ptr;
- int i;
- struct uffdio_api api = {
.api = UFFD_API,
.features = 0,
- };
- struct uffdio_register reg;
- struct uffdio_range range;
- /* Set up uffd. */
- uffd = userfaultfd(0);
- if (uffd == -1 && errno == EPERM)
ksft_exit_skip("No uffd permissions\n");
Same comment here about a user friendly message that say what user shoul do
- ASSERT_NE(uffd, -1);
- ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Register the range with uffd. */
- range.start = (unsigned long)ptr;
- range.len = 10 * page_size;
- reg.range = range;
- reg.mode = UFFDIO_REGISTER_MODE_MISSING;
- ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, ®), 0);
- /* Poison the range. This should not trigger the uffd. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* The poisoning should behave as usual with no uffd intervention. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0);
- close(uffd);
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+TEST_HARNESS_MAIN
thanks, -- Shuah
On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
On 10/17/24 14:42, Lorenzo Stoakes wrote:
Utilise the kselftest harmness to implement tests for the guard page
Splleing NIT - harmness -> harness
implementation.
We start by implement basic tests asserting that guard pages can be
implmenting? By the way checkpatch will catch spelling stuuf. Please see comments about warnings below.
Thanks. The majority of the checkpatch warnings are invalid so I missed this. Will fix on respin.
established (poisoned), cleared (remedied) and that touching poisoned pages result in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine different operations on regions containing poison markers behave to ensure correct behaviour:
- Operations over multiple VMAs operate as expected.
- Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in batches works correctly.
- Ensuring that munmap() correctly tears down poison markers.
- Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers.
- Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another.
- Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
- Ensuring that mlock()'ing a range containing poison markers does not cause issues.
- Ensuring that mremap() can move a poisoned range and retain poison markers.
- Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range).
- Ensuring that mremap() can shrink a poisoned range and retain poison markers.
- Ensuring that forking a process correctly retains poison markers.
- Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
- Ensuring that lazyfree simply clears poison markers.
- Ensuring that userfaultfd can co-exist with guard pages.
- Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
- Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
Good summary of test. Does the test require root access? If so does it check and skip appropriately?
Thanks and some do, in those cases we skip.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 3 files changed, 1170 insertions(+) create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..2ab0ff3ba5a0 --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <assert.h> +#include <fcntl.h> +#include <setjmp.h> +#include <errno.h> +#include <linux/userfaultfd.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h>
+/* These may not yet be available in the uAPI so define if not. */
+#ifndef MADV_GUARD_POISON +#define MADV_GUARD_POISON 102 +#endif
+#ifndef MADV_GUARD_UNPOISON +#define MADV_GUARD_UNPOISON 103 +#endif
+volatile bool signal_jump_set;
Can you add a comment about why volatile is needed.
I'm not sure it's really necessary, it's completely standard to do this with signal handling and is one of the exceptions to the 'volatile considered harmful' rule.
By the way did you happen to run checkpatck on this. There are several instances where single statement blocks with braces {}
I noticed a few and ran checkpatch on your patch. There are 45 warnings regarding codeing style.
Please run checkpatch and clean them up so we can avoid followup checkpatch cleanup patches.
No sorry I won't, checkpatch isn't infallible and series trying to 'clean up' things that aren't issues will be a waste of everybody's time.
There are cases where removing the braces results in compilation error so obviously we can't remove them, in others it seriously hurts readability.
+sigjmp_buf signal_jmp_buf;
+static int userfaultfd(int flags) +{
- return syscall(SYS_userfaultfd, flags);
+}
+static void handle_fatal(int c) +{
- if (!signal_jump_set)
return;
- siglongjmp(signal_jmp_buf, c);
+}
+static int pidfd_open(pid_t pid, unsigned int flags) +{
- return syscall(SYS_pidfd_open, pid, flags);
+}
+/*
- Enable our signal catcher and try to read/write the specified buffer. The
- return value indicates whether the read/write succeeds without a fatal
- signal.
- */
+static bool try_access_buf(char *ptr, bool write) +{
- bool failed;
- /* Tell signal handler to jump back here on fatal signal. */
- signal_jump_set = true;
- /* If a fatal signal arose, we will jump back here and failed is set. */
- failed = sigsetjmp(signal_jmp_buf, 0) != 0;
- if (!failed) {
if (write) {
*ptr = 'x';
} else {
const volatile char *chr = ptr;
/* Force read. */
(void)*chr;
}
- }
- signal_jump_set = false;
- return !failed;
+}
+/* Try and read from a buffer, return true if no fatal signal. */ +static bool try_read_buf(char *ptr) +{
- return try_access_buf(ptr, false);
+}
+/* Try and write to a buffer, return true if no fatal signal. */ +static bool try_write_buf(char *ptr) +{
- return try_access_buf(ptr, true);
+}
+/*
- Try and BOTH read from AND write to a buffer, return true if BOTH operations
- succeed.
- */
+static bool try_read_write_buf(char *ptr) +{
- return try_read_buf(ptr) && try_write_buf(ptr);
+}
+FIXTURE(guard_pages) +{
- unsigned long page_size;
+};
+FIXTURE_SETUP(guard_pages) +{
- struct sigaction act = {
.sa_handler = &handle_fatal,
.sa_flags = SA_NODEFER,
- };
- sigemptyset(&act.sa_mask);
- if (sigaction(SIGSEGV, &act, NULL)) {
perror("sigaction");
ksft_exit_fail();
Replase the above two with ksft_exit_fail_perror() There is no need for perror("sigaction"); followed by ksft_exit_fail();
Ack will do on respin. Thanks for the tip! Wasn't aware of that.
- }
- self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
+};
+FIXTURE_TEARDOWN(guard_pages) +{
- struct sigaction act = {
.sa_handler = SIG_DFL,
.sa_flags = SA_NODEFER,
- };
- sigemptyset(&act.sa_mask);
- sigaction(SIGSEGV, &act, NULL);
+}
+TEST_F(guard_pages, basic) +{
- const unsigned long NUM_PAGES = 10;
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANON, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Trivially assert we can touch the first page. */
- ASSERT_TRUE(try_read_write_buf(ptr));
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- /* Establish that 1st page SIGSEGV's. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- /* Ensure we can touch everything else.*/
- for (i = 1; i < NUM_PAGES; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Establish a guard page at the end of the mapping. */
- ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
MADV_GUARD_POISON), 0);
- /* Check that both guard pages result in SIGSEGV. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
- /* Unpoison the first. */
- ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_UNPOISON));
- /* Make sure we can touch it. */
- ASSERT_TRUE(try_read_write_buf(ptr));
- /* Unpoison the last. */
- ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
MADV_GUARD_UNPOISON));
- /* Make sure we can touch it. */
- ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
- /*
* Test setting a _range_ of pages, namely the first 3. The first of
* these be faulted in, so this also tests that we can poison backed
* pages.
*/
- ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure they are all poisoned. */
- for (i = 0; i < 3; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
This one here and
Refer you to other comments around this, it hurts readability to do that.
- /* Make sure the rest are not. */
- for (i = 3; i < NUM_PAGES; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Unpoison them. */
- ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
- /* Now make sure we can touch everything. */
- for (i = 0; i < NUM_PAGES; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now unpoison everything, make sure we don't remove existing entries */
- ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
- for (i = 0; i < NUM_PAGES * page_size; i += page_size) {
ASSERT_EQ(ptr[i], 'x');
- }
- ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0);
+}
+/* Assert that operations applied across multiple VMAs work as expected. */ +TEST_F(guard_pages, multi_vma) +{
- const unsigned long page_size = self->page_size;
- char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
- int i;
- /* Reserve a 100 page region over which we can install VMAs. */
- ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_region, MAP_FAILED);
- /* Place a VMA of 10 pages size at the start of the region. */
- ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr1, MAP_FAILED);
- /* Place a VMA of 5 pages size 50 pages into the region. */
- ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr2, MAP_FAILED);
- /* Place a VMA of 20 pages size at the end of the region. */
- ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr3, MAP_FAILED);
- /* Unmap gaps. */
- ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0);
- /*
* We end up with VMAs like this:
*
* 0 10 .. 50 55 .. 80 100
* [---] [---] [---]
*/
- /* Now poison the whole range and make sure all VMAs are poisoned. */
- /*
* madvise() is certifiable and lets you perform operations over gaps,
* everything works, but it indicates an error and errno is set to
* -ENOMEM. Also if anything runs out of memory it is set to
* -ENOMEM. You are meant to guess which is which.
*/
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), -1);
- ASSERT_EQ(errno, ENOMEM);
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
- }
- for (i = 0; i < 5; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr2[i * page_size]));
- }
- for (i = 0; i < 20; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr3[i * page_size]));
- }
- /* Now unpoison the range and assert the opposite. */
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), -1);
- ASSERT_EQ(errno, ENOMEM);
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr1[i * page_size]));
- }
- for (i = 0; i < 5; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr2[i * page_size]));
- }
- for (i = 0; i < 20; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr3[i * page_size]));
- }
- /* Now map incompatible VMAs in the gaps. */
- ptr = mmap(&ptr_region[10 * page_size], 40 * page_size,
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- ptr = mmap(&ptr_region[55 * page_size], 25 * page_size,
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /*
* We end up with VMAs like this:
*
* 0 10 .. 50 55 .. 80 100
* [---][xxxx][---][xxxx][---]
*
* Where 'x' signifies VMAs that cannot be merged with those adjacent to
* them.
*/
- /* Multiple VMAs adjacent to one another should result in no error. */
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), 0);
- for (i = 0; i < 100; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr_region[i * page_size]));
- }
- ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), 0);
- for (i = 0; i < 100; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr_region[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0);
+}
+/*
- Assert that batched operations performed using process_madvise() work as
- expected.
- */
+TEST_F(guard_pages, process_madvise) +{
- const unsigned long page_size = self->page_size;
- pid_t pid = getpid();
- int pidfd = pidfd_open(pid, 0);
- char *ptr_region, *ptr1, *ptr2, *ptr3;
- ssize_t count;
- struct iovec vec[6];
- ASSERT_NE(pidfd, -1);
- /* Reserve region to map over. */
- ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_region, MAP_FAILED);
- /* 10 pages offset 1 page into reserve region. */
- ptr1 = mmap(&ptr_region[page_size], 10 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr1, MAP_FAILED);
- /* We want poison markers at start/end of each VMA. */
- vec[0].iov_base = ptr1;
- vec[0].iov_len = page_size;
- vec[1].iov_base = &ptr1[9 * page_size];
- vec[1].iov_len = page_size;
- /* 5 pages offset 50 pages into reserve region. */
- ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr2, MAP_FAILED);
- vec[2].iov_base = ptr2;
- vec[2].iov_len = page_size;
- vec[3].iov_base = &ptr2[4 * page_size];
- vec[3].iov_len = page_size;
- /* 20 pages offset 79 pages into reserve region. */
- ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size,
PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr3, MAP_FAILED);
- vec[4].iov_base = ptr3;
- vec[4].iov_len = page_size;
- vec[5].iov_base = &ptr3[19 * page_size];
- vec[5].iov_len = page_size;
- /* Free surrounding VMAs. */
- ASSERT_EQ(munmap(ptr_region, page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0);
- ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
- /* Now poison in one step. */
- count = process_madvise(pidfd, vec, 6, MADV_GUARD_POISON, 0);
- /* OK we don't have permission to do this, skip. */
- if (count == -1 && errno == EPERM)
ksft_exit_skip("No process_madvise() permissions\n");
Can you make this a bit more informative? What would user do if they see this message? Are they supposed to run the test as root?
OK can update on respin.
- /* Returns the number of bytes advised. */
- ASSERT_EQ(count, 6 * page_size);
- /* Now make sure the poisoning was applied. */
- ASSERT_FALSE(try_read_write_buf(ptr1));
- ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size]));
- ASSERT_FALSE(try_read_write_buf(ptr2));
- ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size]));
- ASSERT_FALSE(try_read_write_buf(ptr3));
- ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
- /* Now do the same with unpoison... */
- count = process_madvise(pidfd, vec, 6, MADV_GUARD_UNPOISON, 0);
- /* ...and everything should now succeed. */
- ASSERT_TRUE(try_read_write_buf(ptr1));
- ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size]));
- ASSERT_TRUE(try_read_write_buf(ptr2));
- ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size]));
- ASSERT_TRUE(try_read_write_buf(ptr3));
- ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size]));
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr1, 10 * page_size), 0);
- ASSERT_EQ(munmap(ptr2, 5 * page_size), 0);
- ASSERT_EQ(munmap(ptr3, 20 * page_size), 0);
- close(pidfd);
+}
+/* Assert that unmapping ranges does not leave poison behind. */ +TEST_F(guard_pages, munmap) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new1, *ptr_new2;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison first and last pages. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Assert that they are poisoned. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size]));
- /* Unmap them. */
- ASSERT_EQ(munmap(ptr, page_size), 0);
- ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
- /* Map over them.*/
- ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new1, MAP_FAILED);
- ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new2, MAP_FAILED);
- /* Assert that they are now not poisoned. */
- ASSERT_TRUE(try_read_write_buf(ptr_new1));
- ASSERT_TRUE(try_read_write_buf(ptr_new2));
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Assert that mprotect() operations have no bearing on guard poison markers. */ +TEST_F(guard_pages, mprotect) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison the middle of the range. */
- ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
MADV_GUARD_POISON), 0);
- /* Assert that it is indeed poisoned. */
- ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size]));
- ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size]));
- /* Now make these pages read-only. */
- ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0);
- /* Make sure the range is still poisoned. */
- ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
- ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
- /* Make sure we can poison again without issue.*/
- ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
MADV_GUARD_POISON), 0);
- /* Make sure the range is, yet again, still poisoned. */
- ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
- ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
- /* Now unpoison the whole range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
- /* Make sure the whole range is readable. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Split and merge VMAs and make sure guard pages still behave. */ +TEST_F(guard_pages, split_merge) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison the whole range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure the whole range is poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now unmap some pages in the range so we split. */
- ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0);
- ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0);
- ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0);
- /* Make sure the remaining ranges are poisoned post-split. */
- for (i = 0; i < 2; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- for (i = 2; i < 5; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- for (i = 6; i < 8; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- for (i = 9; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now map them again - the unmap will have cleared the poison. */
- ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- /* Now make sure poisoning is as expected. */
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i == 2 || i == 5 || i == 8) {
ASSERT_TRUE(result);
} else {
ASSERT_FALSE(result);
}
- }
- /* Now poison everything again. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure the whole range is poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now split the range into three. */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
- /* Make sure the whole range is poisoned for read. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_buf(&ptr[i * page_size]));
- }
- /* Now reset protection bits so we merge the whole thing. */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
PROT_READ | PROT_WRITE), 0);
- /* Make sure the whole range is still poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Split range into 3 again... */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
- /* ...and unpoison the whole range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
- /* Make sure the whole range is remedied for read. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
- }
- /* Merge them again. */
- ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
- ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
PROT_READ | PROT_WRITE), 0);
- /* Now ensure the merged range is remedied for read/write. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Assert that MADV_DONTNEED does not remove guard poison markers. */ +TEST_F(guard_pages, dontneed) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Back the whole range. */
- for (i = 0; i < 10; i++) {
ptr[i * page_size] = 'y';
- }
- /* Poison every other page. */
- for (i = 0; i < 10; i += 2) {
ASSERT_EQ(madvise(&ptr[i * page_size],
page_size, MADV_GUARD_POISON), 0);
- }
- /* Indicate that we don't need any of the range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0);
- /* Check to ensure poison markers are still in place. */
- for (i = 0; i < 10; i++) {
bool result = try_read_buf(&ptr[i * page_size]);
if (i % 2 == 0) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
/* Make sure we really did get reset to zero page. */
ASSERT_EQ(ptr[i * page_size], '\0');
}
/* Now write... */
result = try_write_buf(&ptr[i * page_size]);
/* ...and make sure same result. */
if (i % 2 == 0) {
You don't need { here
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
Same here.
Removing these results in compilation failure so we can't :) this is probably/possibly a bug in how the assert macros work but fixing that is out of scope for this series.
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Assert that mlock()'ed pages work correctly with poison markers. */ +TEST_F(guard_pages, mlock) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Populate. */
- for (i = 0; i < 10; i++) {
ptr[i * page_size] = 'y';
- }
- /* Lock. */
- ASSERT_EQ(mlock(ptr, 10 * page_size), 0);
- /* Now try to poison, should fail with EINVAL. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), -1);
- ASSERT_EQ(errno, EINVAL);
- /* OK unlock. */
- ASSERT_EQ(munlock(ptr, 10 * page_size), 0);
- /* Poison first half of range, should now succeed. */
- ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
- /* Make sure poison works. */
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
ASSERT_EQ(ptr[i * page_size], 'x');
}
- }
- /*
* Now lock the latter part of the range. We can't lock the poisoned
* pages, as this would result in the pages being populated and the
* poisoning would cause this to error out.
*/
- ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0);
- /*
* Now unpoison, we do not permit mlock()'d ranges to be remedied as it is
* a non-destructive operation.
*/
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
- /* Now check that everything is remedied. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/*
- Assert that moving, extending and shrinking memory via mremap() retains
- poison markers where possible.
- Moving a mapping alone should retain markers as they are.
- */
+TEST_F(guard_pages, mremap_move) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new;
- /* Map 5 pages. */
- ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Place poison markers at both ends of the 5 page span. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Make sure the poison is in effect. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Map a new region we will move this range into. Doing this ensures
* that we have reserved a range to map into.
*/
- ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE,
-1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size,
MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new);
- /* Make sure the poison is retained. */
- ASSERT_FALSE(try_read_write_buf(ptr_new));
- ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size]));
- /*
* Clean up - we only need reference the new pointer as we overwrote the
* PROT_NONE range and moved the existing one.
*/
- munmap(ptr_new, 5 * page_size);
+}
+/*
- Assert that moving, extending and shrinking memory via mremap() retains
- poison markers where possible.
- Expanding should retain, only now in different position. The user will have
- to unpoison manually to fix up (they'd have to do the same if it were a
- PROT_NONE mapping)
- */
+TEST_F(guard_pages, mremap_expand) +{
- const unsigned long page_size = self->page_size;
- char *ptr, *ptr_new;
- /* Map 10 pages... */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* ...But unmap the last 5 so we can ensure we can expand into them. */
- ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
- /* Place poison markers at both ends of the 5 page span. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Make sure the poison is in effect. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Now expand to 10 pages. */
- ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Make sure the poison is retained in its original positions. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Reserve a region which we can move to and expand into. */
- ptr_new = mmap(NULL, 20 * page_size, PROT_NONE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr_new, MAP_FAILED);
- /* Now move and expand into it. */
- ptr = mremap(ptr, 10 * page_size, 20 * page_size,
MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new);
- ASSERT_EQ(ptr, ptr_new);
- /* Again, make sure the poison is retained in its original
* positions. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /*
* A real user would have to unpoison, but would reasonably expect all
* characteristics of the mapping to be retained, including poison
* markers.
*/
- /* Cleanup. */
- munmap(ptr, 20 * page_size);
+} +/*
- Assert that moving, extending and shrinking memory via mremap() retains
- poison markers where possible.
- Shrinking will result in markers that are shrunk over being removed. Again,
- if the user were using a PROT_NONE mapping they'd have to manually fix this
- up also so this is OK.
- */
+TEST_F(guard_pages, mremap_shrink) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- /* Map 5 pages. */
- ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Place poison markers at both ends of the 5 page span. */
- ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
- ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
- /* Make sure the poison is in effect. */
- ASSERT_FALSE(try_read_write_buf(ptr));
- ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
- /* Now shrink to 3 pages. */
- ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE);
- ASSERT_NE(ptr, MAP_FAILED);
- /* We expect the poison marker at the start to be retained... */
- ASSERT_FALSE(try_read_write_buf(ptr));
- /* ...But remaining pages will not have poison markers. */
- for (i = 1; i < 3; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
- }
- /*
* As with expansion, a real user would have to unpoison and fixup. But
* you'd have to do similar manual things with PROT_NONE mappings too.
*/
- /*
* If we expand back to the original size, the end marker will, of
* course, no longer be present.
*/
- ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Again, we expect the poison marker at the start to be retained... */
- ASSERT_FALSE(try_read_write_buf(ptr));
- /* ...But remaining pages will not have poison markers. */
- for (i = 1; i < 5; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
- }
- /* Cleanup. */
- munmap(ptr, 5 * page_size);
+}
+/*
- Assert that forking a process with VMAs that do not have VM_WIPEONFORK set
- retain guard pages.
- */
+TEST_F(guard_pages, fork) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- pid_t pid;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison the first 5 pages. */
- ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
- pid = fork();
- ASSERT_NE(pid, -1);
- if (!pid) {
/* This is the child process now. */
/* Assert that the poisoning is in effect. */
for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
}
/* Now unpoison the range.*/
ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
exit(0);
- }
- /* Parent process. */
- /* Parent simply waits on child. */
- waitpid(pid, NULL, 0);
- /* Child unpoison does not impact parent page table state. */
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/*
- Assert that forking a process with VMAs that do have VM_WIPEONFORK set
- behave as expected.
- */
+TEST_F(guard_pages, fork_wipeonfork) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- pid_t pid;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Mark wipe on fork. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0);
- /* Poison the first 5 pages. */
- ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
- pid = fork();
- ASSERT_NE(pid, -1);
- if (!pid) {
/* This is the child process now. */
/* Poison will have been wiped. */
for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
}
exit(0);
- }
- /* Parent process. */
- waitpid(pid, NULL, 0);
- /* Poison should be in effect.*/
- for (i = 0; i < 10; i++) {
bool result = try_read_write_buf(&ptr[i * page_size]);
if (i < 5) {
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
}
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that MADV_FREE frees poison entries as expected. */ +TEST_F(guard_pages, lazyfree) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Ensure poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Lazyfree range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0);
- /* This should simply clear the poison markers. */
- for (i = 0; i < 10; i++) {
ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */ +TEST_F(guard_pages, populate) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Populate read should error out... */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1);
- ASSERT_EQ(errno, EFAULT);
- /* ...as should populate write. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1);
- ASSERT_EQ(errno, EFAULT);
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove poison markers. */ +TEST_F(guard_pages, cold_pageout) +{
- const unsigned long page_size = self->page_size;
- char *ptr;
- int i;
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Poison range. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* Ensured poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Now mark cold. This should have no impact on poison markers. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0);
- /* Should remain poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* OK, now page out. This should equally, have no effect on markers. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
- /* Should remain poisoned. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+/* Ensure that guard pages do not break userfaultd. */ +TEST_F(guard_pages, uffd) +{
- const unsigned long page_size = self->page_size;
- int uffd;
- char *ptr;
- int i;
- struct uffdio_api api = {
.api = UFFD_API,
.features = 0,
- };
- struct uffdio_register reg;
- struct uffdio_range range;
- /* Set up uffd. */
- uffd = userfaultfd(0);
- if (uffd == -1 && errno == EPERM)
ksft_exit_skip("No uffd permissions\n");
Same comment here about a user friendly message that say what user shoul do
Ack will update on respin.
- ASSERT_NE(uffd, -1);
- ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
- /* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
- ASSERT_NE(ptr, MAP_FAILED);
- /* Register the range with uffd. */
- range.start = (unsigned long)ptr;
- range.len = 10 * page_size;
- reg.range = range;
- reg.mode = UFFDIO_REGISTER_MODE_MISSING;
- ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, ®), 0);
- /* Poison the range. This should not trigger the uffd. */
- ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
- /* The poisoning should behave as usual with no uffd intervention. */
- for (i = 0; i < 10; i++) {
ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
- }
- /* Cleanup. */
- ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0);
- close(uffd);
- ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+TEST_HARNESS_MAIN
thanks, -- Shuah
On 10/18/24 01:12, Lorenzo Stoakes wrote:
On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
On 10/17/24 14:42, Lorenzo Stoakes wrote:
Utilise the kselftest harmness to implement tests for the guard page
Splleing NIT - harmness -> harness
implementation.
We start by implement basic tests asserting that guard pages can be
implmenting? By the way checkpatch will catch spelling stuuf. Please see comments about warnings below.
Thanks. The majority of the checkpatch warnings are invalid so I missed this. Will fix on respin.
established (poisoned), cleared (remedied) and that touching poisoned pages result in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine different operations on regions containing poison markers behave to ensure correct behaviour:
- Operations over multiple VMAs operate as expected.
- Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in batches works correctly.
- Ensuring that munmap() correctly tears down poison markers.
- Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers.
- Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another.
- Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
- Ensuring that mlock()'ing a range containing poison markers does not cause issues.
- Ensuring that mremap() can move a poisoned range and retain poison markers.
- Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range).
- Ensuring that mremap() can shrink a poisoned range and retain poison markers.
- Ensuring that forking a process correctly retains poison markers.
- Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
- Ensuring that lazyfree simply clears poison markers.
- Ensuring that userfaultfd can co-exist with guard pages.
- Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
- Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
Good summary of test. Does the test require root access? If so does it check and skip appropriately?
Thanks and some do, in those cases we skip.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 3 files changed, 1170 insertions(+) create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..2ab0ff3ba5a0 --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <assert.h> +#include <fcntl.h> +#include <setjmp.h> +#include <errno.h> +#include <linux/userfaultfd.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h>
+/* These may not yet be available in the uAPI so define if not. */
+#ifndef MADV_GUARD_POISON +#define MADV_GUARD_POISON 102 +#endif
+#ifndef MADV_GUARD_UNPOISON +#define MADV_GUARD_UNPOISON 103 +#endif
+volatile bool signal_jump_set;
Can you add a comment about why volatile is needed.
I'm not sure it's really necessary, it's completely standard to do this with signal handling and is one of the exceptions to the 'volatile considered harmful' rule.
By the way did you happen to run checkpatck on this. There are several instances where single statement blocks with braces {}
I noticed a few and ran checkpatch on your patch. There are 45 warnings regarding codeing style.
Please run checkpatch and clean them up so we can avoid followup checkpatch cleanup patches.
No sorry I won't, checkpatch isn't infallible and series trying to 'clean up' things that aren't issues will be a waste of everybody's time.
Sorry - this violates the coding styles and makes it hard to read.
See process/coding-style.rst:
Do not unnecessarily use braces where a single statement will do.
.. code-block:: c
if (condition) action();
and
.. code-block:: c
if (condition) do_this(); else do_that();
This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both branches:
.. code-block:: c
if (condition) { do_this(); do_that(); } else { otherwise(); }
Also, use braces when a loop contains more than a single simple statement:
.. code-block:: c
while (condition) { if (test) do_something(); }
thanks, -- Shuah
On Fri, Oct 18, 2024 at 09:32:17AM -0600, Shuah Khan wrote:
On 10/18/24 01:12, Lorenzo Stoakes wrote:
On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
On 10/17/24 14:42, Lorenzo Stoakes wrote:
Utilise the kselftest harmness to implement tests for the guard page
Splleing NIT - harmness -> harness
implementation.
We start by implement basic tests asserting that guard pages can be
implmenting? By the way checkpatch will catch spelling stuuf. Please see comments about warnings below.
Thanks. The majority of the checkpatch warnings are invalid so I missed this. Will fix on respin.
established (poisoned), cleared (remedied) and that touching poisoned pages result in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine different operations on regions containing poison markers behave to ensure correct behaviour:
- Operations over multiple VMAs operate as expected.
- Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in batches works correctly.
- Ensuring that munmap() correctly tears down poison markers.
- Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers.
- Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another.
- Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
- Ensuring that mlock()'ing a range containing poison markers does not cause issues.
- Ensuring that mremap() can move a poisoned range and retain poison markers.
- Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range).
- Ensuring that mremap() can shrink a poisoned range and retain poison markers.
- Ensuring that forking a process correctly retains poison markers.
- Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
- Ensuring that lazyfree simply clears poison markers.
- Ensuring that userfaultfd can co-exist with guard pages.
- Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
- Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
Good summary of test. Does the test require root access? If so does it check and skip appropriately?
Thanks and some do, in those cases we skip.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 3 files changed, 1170 insertions(+) create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..2ab0ff3ba5a0 --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <assert.h> +#include <fcntl.h> +#include <setjmp.h> +#include <errno.h> +#include <linux/userfaultfd.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h>
+/* These may not yet be available in the uAPI so define if not. */
+#ifndef MADV_GUARD_POISON +#define MADV_GUARD_POISON 102 +#endif
+#ifndef MADV_GUARD_UNPOISON +#define MADV_GUARD_UNPOISON 103 +#endif
+volatile bool signal_jump_set;
Can you add a comment about why volatile is needed.
I'm not sure it's really necessary, it's completely standard to do this with signal handling and is one of the exceptions to the 'volatile considered harmful' rule.
By the way did you happen to run checkpatck on this. There are several instances where single statement blocks with braces {}
I noticed a few and ran checkpatch on your patch. There are 45 warnings regarding codeing style.
Please run checkpatch and clean them up so we can avoid followup checkpatch cleanup patches.
No sorry I won't, checkpatch isn't infallible and series trying to 'clean up' things that aren't issues will be a waste of everybody's time.
Sorry - this violates the coding styles and makes it hard to read.
See process/coding-style.rst:
Do not unnecessarily use braces where a single statement will do.
.. code-block:: c
if (condition) action();
and
.. code-block:: c
if (condition) do_this(); else do_that();
This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both branches:
.. code-block:: c
if (condition) { do_this(); do_that(); } else { otherwise(); }
Also, use braces when a loop contains more than a single simple statement:
.. code-block:: c
while (condition) { if (test) do_something(); }
thanks, -- Shuah
Shuah, quoting coding standards to an experienced kernel developer (maintainer now) is maybe not the best way to engage here + it may have been more productive for you to first engage on why it is I'm deviating here.
Firstly, as I said, the code _does not compile_ if I do not use braces in many cases. This is probably an issue with the macros, but it is out of scope for this series for me to fix that.
'Fixing' these cases results in:
CC guard-pages guard-pages.c: In function ‘guard_pages_split_merge’: guard-pages.c:566:17: error: ‘else’ without a previous ‘if’ 566 | else | ^~~~ guard-pages.c: In function ‘guard_pages_dontneed’: guard-pages.c:666:17: error: ‘else’ without a previous ‘if’ 666 | else | ^~~~ guard-pages.c: In function ‘guard_pages_fork’: guard-pages.c:957:17: error: ‘else’ without a previous ‘if’ 957 | else | ^~~~ guard-pages.c: In function ‘guard_pages_fork_wipeonfork’: guard-pages.c:1010:17: error: ‘else’ without a previous ‘if’ 1010 | else | ^~~~
In other cases I am simply not a fan of single line loops where there is a lot of compound stuff going on:
for (i = 0; i < 10; i++) { ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size])); }
vs.
for (i = 0; i < 10; i++) ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
When there are very many loops like this. This is kind of a test-specific thing, you'd maybe put more effort into splitting this up + have less repetition in non-test code.
I'm not going to die on the hill of single-line-for-loops though, so if you insist I'll change those.
However I simply _cannot_ change the if/else blocks that cause compilation errors.
This is what I mean about checkpatch being fallible. It's also fallible in other cases, like variable declarations via macro (understandably).
Expecting checkpatch to give zero warnings is simply unattainable, unfortunately.
As you seem adamant about strict adherence to checkpatch, and I always try to be accommodating where I can be, I suggest I fix everything _except where it breaks the compilation_ does that work for you?
Thanks.
On Fri, Oct 18, 2024 at 05:07:20PM +0100, Lorenzo Stoakes wrote: [snip]
Firstly, as I said, the code _does not compile_ if I do not use braces in many cases. This is probably an issue with the macros, but it is out of scope for this series for me to fix that.
'Fixing' these cases results in:
CC guard-pages guard-pages.c: In function ‘guard_pages_split_merge’: guard-pages.c:566:17: error: ‘else’ without a previous ‘if’ 566 | else | ^~~~ guard-pages.c: In function ‘guard_pages_dontneed’: guard-pages.c:666:17: error: ‘else’ without a previous ‘if’ 666 | else | ^~~~ guard-pages.c: In function ‘guard_pages_fork’: guard-pages.c:957:17: error: ‘else’ without a previous ‘if’ 957 | else | ^~~~ guard-pages.c: In function ‘guard_pages_fork_wipeonfork’: guard-pages.c:1010:17: error: ‘else’ without a previous ‘if’ 1010 | else | ^~~~
[snip]
An added note - the fact this happens makes the macros suspect everywhere, and I am concerned single-lining them might break or cause failures not to propagate perhaps, which led to me _never_ single-lining them I believe and accounts for a bunch of the warnings.
I can go through and manually test each one to make sure before I single-line them if necessary.
On 10/18/24 10:07, Lorenzo Stoakes wrote:
On Fri, Oct 18, 2024 at 09:32:17AM -0600, Shuah Khan wrote:
On 10/18/24 01:12, Lorenzo Stoakes wrote:
On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
On 10/17/24 14:42, Lorenzo Stoakes wrote:
Utilise the kselftest harmness to implement tests for the guard page
Splleing NIT - harmness -> harness
implementation.
We start by implement basic tests asserting that guard pages can be
implmenting? By the way checkpatch will catch spelling stuuf. Please see comments about warnings below.
Thanks. The majority of the checkpatch warnings are invalid so I missed this. Will fix on respin.
established (poisoned), cleared (remedied) and that touching poisoned pages result in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine different operations on regions containing poison markers behave to ensure correct behaviour:
- Operations over multiple VMAs operate as expected.
- Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in batches works correctly.
- Ensuring that munmap() correctly tears down poison markers.
- Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers.
- Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another.
- Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
- Ensuring that mlock()'ing a range containing poison markers does not cause issues.
- Ensuring that mremap() can move a poisoned range and retain poison markers.
- Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range).
- Ensuring that mremap() can shrink a poisoned range and retain poison markers.
- Ensuring that forking a process correctly retains poison markers.
- Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
- Ensuring that lazyfree simply clears poison markers.
- Ensuring that userfaultfd can co-exist with guard pages.
- Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
- Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
Good summary of test. Does the test require root access? If so does it check and skip appropriately?
Thanks and some do, in those cases we skip.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 3 files changed, 1170 insertions(+) create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..2ab0ff3ba5a0 --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <assert.h> +#include <fcntl.h> +#include <setjmp.h> +#include <errno.h> +#include <linux/userfaultfd.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h>
+/* These may not yet be available in the uAPI so define if not. */
+#ifndef MADV_GUARD_POISON +#define MADV_GUARD_POISON 102 +#endif
+#ifndef MADV_GUARD_UNPOISON +#define MADV_GUARD_UNPOISON 103 +#endif
+volatile bool signal_jump_set;
Can you add a comment about why volatile is needed.
I'm not sure it's really necessary, it's completely standard to do this with signal handling and is one of the exceptions to the 'volatile considered harmful' rule.
By the way did you happen to run checkpatck on this. There are several instances where single statement blocks with braces {}
I noticed a few and ran checkpatch on your patch. There are 45 warnings regarding codeing style.
Please run checkpatch and clean them up so we can avoid followup checkpatch cleanup patches.
No sorry I won't, checkpatch isn't infallible and series trying to 'clean up' things that aren't issues will be a waste of everybody's time.
Sorry - this violates the coding styles and makes it hard to read.
See process/coding-style.rst:
Do not unnecessarily use braces where a single statement will do.
.. code-block:: c
if (condition) action();
and
.. code-block:: c
if (condition) do_this(); else do_that();
This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both branches:
.. code-block:: c
if (condition) { do_this(); do_that(); } else { otherwise(); }
Also, use braces when a loop contains more than a single simple statement:
.. code-block:: c
while (condition) { if (test) do_something(); }
thanks, -- Shuah
Shuah, quoting coding standards to an experienced kernel developer (maintainer now) is maybe not the best way to engage here + it may have been more productive for you to first engage on why it is I'm deviating here.
Firstly, as I said, the code _does not compile_ if I do not use braces in many cases. This is probably an issue with the macros, but it is out of scope for this series for me to fix that.
I am not trying to throw the book at you. When I see 45 of them I have to ask the reasons why.
'Fixing' these cases results in:
CC guard-pages guard-pages.c: In function ‘guard_pages_split_merge’: guard-pages.c:566:17: error: ‘else’ without a previous ‘if’ 566 | else | ^~~~ guard-pages.c: In function ‘guard_pages_dontneed’: guard-pages.c:666:17: error: ‘else’ without a previous ‘if’ 666 | else | ^~~~ guard-pages.c: In function ‘guard_pages_fork’: guard-pages.c:957:17: error: ‘else’ without a previous ‘if’ 957 | else | ^~~~ guard-pages.c: In function ‘guard_pages_fork_wipeonfork’: guard-pages.c:1010:17: error: ‘else’ without a previous ‘if’ 1010 | else | ^~~~
In other cases I am simply not a fan of single line loops where there is a lot of compound stuff going on:
for (i = 0; i < 10; i++) { ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size])); }
vs.
for (i = 0; i < 10; i++) ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
When there are very many loops like this. This is kind of a test-specific thing, you'd maybe put more effort into splitting this up + have less repetition in non-test code.
I'm not going to die on the hill of single-line-for-loops though, so if you insist I'll change those.
However I simply _cannot_ change the if/else blocks that cause compilation errors.
This is what I mean about checkpatch being fallible. It's also fallible in other cases, like variable declarations via macro (understandably).
Expecting checkpatch to give zero warnings is simply unattainable, unfortunately.
As you seem adamant about strict adherence to checkpatch, and I always try to be accommodating where I can be, I suggest I fix everything _except where it breaks the compilation_ does that work for you?
Yes Please. If you leave these in here as soon as the patch hits next, we start seeing a flood of patches. It becomes harder to fix these later due to merge conflicts.
It becomes a patch overload issue. Yes I would like to see the ones that don't result in compile errors fixed.
thanks, -- Shuah
Thanks.
On 10/18/24 10:07, Lorenzo Stoakes wrote:
On Fri, Oct 18, 2024 at 09:32:17AM -0600, Shuah Khan wrote:
On 10/18/24 01:12, Lorenzo Stoakes wrote:
On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
On 10/17/24 14:42, Lorenzo Stoakes wrote:
Utilise the kselftest harmness to implement tests for the guard page
Splleing NIT - harmness -> harness
implementation.
We start by implement basic tests asserting that guard pages can be
implmenting? By the way checkpatch will catch spelling stuuf. Please see comments about warnings below.
Thanks. The majority of the checkpatch warnings are invalid so I missed this. Will fix on respin.
established (poisoned), cleared (remedied) and that touching poisoned pages result in SIGSEGV. We also assert that, in remedying a range, non-poison pages remain intact.
We then examine different operations on regions containing poison markers behave to ensure correct behaviour:
- Operations over multiple VMAs operate as expected.
- Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in batches works correctly.
- Ensuring that munmap() correctly tears down poison markers.
- Using mprotect() to adjust protection bits does not in any way override or cause issues with poison markers.
- Ensuring that splitting and merging VMAs around poison markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another.
- Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
- Ensuring that mlock()'ing a range containing poison markers does not cause issues.
- Ensuring that mremap() can move a poisoned range and retain poison markers.
- Ensuring that mremap() can expand a poisoned range and retain poison markers (perhaps moving the range).
- Ensuring that mremap() can shrink a poisoned range and retain poison markers.
- Ensuring that forking a process correctly retains poison markers.
- Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
- Ensuring that lazyfree simply clears poison markers.
- Ensuring that userfaultfd can co-exist with guard pages.
- Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering poison markers.
- Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove poison markers.
Good summary of test. Does the test require root access? If so does it check and skip appropriately?
Thanks and some do, in those cases we skip.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 3 files changed, 1170 insertions(+) create mode 100644 tools/testing/selftests/mm/guard-pages.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 689bbd520296..8f01f4da1c0d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -54,3 +54,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +guard-pages diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 02e1204971b0..15c734d6cfec 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += guard-pages ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c new file mode 100644 index 000000000000..2ab0ff3ba5a0 --- /dev/null +++ b/tools/testing/selftests/mm/guard-pages.c @@ -0,0 +1,1168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <assert.h> +#include <fcntl.h> +#include <setjmp.h> +#include <errno.h> +#include <linux/userfaultfd.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <unistd.h>
+/* These may not yet be available in the uAPI so define if not. */
+#ifndef MADV_GUARD_POISON +#define MADV_GUARD_POISON 102 +#endif
+#ifndef MADV_GUARD_UNPOISON +#define MADV_GUARD_UNPOISON 103 +#endif
+volatile bool signal_jump_set;
Can you add a comment about why volatile is needed.
I'm not sure it's really necessary, it's completely standard to do this with signal handling and is one of the exceptions to the 'volatile considered harmful' rule.
By the way did you happen to run checkpatck on this. There are several instances where single statement blocks with braces {}
I noticed a few and ran checkpatch on your patch. There are 45 warnings regarding codeing style.
Please run checkpatch and clean them up so we can avoid followup checkpatch cleanup patches.
No sorry I won't, checkpatch isn't infallible and series trying to 'clean up' things that aren't issues will be a waste of everybody's time.
Sorry - this violates the coding styles and makes it hard to read.
See process/coding-style.rst:
Do not unnecessarily use braces where a single statement will do.
.. code-block:: c
if (condition) action();
and
.. code-block:: c
if (condition) do_this(); else do_that();
This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both branches:
.. code-block:: c
if (condition) { do_this(); do_that(); } else { otherwise(); }
Also, use braces when a loop contains more than a single simple statement:
.. code-block:: c
while (condition) { if (test) do_something(); }
thanks, -- Shuah
Shuah, quoting coding standards to an experienced kernel developer (maintainer now) is maybe not the best way to engage here + it may have been more productive for you to first engage on why it is I'm deviating here.
This is not the only comment I gave you in this patch and your other patches.
thanks, -- Shuah
On Fri, Oct 18, 2024 at 10:25:55AM -0600, Shuah Khan wrote: [snip]
Sorry - this violates the coding styles and makes it hard to read.
See process/coding-style.rst:
Do not unnecessarily use braces where a single statement will do.
.. code-block:: c
if (condition) action();
and
.. code-block:: c
if (condition) do_this(); else do_that();
This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both branches:
.. code-block:: c
if (condition) { do_this(); do_that(); } else { otherwise(); }
Also, use braces when a loop contains more than a single simple statement:
.. code-block:: c
while (condition) { if (test) do_something(); }
thanks, -- Shuah
Shuah, quoting coding standards to an experienced kernel developer (maintainer now) is maybe not the best way to engage here + it may have been more productive for you to first engage on why it is I'm deviating here.
This is not the only comment I gave you in this patch and your other patches.
And to be clear - I absolutely appreciate your feedback and in all cases except ones which would result in things not compiling have acted (rather promptly) to address them.
I am simply pointing out that it's not my lack of knowledge of these rules that's the issue it's 3 things:
1. Some code doesn't compile following these rules 2. I therefore don't trust the macros in single-line statements 3. I am not a fan of for/while loops with heavily compound statements
1 and is unavoidable, 2 maybe is avoidable with auditing and 3 is subjective, and is something I have now undertaken to change.
I've heard a number of kernel developers' opinions on checkpatch and the overall consensus has been that, while the core style is sacrosanct, strict adherence to checkpatch warnings is not.
This may be worth a broader discussion (outside of this series). One must definitely account for cases where the syntactic analysis fails for instance so 'always strictly adhere' is out unfortunately (why I say it is fallible) however perhaps 'strict adherence except where it is obviously wrong' is a position one could take.
Your have a significantly different view on this and that's fine, as I said I always try to be accommodating.
Key thing is we've found a way forward which I will act on.
Thanks, Lorenzo
+CC linux-api (also should on future revisions)
On 10/17/24 22:42, Lorenzo Stoakes wrote:
Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process.
The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'.
After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning.
Any poisoned region of memory is also able to 'unpoisoned', that is, to have its poison markers removed.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this poisoning.
Poisoning can be performed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the poisoned page table mappings.
There is no concept of 'nested' poisoning, multiple attempts to poison a range will, after the first poisoning, have no effect.
Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned memory, so a user can safely unpoison a range of memory and clear only poison page table mappings leaving the rest intact.
The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it.
We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse).
We ensure that zapping performed by, for instance, MADV_DONTNEED, does not remove guard poison markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics).
It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the poisoning if they wish. We simply implement it in such a way as to provide the least surprising behaviour.
An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of poisoned ranges.
Suggested-by: Vlastimil Babka vbabka@suze.cz
Please fix the domain typo (also in patch 3 :)
Thanks for implementing this, Vlastimil
Suggested-by: Jann Horn jannh@google.com Suggested-by: David Hildenbrand david@redhat.com
v1
- Un-RFC'd as appears no major objections to approach but rather debate on implementation.
- Fixed issue with arches which need mmu_context.h and tlbfush.h. header imports in pagewalker logic to be able to use update_mmu_cache() as reported by the kernel test bot.
- Added comments in page walker logic to clarify who can use ops->install_pte and why as well as adding a check_ops_valid() helper function, as suggested by Christoph.
- Pass false in full parameter in pte_clear_not_present_full() as suggested by Jann.
- Stopped erroneously requiring a write lock for the poison operation as suggested by Jann and Suren.
- Moved anon_vma_prepare() to the start of madvise_guard_poison() to be consistent with how this is used elsewhere in the kernel as suggested by Jann.
- Avoid returning -EAGAIN if we are raced on page faults, just keep looping and duck out if a fatal signal is pending or a conditional reschedule is needed, as suggested by Jann.
- Avoid needlessly splitting huge PUDs and PMDs by specifying ACTION_CONTINUE, as suggested by Jann.
RFC https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (4): mm: pagewalk: add the ability to install PTEs mm: add PTE_MARKER_GUARD PTE marker mm: madvise: implement lightweight guard page mechanism selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/linux/mm_inline.h | 2 +- include/linux/pagewalk.h | 18 +- include/linux/swapops.h | 26 +- include/uapi/asm-generic/mman-common.h | 3 + mm/hugetlb.c | 3 + mm/internal.h | 6 + mm/madvise.c | 168 ++++ mm/memory.c | 18 +- mm/mprotect.c | 3 +- mm/mseal.c | 1 + mm/pagewalk.c | 200 ++-- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 18 files changed, 1564 insertions(+), 66 deletions(-) create mode 100644 tools/testing/selftests/mm/guard-pages.c
-- 2.46.2
On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
+CC linux-api (also should on future revisions)
They're cc'd :) assuming Linux API linux-api@vger.kernel.org is correct right?
On 10/17/24 22:42, Lorenzo Stoakes wrote:
Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process.
The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'.
After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning.
Any poisoned region of memory is also able to 'unpoisoned', that is, to have its poison markers removed.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this poisoning.
Poisoning can be performed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the poisoned page table mappings.
There is no concept of 'nested' poisoning, multiple attempts to poison a range will, after the first poisoning, have no effect.
Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned memory, so a user can safely unpoison a range of memory and clear only poison page table mappings leaving the rest intact.
The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it.
We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse).
We ensure that zapping performed by, for instance, MADV_DONTNEED, does not remove guard poison markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics).
It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the poisoning if they wish. We simply implement it in such a way as to provide the least surprising behaviour.
An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of poisoned ranges.
Suggested-by: Vlastimil Babka vbabka@suze.cz
Please fix the domain typo (also in patch 3 :)
Damnnn it! I can't believe I left that in. Sorry about that! Will fix on respin.
Hopefully not to suse.cs ;)
Thanks for implementing this, Vlastimil
Thanks!
Suggested-by: Jann Horn jannh@google.com Suggested-by: David Hildenbrand david@redhat.com
v1
- Un-RFC'd as appears no major objections to approach but rather debate on implementation.
- Fixed issue with arches which need mmu_context.h and tlbfush.h. header imports in pagewalker logic to be able to use update_mmu_cache() as reported by the kernel test bot.
- Added comments in page walker logic to clarify who can use ops->install_pte and why as well as adding a check_ops_valid() helper function, as suggested by Christoph.
- Pass false in full parameter in pte_clear_not_present_full() as suggested by Jann.
- Stopped erroneously requiring a write lock for the poison operation as suggested by Jann and Suren.
- Moved anon_vma_prepare() to the start of madvise_guard_poison() to be consistent with how this is used elsewhere in the kernel as suggested by Jann.
- Avoid returning -EAGAIN if we are raced on page faults, just keep looping and duck out if a fatal signal is pending or a conditional reschedule is needed, as suggested by Jann.
- Avoid needlessly splitting huge PUDs and PMDs by specifying ACTION_CONTINUE, as suggested by Jann.
RFC https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (4): mm: pagewalk: add the ability to install PTEs mm: add PTE_MARKER_GUARD PTE marker mm: madvise: implement lightweight guard page mechanism selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/linux/mm_inline.h | 2 +- include/linux/pagewalk.h | 18 +- include/linux/swapops.h | 26 +- include/uapi/asm-generic/mman-common.h | 3 + mm/hugetlb.c | 3 + mm/internal.h | 6 + mm/madvise.c | 168 ++++ mm/memory.c | 18 +- mm/mprotect.c | 3 +- mm/mseal.c | 1 + mm/pagewalk.c | 200 ++-- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 18 files changed, 1564 insertions(+), 66 deletions(-) create mode 100644 tools/testing/selftests/mm/guard-pages.c
-- 2.46.2
On Fri, Oct 18, 2024 at 05:17:56PM +0100, Lorenzo Stoakes wrote:
On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
+CC linux-api (also should on future revisions)
They're cc'd :) assuming Linux API linux-api@vger.kernel.org is correct right?
As discussed on IRC, no I was being a little slow here and hadn't realised you'd added them, apologies!
Will add them on future respins, sorry guys :)
On 10/17/24 22:42, Lorenzo Stoakes wrote:
Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process.
The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'.
After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning.
Any poisoned region of memory is also able to 'unpoisoned', that is, to have its poison markers removed.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this poisoning.
Poisoning can be performed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the poisoned page table mappings.
There is no concept of 'nested' poisoning, multiple attempts to poison a range will, after the first poisoning, have no effect.
Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned memory, so a user can safely unpoison a range of memory and clear only poison page table mappings leaving the rest intact.
The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it.
We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse).
We ensure that zapping performed by, for instance, MADV_DONTNEED, does not remove guard poison markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics).
It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the poisoning if they wish. We simply implement it in such a way as to provide the least surprising behaviour.
An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of poisoned ranges.
Suggested-by: Vlastimil Babka vbabka@suze.cz
Please fix the domain typo (also in patch 3 :)
Damnnn it! I can't believe I left that in. Sorry about that! Will fix on respin.
Hopefully not to suse.cs ;)
Thanks for implementing this, Vlastimil
Thanks!
Suggested-by: Jann Horn jannh@google.com Suggested-by: David Hildenbrand david@redhat.com
v1
- Un-RFC'd as appears no major objections to approach but rather debate on implementation.
- Fixed issue with arches which need mmu_context.h and tlbfush.h. header imports in pagewalker logic to be able to use update_mmu_cache() as reported by the kernel test bot.
- Added comments in page walker logic to clarify who can use ops->install_pte and why as well as adding a check_ops_valid() helper function, as suggested by Christoph.
- Pass false in full parameter in pte_clear_not_present_full() as suggested by Jann.
- Stopped erroneously requiring a write lock for the poison operation as suggested by Jann and Suren.
- Moved anon_vma_prepare() to the start of madvise_guard_poison() to be consistent with how this is used elsewhere in the kernel as suggested by Jann.
- Avoid returning -EAGAIN if we are raced on page faults, just keep looping and duck out if a fatal signal is pending or a conditional reschedule is needed, as suggested by Jann.
- Avoid needlessly splitting huge PUDs and PMDs by specifying ACTION_CONTINUE, as suggested by Jann.
RFC https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (4): mm: pagewalk: add the ability to install PTEs mm: add PTE_MARKER_GUARD PTE marker mm: madvise: implement lightweight guard page mechanism selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/linux/mm_inline.h | 2 +- include/linux/pagewalk.h | 18 +- include/linux/swapops.h | 26 +- include/uapi/asm-generic/mman-common.h | 3 + mm/hugetlb.c | 3 + mm/internal.h | 6 + mm/madvise.c | 168 ++++ mm/memory.c | 18 +- mm/mprotect.c | 3 +- mm/mseal.c | 1 + mm/pagewalk.c | 200 ++-- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++ 18 files changed, 1564 insertions(+), 66 deletions(-) create mode 100644 tools/testing/selftests/mm/guard-pages.c
-- 2.46.2
linux-kselftest-mirror@lists.linaro.org