On Fri, Oct 25, 2024 at 11:44:56PM +0200, Vlastimil Babka wrote:
On 10/23/24 18:24, Lorenzo Stoakes wrote:
Implement a new lightweight guard page feature: regions of userland virtual memory that, when accessed, cause a fatal signal to be raised.
Currently users must establish PROT_NONE ranges to achieve this.
However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs.
In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges.
The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - establishing PTE markers such that accesses to them cause a fault followed by a SIGSEGV signal being raised.
This is achieved through the PTE marker mechanism, which we have extended to provide PTE_MARKER_GUARD, installed via the generic page walking logic, likewise extended for this purpose.
These guard ranges are established with MADV_GUARD_INSTALL. If the range in which they are installed contains any existing mappings, these will be zapped, i.e. the range freed and the memory unmapped (thus mimicking the behaviour of MADV_DONTNEED in this respect).
Any existing guard entries will be left untouched. There is therefore no nesting of guarded pages.
Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both instances the memory range may be reused at which point a user would expect guards to still be in place), but they are cleared via MADV_GUARD_REMOVE, process teardown or unmapping of memory ranges.
The guard property can be removed from ranges via MADV_GUARD_REMOVE. Should the ranges over which this is applied contain non-guard entries, those entries are left untouched; only guard entries are cleared.
We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive).
Racing page faults can repeatedly interrupt an attempt to install guard pages; each interruption results in a zap, and the whole process can end up being repeated. If this happens more often than would be expected in normal operation, we drop the locks and have the operation retried, which avoids lock contention in this scenario.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Thanks!
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -423,6 +423,12 @@ extern unsigned long highest_memmap_pfn;
  */
 #define MAX_RECLAIM_RETRIES 16
 
+/*
+ * Maximum number of attempts we make to install guard pages before we give up
+ * and return -ERESTARTNOINTR to have userspace try again.
+ */
+#define MAX_MADVISE_GUARD_RETRIES 3
+
Can't we simply put this in mm/madvise.c? I didn't find any usage elsewhere.
Sure, will move if there's a respin / can send a quick fix-patch next week if otherwise settled. Just felt vaguely 'neater' here for... spurious subjective squishy-brained reasons :)