The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem.
This was done so as to introduce the feature carefully and incrementally, and to be conservative about the various caveats and corner cases that apply to file-backed mappings but not to anonymous ones.
Now that this feature has landed in 6.13, it is time to revisit it and to extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping.
Guard regions can safely be permitted in read-only mappings because they only reduce access to the mapping, and when removed simply reinstate the existing attributes of the underlying VMA, meaning no access violations can occur.
While the change in kernel code introduced in this series is small, the majority of the effort here is spent in extending the testing to assert that the feature works correctly across numerous file-backed mapping scenarios.
Every single guard region self-test performed against anonymous memory (excluding those which are anon-only) has now been updated to also be performed against shmem and against a mapping of a file in the working directory.
This confirms that all cases also function correctly for file-backed guard regions.
In addition a number of other tests are added for specific file-backed mapping scenarios.
There are a number of other concerns that one might have with regard to guard regions, addressed below:
Readahead
~~~~~~~~~
Readahead is a process through which the page cache is populated on the assumption that sequential reads will occur, thus amortising I/O and, through a clever use of the PG_readahead folio flag established during major fault and checked upon minor fault, provides for asynchronous I/O to occur as data is processed, reducing I/O stalls as data is faulted in.
Guard regions do not alter this mechanism, which operates at the folio and fault level, but do of course prevent the faulting in of folios that would otherwise be mapped.
In the instance of a major fault prior to a guard region, synchronous readahead will occur, including the population of folios in the page cache which the guard region will, in the mapping in question, prevent access to.

In addition, if PG_readahead is set in a folio that is now inaccessible, this will prevent asynchronous readahead from occurring as it otherwise would.
However, there are mechanisms for heuristically resetting this within readahead regardless, which will 'recover' correct readahead behaviour.
Readahead presumes sequential data access; the presence of a guard region clearly indicates that, at least within the guard region, no such sequential access will occur, as it cannot occur there.
So this should have very little impact on any real workload. The far more important point is whether readahead causes incorrect or inappropriate mapping of ranges disallowed by the presence of guard regions - it does not, as readahead does not 'pre-fault' memory in this fashion.
At any rate, any mechanism which would attempt to do so would hit the usual page fault paths, which correctly handle PTE markers as with anonymous mappings.
Fault-Around
~~~~~~~~~~~~
The fault-around logic, in a similar vein to readahead, attempts to improve efficiency with regard to file-backed memory mappings, however it differs in that it does not try to fetch folios into the page cache that are about to be accessed, but rather pre-maps a range of folios around the faulting address.
Because guard regions make use of PTE markers, this case is relatively trivial, as it is already handled - see filemap_map_folio_range() and filemap_map_order0_folio(). In both instances, the solution is simply to keep the established page table mappings and let the fault handler take care of PTE markers, as per the comment:
	/*
	 * NOTE: If there're PTE markers, we'll leave them to be
	 * handled in the specific fault path, and it'll prohibit
	 * the fault-around logic.
	 */
This works, as establishing guard regions results in page table mappings with PTE markers, and clearing them removes them.
Truncation
~~~~~~~~~~
File truncation will not eliminate existing guard regions, as the truncation operation will ultimately zap the range via unmap_mapping_range(), which specifically excludes PTE markers.
Zapping
~~~~~~~
Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), which specifically deals with guard entries, leaving them intact except in instances such as process teardown or munmap() where they need to be removed.
Reclaim
~~~~~~~
When reclaim is performed on file-backed folios, it ultimately invokes try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() will ultimately abort the operation for the guard region mapping. If large, then check_pte() will determine that this is a non-device private entry/device-exclusive entry 'swap' PTE and thus abort the operation in that instance.
Therefore, no odd things happen in the instance of reclaim being attempted upon a file-backed guard region.
Hole Punching
~~~~~~~~~~~~~
This updates the page cache and ultimately invokes unmap_mapping_range(), which explicitly leaves PTE markers in place.
Because the establishment of guard regions zapped any existing mappings to file-backed folios, once the guard regions are removed then the hole-punched region will be faulted in as usual and everything will behave as expected.
Lorenzo Stoakes (4):
  mm: allow guard regions in file-backed and read-only mappings
  selftests/mm: rename guard-pages to guard-regions
  tools/selftests: expand all guard region tests to file-backed
  tools/selftests: add file/shmem-backed mapping guard region tests
 mm/madvise.c                              |   8 +-
 tools/testing/selftests/mm/.gitignore     |   2 +-
 tools/testing/selftests/mm/Makefile       |   2 +-
 .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
 4 files changed, 821 insertions(+), 112 deletions(-)
 rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
-- 2.48.1
There is no reason to disallow guard regions in file-backed mappings - readahead and fault-around both function correctly in the presence of PTE markers, and other operations relating to memory-mapped files equally function correctly.

Additionally, introducing guard regions in a read-only mapping only restricts the mapping further, so no access rights are violated by permitting this.

Removing this restriction allows guard regions to be used with read-only mapped files (such as executable files), which would otherwise not be permitted.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 mm/madvise.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6ecead476a80..e01e93e179a8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
 	if (!allow_locked)
 		disallowed |= VM_LOCKED;
 
-	if (!vma_is_anonymous(vma))
-		return false;
-
-	if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
-		return false;
-
-	return true;
+	return !(vma->vm_flags & disallowed);
 }
 
 static bool is_guard_pte_marker(pte_t ptent)
On 2/13/25 19:17, Lorenzo Stoakes wrote:
Reviewed-by: Vlastimil Babka vbabka@suse.cz
On 13.02.25 19:17, Lorenzo Stoakes wrote:
Acked-by: David Hildenbrand david@redhat.com
I assume these markers cannot completely prevent us from allocating pages/folios for the underlying file/pagecache ranges of these markers in the case of shmem during page faults, right?
On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
Acked-by: David Hildenbrand david@redhat.com
Thanks!
I assume these markers cannot completely prevent us from allocating pages/folios for the underlying file/pagecache ranges of these markers in the case of shmem during page faults, right?
If the markers are in place, then page faulting will result in a segfault. If we faulted in a shmem page then installed markers (which would zap the range), then the page cache will be populated, but obviously subject to standard reclaim.
If we perform synchronous readahead that includes (partially or fully) a guard region, we might major fault entries into the page cache that are then not accessible _from that mapping_; this is rather unavoidable, as readahead doesn't account for page table mappings, and should be largely trivial overhead (these folios are also reclaimable).
-- Cheers,
David / dhildenb
On 18.02.25 17:12, Lorenzo Stoakes wrote:
I assume these markers cannot completely prevent us from allocating pages/folios for the underlying file/pagecache ranges of these markers in the case of shmem during page faults, right?
If the markers are in place, then page faulting will result in a segfault. If we faulted in a shmem page then installed markers (which would zap the range), then the page cache will be populated, but obviously subject to standard reclaim.
Well, yes, (a) if there is swap and (b) if the noswap option was not specified for tmpfs.
Okay, so installing a guard entry might require punching a hole to get rid of any already-existing memory. But readahead (below) might mess it up.
If we perform synchronous readahead that includes (partially or fully) a guard region, we might major fault entries into the page cache that are then not accessible _from that mapping_; this is rather unavoidable, as readahead doesn't account for page table mappings, and should be largely trivial overhead (these folios are also reclaimable).
Right, that's what I had in mind: assume I have a single marker in a PMD, shmem might allocate a PMD THP to back that region, ignoring the marker hint. (so I think)
Thanks!
On Tue, Feb 18, 2025 at 05:17:20PM +0100, David Hildenbrand wrote:
Well, yes, (a) if there is swap and (b), if the noswap option was not specified for tmpfs.
Right, yeah if you don't have it set up such that dropping a reference to the folio doesn't drop the page altogether.
I think this matches expectation though in that you'd get the same results from an MADV_DONTNEED followed by faulting the page again.
Okay, so installing a guard entry might require punching a hole to get rid of any already-existing memory. But readahead (below) might mess it up.
Only if you are so concerned about avoiding the page cache being populated there that you want to do this :)
Readahead I think will not readahead into a holepunched region, as the hole punching extends to the fs layer _I believe_ - I have not checked the code for this, but I believe it actually changes the underlying file too, to say 'this part of the file is empty'?
(I did explicitly test hole punching with guard regions, by the by :)
On 18.02.25 17:21, Lorenzo Stoakes wrote:
I think this matches expectation though in that you'd get the same results from an MADV_DONTNEED followed by faulting the page again.
It might make sense to document that: installing a guard behaves just like MADV_DONTNEED; in case of a file, that means that the pagecache is left untouched.
Okay, so installing a guard entry might require punching a hole to get rid of any already-existing memory. But readahead (below) might mess it up.
Only if you are so concerned about avoiding the page cache being populated there that you want to do this :)
Readahead I think will not readahead into a holepunched region as the hole punching extends to the fs layer _I believe_ I have not checked the code for this, but I believe it actually changes the underlying file too right to say 'this part of the file is empty'?
Well, we are talking about shmem here ... not your ordinary fs backed by an actual file :)
On Tue, Feb 18, 2025 at 05:27:24PM +0100, David Hildenbrand wrote:
It might make sense to document that: installing a guard behaves just like MADV_DONTNEED; in case of a file, that means that the pagecache is left untouched.
More docs noooo! :P I will update the man pages when this is more obviously heading for landing in 6.15 accordingly.
Current man page documentation on this is:
'If the region maps memory pages those mappings will be replaced as part of the operation'
I think something like:
'If the region maps pages those mappings will be replaced as part of the operation. When guard regions are removed via MADV_GUARD_REMOVE, faulting in the page will behave as if that region had MADV_DONTNEED applied to it, that is anonymous ranges will be backed by newly allocated zeroed pages and file-backed ranges will be backed by the underlying file pages.'
Probably something less wordy than this...
Readahead I think will not readahead into a holepunched region as the hole punching extends to the fs layer _I believe_ I have not checked the code for this, but I believe it actually changes the underlying file too right to say 'this part of the file is empty'?
Well, we are talking about shmem here ... not your ordinary fs backed by an actual file :)
I am talking about both, I multitask ;)
On 18.02.25 17:49, Lorenzo Stoakes wrote:
'If the region maps pages those mappings will be replaced as part of the operation. When guard regions are removed via MADV_GUARD_REMOVE, faulting in the page will behave as if that region had MADV_DONTNEED applied to it, that is anonymous ranges will be backed by newly allocated zeroed pages and file-backed ranges will be backed by the underlying file pages.'
Probably something less wordy than this...
Yeah, but sounds good to me.
Readahead I think will not readahead into a holepunched region as the hole punching extends to the fs layer _I believe_ I have not checked the code for this, but I believe it actually changes the underlying file too right to say 'this part of the file is empty'?
Well, we are talking about shmem here ... not your ordinary fs backed by an actual file :)
I am talking about both, I multitask ;)
For !shmem, we should indeed not be messing with a sparse file structure.
On Tue, Feb 18, 2025 at 06:00:36PM +0100, David Hildenbrand wrote:
On 18.02.25 17:49, Lorenzo Stoakes wrote:
On Tue, Feb 18, 2025 at 05:27:24PM +0100, David Hildenbrand wrote:
On 18.02.25 17:21, Lorenzo Stoakes wrote:
On Tue, Feb 18, 2025 at 05:17:20PM +0100, David Hildenbrand wrote:
On 18.02.25 17:12, Lorenzo Stoakes wrote:
On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote: > On 13.02.25 19:17, Lorenzo Stoakes wrote: > > There is no reason to disallow guard regions in file-backed mappings - > > readahead and fault-around both function correctly in the presence of PTE > > markers, equally other operations relating to memory-mapped files function > > correctly. > > > > Additionally, read-only mappings if introducing guard-regions, only > > restrict the mapping further, which means there is no violation of any > > access rights by permitting this to be so. > > > > Removing this restriction allows for read-only mapped files (such as > > executable files) correctly which would otherwise not be permitted. > > > > Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com > > --- > > mm/madvise.c | 8 +------- > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 6ecead476a80..e01e93e179a8 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked) > > if (!allow_locked) > > disallowed |= VM_LOCKED; > > - if (!vma_is_anonymous(vma)) > > - return false; > > - > > - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE) > > - return false; > > - > > - return true; > > + return !(vma->vm_flags & disallowed); > > } > > static bool is_guard_pte_marker(pte_t ptent) > > Acked-by: David Hildenbrand david@redhat.com
Thanks!
> > I assume these markers cannot completely prevent us from allocating > pages/folios for these underlying file/pageache ranges of these markers in > case of shmem during page faults, right?
If the markers are in place, then page faulting will result in a segfault. If we faulted in a shmem page then installed markers (which would zap the range), then the page cache will be populated, but obviously subject to standard reclaim.
Well, yes, (a) if there is swap and (b), if the noswap option was not specified for tmpfs.
Right, yeah if you don't have it set up such that dropping a reference to the folio doesn't drop the page altogether.
I think this matches expectation though in that you'd get the same results from an MADV_DONTNEED followed by faulting the page again.
It might make sense to document that: installing a guard behaves just like MADV_DONTNEED; in case of a file, that means that the pagecache is left untouched.
More docs noooo! :P I will update the man pages when this is more obviously heading for landing in 6.15 accordingly.
Current man page documentation on this is:
'If the region maps memory pages those mappings will be replaced as part of the operation'
I think something like:
'If the region maps pages, those mappings will be replaced as part of the operation. When guard regions are removed via MADV_GUARD_REMOVE, faulting in the page will behave as if that region had MADV_DONTNEED applied to it; that is, anonymous ranges will be backed by newly allocated zeroed pages and file-backed ranges will be backed by the underlying file pages.'
Probably something less wordy than this...
Yeah, but sounds good to me.
Okay, so installing a guard entry might require punching a hole to get rid of any already-existing memory. But readahead (below) might mess it up.
Only if you are so concerned about avoiding the page cache being populated there that you want to do this :)
I think readahead will not read ahead into a hole-punched region, as the hole punching extends to the fs layer _I believe_. I have not checked the code for this, but I believe it actually changes the underlying file too, right, to say 'this part of the file is empty'?
Well, we are talking about shmem here ... not your ordinary fs backed by an actual file :)
I am talking about both, I multitask ;)
For !shmem, we should indeed not be messing with a sparse file structure.
Ah right, I get you, sorry, I missed your point here. Yeah, MADV_REMOVE for shmem will just explicitly drop the pages, and there is no underlying file to impact as there is with actual file-backed memory.
-- Cheers,
David / dhildenb
The feature formerly referred to as guard pages is more correctly referred to as 'guard regions', as in fact no pages are ever allocated in the process of installing the regions.
To avoid confusion, rename the tests accordingly.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/.gitignore     |  2 +-
 tools/testing/selftests/mm/Makefile       |  2 +-
 .../mm/{guard-pages.c => guard-regions.c} | 42 +++++++++----------
 3 files changed, 23 insertions(+), 23 deletions(-)
 rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (98%)
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 121000c28c10..c5241b193db8 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -57,4 +57,4 @@ droppable hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 -guard-pages +guard-regions diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 63ce39d024bb..8270895039d1 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -97,7 +97,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable -TEST_GEN_FILES += guard-pages +TEST_GEN_FILES += guard-regions
ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-regions.c similarity index 98% rename from tools/testing/selftests/mm/guard-pages.c rename to tools/testing/selftests/mm/guard-regions.c index ece37212a8a2..7a41cf9ffbdf 100644 --- a/tools/testing/selftests/mm/guard-pages.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -107,12 +107,12 @@ static bool try_read_write_buf(char *ptr) return try_read_buf(ptr) && try_write_buf(ptr); }
-FIXTURE(guard_pages) +FIXTURE(guard_regions) { unsigned long page_size; };
-FIXTURE_SETUP(guard_pages) +FIXTURE_SETUP(guard_regions) { struct sigaction act = { .sa_handler = &handle_fatal, @@ -126,7 +126,7 @@ FIXTURE_SETUP(guard_pages) self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); };
-FIXTURE_TEARDOWN(guard_pages) +FIXTURE_TEARDOWN(guard_regions) { struct sigaction act = { .sa_handler = SIG_DFL, @@ -137,7 +137,7 @@ FIXTURE_TEARDOWN(guard_pages) sigaction(SIGSEGV, &act, NULL); }
-TEST_F(guard_pages, basic) +TEST_F(guard_regions, basic) { const unsigned long NUM_PAGES = 10; const unsigned long page_size = self->page_size; @@ -231,7 +231,7 @@ TEST_F(guard_pages, basic) }
/* Assert that operations applied across multiple VMAs work as expected. */ -TEST_F(guard_pages, multi_vma) +TEST_F(guard_regions, multi_vma) { const unsigned long page_size = self->page_size; char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3; @@ -367,7 +367,7 @@ TEST_F(guard_pages, multi_vma) * Assert that batched operations performed using process_madvise() work as * expected. */ -TEST_F(guard_pages, process_madvise) +TEST_F(guard_regions, process_madvise) { const unsigned long page_size = self->page_size; pid_t pid = getpid(); @@ -467,7 +467,7 @@ TEST_F(guard_pages, process_madvise) }
/* Assert that unmapping ranges does not leave guard markers behind. */ -TEST_F(guard_pages, munmap) +TEST_F(guard_regions, munmap) { const unsigned long page_size = self->page_size; char *ptr, *ptr_new1, *ptr_new2; @@ -505,7 +505,7 @@ TEST_F(guard_pages, munmap) }
/* Assert that mprotect() operations have no bearing on guard markers. */ -TEST_F(guard_pages, mprotect) +TEST_F(guard_regions, mprotect) { const unsigned long page_size = self->page_size; char *ptr; @@ -553,7 +553,7 @@ TEST_F(guard_pages, mprotect) }
/* Split and merge VMAs and make sure guard pages still behave. */ -TEST_F(guard_pages, split_merge) +TEST_F(guard_regions, split_merge) { const unsigned long page_size = self->page_size; char *ptr, *ptr_new; @@ -684,7 +684,7 @@ TEST_F(guard_pages, split_merge) }
/* Assert that MADV_DONTNEED does not remove guard markers. */ -TEST_F(guard_pages, dontneed) +TEST_F(guard_regions, dontneed) { const unsigned long page_size = self->page_size; char *ptr; @@ -737,7 +737,7 @@ TEST_F(guard_pages, dontneed) }
/* Assert that mlock()'ed pages work correctly with guard markers. */ -TEST_F(guard_pages, mlock) +TEST_F(guard_regions, mlock) { const unsigned long page_size = self->page_size; char *ptr; @@ -810,7 +810,7 @@ TEST_F(guard_pages, mlock) * * - Moving a mapping alone should retain markers as they are. */ -TEST_F(guard_pages, mremap_move) +TEST_F(guard_regions, mremap_move) { const unsigned long page_size = self->page_size; char *ptr, *ptr_new; @@ -857,7 +857,7 @@ TEST_F(guard_pages, mremap_move) * will have to remove guard pages manually to fix up (they'd have to do the * same if it were a PROT_NONE mapping). */ -TEST_F(guard_pages, mremap_expand) +TEST_F(guard_regions, mremap_expand) { const unsigned long page_size = self->page_size; char *ptr, *ptr_new; @@ -920,7 +920,7 @@ TEST_F(guard_pages, mremap_expand) * if the user were using a PROT_NONE mapping they'd have to manually fix this * up also so this is OK. */ -TEST_F(guard_pages, mremap_shrink) +TEST_F(guard_regions, mremap_shrink) { const unsigned long page_size = self->page_size; char *ptr; @@ -984,7 +984,7 @@ TEST_F(guard_pages, mremap_shrink) * Assert that forking a process with VMAs that do not have VM_WIPEONFORK set * retain guard pages. */ -TEST_F(guard_pages, fork) +TEST_F(guard_regions, fork) { const unsigned long page_size = self->page_size; char *ptr; @@ -1039,7 +1039,7 @@ TEST_F(guard_pages, fork) * Assert expected behaviour after we fork populated ranges of anonymous memory * and then guard and unguard the range. */ -TEST_F(guard_pages, fork_cow) +TEST_F(guard_regions, fork_cow) { const unsigned long page_size = self->page_size; char *ptr; @@ -1110,7 +1110,7 @@ TEST_F(guard_pages, fork_cow) * Assert that forking a process with VMAs that do have VM_WIPEONFORK set * behave as expected. */ -TEST_F(guard_pages, fork_wipeonfork) +TEST_F(guard_regions, fork_wipeonfork) { const unsigned long page_size = self->page_size; char *ptr; @@ -1160,7 +1160,7 @@ TEST_F(guard_pages, fork_wipeonfork) }
/* Ensure that MADV_FREE retains guard entries as expected. */ -TEST_F(guard_pages, lazyfree) +TEST_F(guard_regions, lazyfree) { const unsigned long page_size = self->page_size; char *ptr; @@ -1196,7 +1196,7 @@ TEST_F(guard_pages, lazyfree) }
/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */ -TEST_F(guard_pages, populate) +TEST_F(guard_regions, populate) { const unsigned long page_size = self->page_size; char *ptr; @@ -1222,7 +1222,7 @@ TEST_F(guard_pages, populate) }
/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove guard markers. */ -TEST_F(guard_pages, cold_pageout) +TEST_F(guard_regions, cold_pageout) { const unsigned long page_size = self->page_size; char *ptr; @@ -1268,7 +1268,7 @@ TEST_F(guard_pages, cold_pageout) }
/* Ensure that guard pages do not break userfaultd. */ -TEST_F(guard_pages, uffd) +TEST_F(guard_regions, uffd) { const unsigned long page_size = self->page_size; int uffd;
On 2/13/25 19:17, Lorenzo Stoakes wrote:
The feature formerly referred to as guard pages is more correctly referred to as 'guard regions', as in fact no pages are ever allocated in the process of installing the regions.
To avoid confusion, rename the tests accordingly.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
Extend the guard region tests to allow for test fixture variants for anon, shmem, and local files.
This allows us to assert that each of the expected behaviours of anonymous memory also applies correctly to file-backed mappings (both shmem and a file created locally in the current working directory) and thus asserts the same correctness guarantees as all the remaining tests do.
The fixture teardown is now performed in the parent process rather than in forked child processes, meaning cleanup is always performed, including unlinking any generated temporary files.
Additionally the variant fixture data type now contains an enum value indicating the type of backing store and the mmap() invocation is abstracted to allow for the mapping of whichever backing store the variant is testing.
We adjust tests as necessary to account for the fact they may now reference files rather than anonymous memory.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/guard-regions.c | 290 +++++++++++++++------
 1 file changed, 205 insertions(+), 85 deletions(-)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 7a41cf9ffbdf..0469c783f4fa 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -6,6 +6,7 @@ #include <assert.h> #include <errno.h> #include <fcntl.h> +#include <linux/limits.h> #include <linux/userfaultfd.h> #include <setjmp.h> #include <signal.h> @@ -37,6 +38,79 @@ static sigjmp_buf signal_jmp_buf; */ #define FORCE_READ(x) (*(volatile typeof(x) *)x)
+/* + * How is the test backing the mapping being tested? + */ +enum backing_type { + ANON_BACKED, + SHMEM_BACKED, + LOCAL_FILE_BACKED, +}; + +FIXTURE(guard_regions) +{ + unsigned long page_size; + char path[PATH_MAX]; + int fd; +}; + +FIXTURE_VARIANT(guard_regions) +{ + enum backing_type backing; +}; + +FIXTURE_VARIANT_ADD(guard_regions, anon) +{ + .backing = ANON_BACKED, +}; + +FIXTURE_VARIANT_ADD(guard_regions, shmem) +{ + .backing = SHMEM_BACKED, +}; + +FIXTURE_VARIANT_ADD(guard_regions, file) +{ + .backing = LOCAL_FILE_BACKED, +}; + +static bool is_anon_backed(const FIXTURE_VARIANT(guard_regions) * variant) +{ + switch (variant->backing) { + case ANON_BACKED: + case SHMEM_BACKED: + return true; + default: + return false; + } +} + +static void *mmap_(FIXTURE_DATA(guard_regions) * self, + const FIXTURE_VARIANT(guard_regions) * variant, + void *addr, size_t length, int prot, int extra_flags, + off_t offset) +{ + int fd; + int flags = extra_flags; + + switch (variant->backing) { + case ANON_BACKED: + flags |= MAP_PRIVATE | MAP_ANON; + fd = -1; + break; + case SHMEM_BACKED: + case LOCAL_FILE_BACKED: + flags |= MAP_SHARED; + fd = self->fd; + break; + default: + ksft_exit_fail(); + break; + } + + return mmap(addr, length, prot, flags, fd, offset); +} + static int userfaultfd(int flags) { return syscall(SYS_userfaultfd, flags); @@ -107,12 +181,7 @@ static bool try_read_write_buf(char *ptr) return try_read_buf(ptr) && try_write_buf(ptr); }
-FIXTURE(guard_regions) -{ - unsigned long page_size; -}; - -FIXTURE_SETUP(guard_regions) +static void setup_sighandler(void) { struct sigaction act = { .sa_handler = &handle_fatal, @@ -122,11 +191,9 @@ FIXTURE_SETUP(guard_regions) sigemptyset(&act.sa_mask); if (sigaction(SIGSEGV, &act, NULL)) ksft_exit_fail_perror("sigaction"); +}
- self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); -}; - -FIXTURE_TEARDOWN(guard_regions) +static void teardown_sighandler(void) { struct sigaction act = { .sa_handler = SIG_DFL, @@ -137,6 +204,48 @@ FIXTURE_TEARDOWN(guard_regions) sigaction(SIGSEGV, &act, NULL); }
+static int open_file(const char *prefix, char *path) +{ + int fd; + + snprintf(path, PATH_MAX, "%sguard_regions_test_file_XXXXXX", prefix); + fd = mkstemp(path); + if (fd < 0) + ksft_exit_fail_perror("mkstemp"); + + return fd; +} + +FIXTURE_SETUP(guard_regions) +{ + self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); + setup_sighandler(); + + if (variant->backing == ANON_BACKED) + return; + + self->fd = open_file( + variant->backing == SHMEM_BACKED ? "/tmp/" : "", + self->path); + + /* We truncate file to at least 100 pages, tests can modify as needed. */ + ASSERT_EQ(ftruncate(self->fd, 100 * self->page_size), 0); +}; + +FIXTURE_TEARDOWN_PARENT(guard_regions) +{ + teardown_sighandler(); + + if (variant->backing == ANON_BACKED) + return; + + if (self->fd >= 0) + close(self->fd); + + if (self->path[0] != '\0') + unlink(self->path); +} + TEST_F(guard_regions, basic) { const unsigned long NUM_PAGES = 10; @@ -144,8 +253,8 @@ TEST_F(guard_regions, basic) char *ptr; int i;
- ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE, - MAP_PRIVATE | MAP_ANON, -1, 0); + ptr = mmap_(self, variant, NULL, NUM_PAGES * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
/* Trivially assert we can touch the first page. */ @@ -238,25 +347,23 @@ TEST_F(guard_regions, multi_vma) int i;
/* Reserve a 100 page region over which we can install VMAs. */ - ptr_region = mmap(NULL, 100 * page_size, PROT_NONE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_region = mmap_(self, variant, NULL, 100 * page_size, + PROT_NONE, 0, 0); ASSERT_NE(ptr_region, MAP_FAILED);
/* Place a VMA of 10 pages size at the start of the region. */ - ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr1 = mmap_(self, variant, ptr_region, 10 * page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr1, MAP_FAILED);
/* Place a VMA of 5 pages size 50 pages into the region. */ - ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size, - PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr2 = mmap_(self, variant, &ptr_region[50 * page_size], 5 * page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr2, MAP_FAILED);
/* Place a VMA of 20 pages size at the end of the region. */ - ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size, - PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr3 = mmap_(self, variant, &ptr_region[80 * page_size], 20 * page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr3, MAP_FAILED);
/* Unmap gaps. */ @@ -326,13 +433,11 @@ TEST_F(guard_regions, multi_vma) }
/* Now map incompatible VMAs in the gaps. */ - ptr = mmap(&ptr_region[10 * page_size], 40 * page_size, - PROT_READ | PROT_WRITE | PROT_EXEC, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, &ptr_region[10 * page_size], 40 * page_size, + PROT_READ | PROT_WRITE | PROT_EXEC, MAP_FIXED, 0); ASSERT_NE(ptr, MAP_FAILED); - ptr = mmap(&ptr_region[55 * page_size], 25 * page_size, - PROT_READ | PROT_WRITE | PROT_EXEC, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, &ptr_region[55 * page_size], 25 * page_size, + PROT_READ | PROT_WRITE | PROT_EXEC, MAP_FIXED, 0); ASSERT_NE(ptr, MAP_FAILED);
/* @@ -379,8 +484,8 @@ TEST_F(guard_regions, process_madvise) ASSERT_NE(pidfd, -1);
/* Reserve region to map over. */ - ptr_region = mmap(NULL, 100 * page_size, PROT_NONE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_region = mmap_(self, variant, NULL, 100 * page_size, + PROT_NONE, 0, 0); ASSERT_NE(ptr_region, MAP_FAILED);
/* @@ -388,9 +493,8 @@ TEST_F(guard_regions, process_madvise) * overwrite existing entries and test this code path against * overwriting existing entries. */ - ptr1 = mmap(&ptr_region[page_size], 10 * page_size, - PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE | MAP_POPULATE, -1, 0); + ptr1 = mmap_(self, variant, &ptr_region[page_size], 10 * page_size, + PROT_READ | PROT_WRITE, MAP_FIXED | MAP_POPULATE, 0); ASSERT_NE(ptr1, MAP_FAILED); /* We want guard markers at start/end of each VMA. */ vec[0].iov_base = ptr1; @@ -399,9 +503,8 @@ TEST_F(guard_regions, process_madvise) vec[1].iov_len = page_size;
/* 5 pages offset 50 pages into reserve region. */ - ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size, - PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr2 = mmap_(self, variant, &ptr_region[50 * page_size], 5 * page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr2, MAP_FAILED); vec[2].iov_base = ptr2; vec[2].iov_len = page_size; @@ -409,9 +512,8 @@ TEST_F(guard_regions, process_madvise) vec[3].iov_len = page_size;
/* 20 pages offset 79 pages into reserve region. */ - ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size, - PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr3 = mmap_(self, variant, &ptr_region[79 * page_size], 20 * page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr3, MAP_FAILED); vec[4].iov_base = ptr3; vec[4].iov_len = page_size; @@ -472,8 +574,8 @@ TEST_F(guard_regions, munmap) const unsigned long page_size = self->page_size; char *ptr, *ptr_new1, *ptr_new2;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
/* Guard first and last pages. */ @@ -489,11 +591,11 @@ TEST_F(guard_regions, munmap) ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
/* Map over them.*/ - ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_new1 = mmap_(self, variant, ptr, page_size, PROT_READ | PROT_WRITE, + MAP_FIXED, 0); ASSERT_NE(ptr_new1, MAP_FAILED); - ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_new2 = mmap_(self, variant, &ptr[9 * page_size], page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr_new2, MAP_FAILED);
/* Assert that they are now not guarded. */ @@ -511,8 +613,8 @@ TEST_F(guard_regions, mprotect) char *ptr; int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
/* Guard the middle of the range. */ @@ -559,8 +661,8 @@ TEST_F(guard_regions, split_merge) char *ptr, *ptr_new; int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
/* Guard the whole range. */ @@ -601,14 +703,14 @@ TEST_F(guard_regions, split_merge) }
/* Now map them again - the unmap will have cleared the guards. */ - ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_new = mmap_(self, variant, &ptr[2 * page_size], page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr_new, MAP_FAILED); - ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_new = mmap_(self, variant, &ptr[5 * page_size], page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr_new, MAP_FAILED); - ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE, - MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_new = mmap_(self, variant, &ptr[8 * page_size], page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 0); ASSERT_NE(ptr_new, MAP_FAILED);
/* Now make sure guard pages are established. */ @@ -690,8 +792,8 @@ TEST_F(guard_regions, dontneed) char *ptr; int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
/* Back the whole range. */ @@ -721,8 +823,16 @@ TEST_F(guard_regions, dontneed) ASSERT_FALSE(result); } else { ASSERT_TRUE(result); - /* Make sure we really did get reset to zero page. */ - ASSERT_EQ(*curr, '\0'); + switch (variant->backing) { + case ANON_BACKED: + /* If anon, then we get a zero page. */ + ASSERT_EQ(*curr, '\0'); + break; + default: + /* Otherwise, we get the file data. */ + ASSERT_EQ(*curr, 'y'); + break; + } }
/* Now write... */ @@ -743,8 +853,8 @@ TEST_F(guard_regions, mlock) char *ptr; int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
/* Populate. */ @@ -816,8 +926,8 @@ TEST_F(guard_regions, mremap_move) char *ptr, *ptr_new;
/* Map 5 pages. */ - ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 5 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
/* Place guard markers at both ends of the 5 page span. */ @@ -831,8 +941,7 @@ TEST_F(guard_regions, mremap_move) /* Map a new region we will move this range into. Doing this ensures * that we have reserved a range to map into. */ - ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE, - -1, 0); + ptr_new = mmap_(self, variant, NULL, 5 * page_size, PROT_NONE, 0, 0); ASSERT_NE(ptr_new, MAP_FAILED);
ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size, @@ -863,8 +972,8 @@ TEST_F(guard_regions, mremap_expand) char *ptr, *ptr_new;
/* Map 10 pages... */ - ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED); /* ...But unmap the last 5 so we can ensure we can expand into them. */ ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0); @@ -888,8 +997,7 @@ TEST_F(guard_regions, mremap_expand) ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
/* Reserve a region which we can move to and expand into. */ - ptr_new = mmap(NULL, 20 * page_size, PROT_NONE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr_new = mmap_(self, variant, NULL, 20 * page_size, PROT_NONE, 0, 0); ASSERT_NE(ptr_new, MAP_FAILED);
/* Now move and expand into it. */ @@ -927,8 +1035,8 @@ TEST_F(guard_regions, mremap_shrink) int i;
/* Map 5 pages. */ - ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_PRIVATE, -1, 0); + ptr = mmap_(self, variant, NULL, 5 * page_size, + PROT_READ | PROT_WRITE, 0, 0); ASSERT_NE(ptr, MAP_FAILED);
 	/* Place guard markers at both ends of the 5 page span. */
@@ -992,8 +1100,8 @@ TEST_F(guard_regions, fork)
 	int i;

 	/* Map 10 pages. */
-	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
-		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ptr = mmap_(self, variant, NULL, 10 * page_size,
+		    PROT_READ | PROT_WRITE, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);

 	/* Establish guard pages in the first 5 pages. */
@@ -1046,9 +1154,12 @@ TEST_F(guard_regions, fork_cow)
 	pid_t pid;
 	int i;

+	if (variant->backing != ANON_BACKED)
+		SKIP(return, "CoW only supported on anon mappings");
+
 	/* Map 10 pages. */
-	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
-		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ptr = mmap_(self, variant, NULL, 10 * page_size,
+		    PROT_READ | PROT_WRITE, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);

 	/* Populate range. */
@@ -1117,9 +1228,12 @@ TEST_F(guard_regions, fork_wipeonfork)
 	pid_t pid;
 	int i;

+	if (variant->backing != ANON_BACKED)
+		SKIP(return, "Wipe on fork only supported on anon mappings");
+
 	/* Map 10 pages. */
-	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
-		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ptr = mmap_(self, variant, NULL, 10 * page_size,
+		    PROT_READ | PROT_WRITE, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);

 	/* Mark wipe on fork. */
@@ -1166,9 +1280,12 @@ TEST_F(guard_regions, lazyfree)
 	char *ptr;
 	int i;

+	if (variant->backing != ANON_BACKED)
+		SKIP(return, "MADV_FREE only supported on anon mappings");
+
 	/* Map 10 pages. */
-	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
-		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ptr = mmap_(self, variant, NULL, 10 * page_size,
+		    PROT_READ | PROT_WRITE, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);

 	/* Guard range. */
@@ -1202,8 +1319,8 @@ TEST_F(guard_regions, populate)
 	char *ptr;

 	/* Map 10 pages. */
-	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
-		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ptr = mmap_(self, variant, NULL, 10 * page_size,
+		    PROT_READ | PROT_WRITE, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);

 	/* Guard range. */
@@ -1229,8 +1346,8 @@ TEST_F(guard_regions, cold_pageout)
 	int i;

 	/* Map 10 pages. */
-	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
-		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ptr = mmap_(self, variant, NULL, 10 * page_size,
+		    PROT_READ | PROT_WRITE, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);

 	/* Guard range. */
@@ -1281,6 +1398,9 @@ TEST_F(guard_regions, uffd)
 	struct uffdio_register reg;
 	struct uffdio_range range;

+	if (!is_anon_backed(variant))
+		SKIP(return, "uffd only works on anon backing");
+
 	/* Set up uffd. */
 	uffd = userfaultfd(0);
 	if (uffd == -1 && errno == EPERM)
@@ -1290,8 +1410,8 @@ TEST_F(guard_regions, uffd)
 	ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
/* Register the range with uffd. */
On 2/13/25 19:17, Lorenzo Stoakes wrote:
Extend the guard region tests to allow for test fixture variants for anon, shmem, and local file files.
This allows us to assert that each of the expected behaviours of anonymous memory also applies correctly to file-backed mappings (both shmem and a file created locally in the current working directory), and thus asserts the same correctness guarantees as all the remaining tests do.
The fixture teardown is now performed in the parent process rather than child forked ones, meaning cleanup is always performed, including unlinking any generated temporary files.
Additionally the variant fixture data type now contains an enum value indicating the type of backing store and the mmap() invocation is abstracted to allow for the mapping of whichever backing store the variant is testing.
We adjust tests as necessary to account for the fact they may now reference files rather than anonymous memory.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
Extend the guard region self tests to explicitly assert that guard regions work correctly for functionality specific to file-backed and shmem mappings.
In addition to testing all of the existing guard region functionality that is currently tested against anonymous mappings against file-backed and shmem mappings (except those which are exclusive to anonymous mappings), we now also:
* Test that MADV_SEQUENTIAL does not cause unexpected readahead behaviour.
* Test that MAP_PRIVATE behaves as expected with guard regions installed in
  both a shared and private mapping of an fd.
* Test that a read-only file can correctly establish guard regions.
* Test a probable fault-around case does not interfere with guard regions (or
  vice-versa).
* Test that truncation does not eliminate guard regions.
* Test that hole punching functions as expected in the presence of guard
  regions.
* Test that a read-only mapping of a memfd write sealed mapping can have guard
  regions established within it and function correctly without violation of
  the seal.
* Test that guard regions installed into a mapping of the anonymous zero page
  function correctly.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 tools/testing/selftests/mm/guard-regions.c | 595 +++++++++++++++++++++
 1 file changed, 595 insertions(+)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 0469c783f4fa..ea9b5815e828 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -216,6 +216,58 @@ static int open_file(const char *prefix, char *path)
 	return fd;
 }
+/* Establish a varying pattern in a buffer. */ +static void set_pattern(char *ptr, size_t num_pages, size_t page_size) +{ + size_t i; + + for (i = 0; i < num_pages; i++) { + char *ptr2 = &ptr[i * page_size]; + + memset(ptr2, 'a' + (i % 26), page_size); + } +} + +/* + * Check that a buffer contains the pattern set by set_pattern(), starting at a + * page offset of pgoff within the buffer. + */ +static bool check_pattern_offset(char *ptr, size_t num_pages, size_t page_size, + size_t pgoff) +{ + size_t i; + + for (i = 0; i < num_pages * page_size; i++) { + size_t offset = pgoff * page_size + i; + char actual = ptr[offset]; + char expected = 'a' + ((offset / page_size) % 26); + + if (actual != expected) + return false; + } + + return true; +} + +/* Check that a buffer contains the pattern set by set_pattern(). */ +static bool check_pattern(char *ptr, size_t num_pages, size_t page_size) +{ + return check_pattern_offset(ptr, num_pages, page_size, 0); +} + +/* Determine if a buffer contains only repetitions of a specified char. */ +static bool is_buf_eq(char *buf, size_t size, char chr) +{ + size_t i; + + for (i = 0; i < size; i++) { + if (buf[i] != chr) + return false; + } + + return true; +} + FIXTURE_SETUP(guard_regions) { self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); @@ -1437,4 +1489,547 @@ TEST_F(guard_regions, uffd) ASSERT_EQ(munmap(ptr, 10 * page_size), 0); }
+/* + * Mark a region within a file-backed mapping using MADV_SEQUENTIAL so we + * aggressively read-ahead, then install guard regions and assert that it + * behaves correctly. + * + * We page out using MADV_PAGEOUT before checking guard regions so we drop page + * cache folios, meaning we maximise the possibility of some broken readahead. + */ +TEST_F(guard_regions, madvise_sequential) +{ + char *ptr; + int i; + const unsigned long page_size = self->page_size; + + if (variant->backing == ANON_BACKED) + SKIP(return, "MADV_SEQUENTIAL meaningful only for file-backed"); + + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Establish a pattern of data in the file. */ + set_pattern(ptr, 10, page_size); + ASSERT_TRUE(check_pattern(ptr, 10, page_size)); + + /* Mark it as being accessed sequentially. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_SEQUENTIAL), 0); + + /* Mark every other page a guard page. */ + for (i = 0; i < 10; i += 2) { + char *ptr2 = &ptr[i * page_size]; + + ASSERT_EQ(madvise(ptr2, page_size, MADV_GUARD_INSTALL), 0); + } + + /* Now page it out. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0); + + /* Now make sure pages are as expected. */ + for (i = 0; i < 10; i++) { + char *chrp = &ptr[i * page_size]; + + if (i % 2 == 0) { + bool result = try_read_write_buf(chrp); + + ASSERT_FALSE(result); + } else { + ASSERT_EQ(*chrp, 'a' + i); + } + } + + /* Now remove guard pages. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0); + + /* Now make sure all data is as expected. */ + if (!check_pattern(ptr, 10, page_size)) + ASSERT_TRUE(false); + + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* + * Check that file-backed mappings implement guard regions with MAP_PRIVATE + * correctly. 
+ */ +TEST_F(guard_regions, map_private) +{ + const unsigned long page_size = self->page_size; + char *ptr_shared, *ptr_private; + int i; + + if (variant->backing == ANON_BACKED) + SKIP(return, "MAP_PRIVATE test specific to file-backed"); + + ptr_shared = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr_shared, MAP_FAILED); + + /* Manually mmap(), do not use mmap_() wrapper so we can force MAP_PRIVATE. */ + ptr_private = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, self->fd, 0); + ASSERT_NE(ptr_private, MAP_FAILED); + + /* Set pattern in shared mapping. */ + set_pattern(ptr_shared, 10, page_size); + + /* Install guard regions in every other page in the shared mapping. */ + for (i = 0; i < 10; i += 2) { + char *ptr = &ptr_shared[i * page_size]; + + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0); + } + + for (i = 0; i < 10; i++) { + /* Every even shared page should be guarded. */ + ASSERT_EQ(try_read_buf(&ptr_shared[i * page_size]), i % 2 != 0); + /* Private mappings should always be readable. */ + ASSERT_TRUE(try_read_buf(&ptr_private[i * page_size])); + } + + /* Install guard regions in every other page in the private mapping. */ + for (i = 0; i < 10; i += 2) { + char *ptr = &ptr_private[i * page_size]; + + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0); + } + + for (i = 0; i < 10; i++) { + /* Every even shared page should be guarded. */ + ASSERT_EQ(try_read_buf(&ptr_shared[i * page_size]), i % 2 != 0); + /* Every odd private page should be guarded. */ + ASSERT_EQ(try_read_buf(&ptr_private[i * page_size]), i % 2 != 0); + } + + /* Remove guard regions from shared mapping. */ + ASSERT_EQ(madvise(ptr_shared, 10 * page_size, MADV_GUARD_REMOVE), 0); + + for (i = 0; i < 10; i++) { + /* Shared mappings should always be readable. */ + ASSERT_TRUE(try_read_buf(&ptr_shared[i * page_size])); + /* Every even private page should be guarded. 
*/ + ASSERT_EQ(try_read_buf(&ptr_private[i * page_size]), i % 2 != 0); + } + + /* Remove guard regions from private mapping. */ + ASSERT_EQ(madvise(ptr_private, 10 * page_size, MADV_GUARD_REMOVE), 0); + + for (i = 0; i < 10; i++) { + /* Shared mappings should always be readable. */ + ASSERT_TRUE(try_read_buf(&ptr_shared[i * page_size])); + /* Private mappings should always be readable. */ + ASSERT_TRUE(try_read_buf(&ptr_private[i * page_size])); + } + + /* Ensure patterns are intact. */ + ASSERT_TRUE(check_pattern(ptr_shared, 10, page_size)); + ASSERT_TRUE(check_pattern(ptr_private, 10, page_size)); + + /* Now write out every other page to MAP_PRIVATE. */ + for (i = 0; i < 10; i += 2) { + char *ptr = &ptr_private[i * page_size]; + + memset(ptr, 'a' + i, page_size); + } + + /* + * At this point the mapping is: + * + * 0123456789 + * SPSPSPSPSP + * + * Where S = shared, P = private mappings. + */ + + /* Now mark the beginning of the mapping guarded. */ + ASSERT_EQ(madvise(ptr_private, 5 * page_size, MADV_GUARD_INSTALL), 0); + + /* + * This renders the mapping: + * + * 0123456789 + * xxxxxPSPSP + */ + + for (i = 0; i < 10; i++) { + char *ptr = &ptr_private[i * page_size]; + + /* Ensure guard regions as expected. */ + ASSERT_EQ(try_read_buf(ptr), i >= 5); + /* The shared mapping should always succeed. */ + ASSERT_TRUE(try_read_buf(&ptr_shared[i * page_size])); + } + + /* Remove the guard regions altogether. */ + ASSERT_EQ(madvise(ptr_private, 10 * page_size, MADV_GUARD_REMOVE), 0); + + /* + * + * We now expect the mapping to be: + * + * 0123456789 + * SSSSSPSPSP + * + * As we removed guard regions, the private pages from the first 5 will + * have been zapped, so on fault will reestablish the shared mapping. + */ + + for (i = 0; i < 10; i++) { + char *ptr = &ptr_private[i * page_size]; + + /* + * Assert that shared mappings in the MAP_PRIVATE mapping match + * the shared mapping. 
+ */ + if (i < 5 || i % 2 == 0) { + char *ptr_s = &ptr_shared[i * page_size]; + + ASSERT_EQ(memcmp(ptr, ptr_s, page_size), 0); + continue; + } + + /* Everything else is a private mapping. */ + ASSERT_TRUE(is_buf_eq(ptr, page_size, 'a' + i)); + } + + ASSERT_EQ(munmap(ptr_shared, 10 * page_size), 0); + ASSERT_EQ(munmap(ptr_private, 10 * page_size), 0); +} + +/* Test that guard regions established over a read-only mapping function correctly. */ +TEST_F(guard_regions, readonly_file) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + if (variant->backing == ANON_BACKED) + SKIP(return, "Read-only test specific to file-backed"); + + /* Map shared so we can populate with pattern, populate it, unmap. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + set_pattern(ptr, 10, page_size); + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); + /* Close the fd so we can re-open read-only. */ + ASSERT_EQ(close(self->fd), 0); + + /* Re-open read-only. */ + self->fd = open(self->path, O_RDONLY); + ASSERT_NE(self->fd, -1); + /* Re-map read-only. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Mark every other page guarded. */ + for (i = 0; i < 10; i += 2) { + char *ptr_pg = &ptr[i * page_size]; + + ASSERT_EQ(madvise(ptr_pg, page_size, MADV_GUARD_INSTALL), 0); + } + + /* Assert that the guard regions are in place.*/ + for (i = 0; i < 10; i++) { + char *ptr_pg = &ptr[i * page_size]; + + ASSERT_EQ(try_read_buf(ptr_pg), i % 2 != 0); + } + + /* Remove guard regions. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0); + + /* Ensure the data is as expected. 
*/ + ASSERT_TRUE(check_pattern(ptr, 10, page_size)); + + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +TEST_F(guard_regions, fault_around) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + if (variant->backing == ANON_BACKED) + SKIP(return, "Fault-around test specific to file-backed"); + + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Establish a pattern in the backing file. */ + set_pattern(ptr, 10, page_size); + + /* + * Now drop it from the page cache so we get major faults when next we + * map it. + */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0); + + /* Unmap and remap 'to be sure'. */ + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Now make every even page guarded. */ + for (i = 0; i < 10; i += 2) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0); + } + + /* Now fault in every odd page. This should trigger fault-around. */ + for (i = 1; i < 10; i += 2) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_buf(ptr_p)); + } + + /* Finally, ensure that guard regions are intact as expected. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_buf(ptr_p), i % 2 != 0); + } + + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +TEST_F(guard_regions, truncation) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + if (variant->backing == ANON_BACKED) + SKIP(return, "Truncation test specific to file-backed"); + + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Establish a pattern in the backing file, just so there is data + * there. + */ + set_pattern(ptr, 10, page_size); + + /* Now make every even page guarded. 
*/ + for (i = 0; i < 10; i += 2) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0); + } + + /* Now assert things are as expected. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_write_buf(ptr_p), i % 2 != 0); + } + + /* Now truncate to actually used size (initialised to 100). */ + ASSERT_EQ(ftruncate(self->fd, 10 * page_size), 0); + + /* Here the guard regions will remain intact. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_write_buf(ptr_p), i % 2 != 0); + } + + /* Now truncate to half the size, then truncate again to the full size. */ + ASSERT_EQ(ftruncate(self->fd, 5 * page_size), 0); + ASSERT_EQ(ftruncate(self->fd, 10 * page_size), 0); + + /* Again, guard pages will remain intact. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_write_buf(ptr_p), i % 2 != 0); + } + + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +TEST_F(guard_regions, hole_punch) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + if (variant->backing == ANON_BACKED) + SKIP(return, "Truncation test specific to file-backed"); + + /* Establish pattern in mapping. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, + PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + set_pattern(ptr, 10, page_size); + + /* Install a guard region in the middle of the mapping. */ + ASSERT_EQ(madvise(&ptr[3 * page_size], 4 * page_size, + MADV_GUARD_INSTALL), 0); + + /* + * The buffer will now be: + * + * 0123456789 + * ***xxxx*** + * + * Where * is data and x is the guard region. + */ + + /* Ensure established. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_buf(ptr_p), i < 3 || i >= 7); + } + + /* Now hole punch the guarded region. */ + ASSERT_EQ(madvise(&ptr[3 * page_size], 4 * page_size, + MADV_REMOVE), 0); + + /* Ensure guard regions remain. 
*/ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_buf(ptr_p), i < 3 || i >= 7); + } + + /* Now remove guard region throughout. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0); + + /* Check that the pattern exists in non-hole punched region. */ + ASSERT_TRUE(check_pattern(ptr, 3, page_size)); + /* Check that hole punched region is zeroed. */ + ASSERT_TRUE(is_buf_eq(&ptr[3 * page_size], 4 * page_size, '\0')); + /* Check that the pattern exists in the remainder of the file. */ + ASSERT_TRUE(check_pattern_offset(ptr, 3, page_size, 7)); + + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + +/* + * Ensure that a memfd works correctly with guard regions, that we can write + * seal it then open the mapping read-only and still establish guard regions + * within, remove those guard regions and have everything work correctly. + */ +TEST_F(guard_regions, memfd_write_seal) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + if (variant->backing != SHMEM_BACKED) + SKIP(return, "memfd write seal test specific to shmem"); + + /* OK, we need a memfd, so close existing one. */ + ASSERT_EQ(close(self->fd), 0); + + /* Create and truncate memfd. */ + self->fd = memfd_create("guard_regions_memfd_seals_test", + MFD_ALLOW_SEALING); + ASSERT_NE(self->fd, -1); + ASSERT_EQ(ftruncate(self->fd, 10 * page_size), 0); + + /* Map, set pattern, unmap. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + set_pattern(ptr, 10, page_size); + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); + + /* Write-seal the memfd. */ + ASSERT_EQ(fcntl(self->fd, F_ADD_SEALS, F_SEAL_WRITE), 0); + + /* Now map the memfd readonly. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Ensure pattern is as expected. */ + ASSERT_TRUE(check_pattern(ptr, 10, page_size)); + + /* Now make every even page guarded. 
*/ + for (i = 0; i < 10; i += 2) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0); + } + + /* Now assert things are as expected. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_buf(ptr_p), i % 2 != 0); + } + + /* Now remove guard regions. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0); + + /* Ensure pattern is as expected. */ + ASSERT_TRUE(check_pattern(ptr, 10, page_size)); + + /* Ensure write seal intact. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_FALSE(try_write_buf(ptr_p)); + } + + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + + +/* + * Since we are now permitted to establish guard regions in read-only anonymous + * mappings, for the sake of thoroughness, though it probably has no practical + * use, test that guard regions function with a mapping to the anonymous zero + * page. + */ +TEST_F(guard_regions, anon_zeropage) +{ + const unsigned long page_size = self->page_size; + char *ptr; + int i; + + if (!is_anon_backed(variant)) + SKIP(return, "anon zero page test specific to anon/shmem"); + + /* Obtain a read-only i.e. anon zero page mapping. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* Now make every even page guarded. */ + for (i = 0; i < 10; i += 2) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0); + } + + /* Now assert things are as expected. */ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_EQ(try_read_buf(ptr_p), i % 2 != 0); + } + + /* Now remove all guard regions. */ + ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0); + + /* Now assert things are as expected. 
*/ + for (i = 0; i < 10; i++) { + char *ptr_p = &ptr[i * page_size]; + + ASSERT_TRUE(try_read_buf(ptr_p)); + } + + /* Ensure zero page...*/ + ASSERT_TRUE(is_buf_eq(ptr, 10 * page_size, '\0')); + + ASSERT_EQ(munmap(ptr, 10 * page_size), 0); +} + TEST_HARNESS_MAIN
On 2/13/25 19:17, Lorenzo Stoakes wrote:
Extend the guard region self tests to explicitly assert that guard regions work correctly for functionality specific to file-backed and shmem mappings.
In addition to testing all of the existing guard region functionality that is currently tested against anonymous mappings against file-backed and shmem mappings (except those which are exclusive to anonymous mappings), we now also:
- Test that MADV_SEQUENTIAL does not cause unexpected readahead behaviour.
- Test that MAP_PRIVATE behaves as expected with guard regions installed in both a shared and private mapping of an fd.
- Test that a read-only file can correctly establish guard regions.
- Test a probable fault-around case does not interfere with guard regions (or vice-versa).
- Test that truncation does not eliminate guard regions.
- Test that hole punching functions as expected in the presence of guard regions.
- Test that a read-only mapping of a memfd write sealed mapping can have guard regions established within it and function correctly without violation of the seal.
- Test that guard regions installed into a mapping of the anonymous zero page function correctly.
Impressive.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
On 2/13/25 19:16, Lorenzo Stoakes wrote:
The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem.
This was done such as to introduce the feature carefully and incrementally and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones.
Now this feature has landed in 6.13, it is time to revisit this and to extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping.
Do you plan to address mlock later too? I guess somebody might find use for those. Is there some fundamental issue or just that we need to define some good semantics for corner cases? (i.e. if pages are already populated in the mlocked area, discarding them by replacing with guard pte's goes against that, so do we allow it or not?).
Otherwise nice discussion of all the potential issues here!
On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
On 2/13/25 19:16, Lorenzo Stoakes wrote:
The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem.
This was done such as to introduce the feature carefully and incrementally and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones.
Now this feature has landed in 6.13, it is time to revisit this and to extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping.
Do you plan to address mlock later too? I guess somebody might find use for those. Is there some fundamental issue or just that we need to define some good semantics for corner cases? (i.e. if pages are already populated in the mlocked area, discarding them by replacing with guard pte's goes against that, so do we allow it or not?).
Yeah that's the fundamental issue with mlock, it does not interact with the zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked _afterwards_ there will be no zapping).
We could potentially expose an equivalent, as there are for other flags, a _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make this explicit.
That is probably the most sensible option, if there is a need for this!
Otherwise nice discussion of all the potential issues here!
Thanks! :)
On 18.02.25 14:05, Lorenzo Stoakes wrote:
On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
On 2/13/25 19:16, Lorenzo Stoakes wrote:
The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem.
This was done such as to introduce the feature carefully and incrementally and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones.
Now this feature has landed in 6.13, it is time to revisit this and to extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping.
Do you plan to address mlock later too? I guess somebody might find use for those. Is there some fundamental issue or just that we need to define some good semantics for corner cases? (i.e. if pages are already populated in the mlocked area, discarding them by replacing with guard pte's goes against that, so do we allow it or not?).
Yeah that's the fundamental issue with mlock, it does not interact with the zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked _afterwards_ there will be no zapping).
We could potentially expose an equivalent, as there are for other flags, a _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make this explicit.
That is probably the most sensible option, if there is a need for this!
mlock is weird, because it assumes that memory will be faulted in in the whole VMA.
You'd likely have to populate + mlock the page when removing the guard. Also not sure what happens if one does an mlock()/mlockall() after already installing PTE markers.
__mm_populate() would skip whole VMAs in case populate_vma_page_range() fails. And I would assume populate_vma_page_range() fails on the first guard when it triggers a page fault.
OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where MADV_DONTNEED_LOCKED originates from:
commit 9457056ac426e5ed0671356509c8dcce69f8dee0
Author: Johannes Weiner hannes@cmpxchg.org
Date:   Thu Mar 24 18:14:12 2022 -0700
mm: madvise: MADV_DONTNEED_LOCKED
MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT and MCL_ONFAULT allowing to mlock without populating, there are valid use cases for depopulating locked ranges as well.
Adding support for that would be indeed nice.
On Tue, Feb 18, 2025 at 03:35:19PM +0100, David Hildenbrand wrote:
On 18.02.25 14:05, Lorenzo Stoakes wrote:
On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
On 2/13/25 19:16, Lorenzo Stoakes wrote:
The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem.
This was done such as to introduce the feature carefully and incrementally and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones.
Now this feature has landed in 6.13, it is time to revisit this and to extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping.
Do you plan to address mlock later too? I guess somebody might find use for those. Is there some fundamental issue or just that we need to define some good semantics for corner cases? (i.e. if pages are already populated in the mlocked area, discarding them by replacing with guard pte's goes against that, so do we allow it or not?).
Yeah that's the fundamental issue with mlock, it does not interact with the zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked _afterwards_ there will be no zapping).
We could potentially expose an equivalent, as there are for other flags, a _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make this explicit.
That is probably the most sensible option, if there is a need for this!
mlock is weird, because it assumes that memory will be faulted in in the whole VMA.
You'd likely have to populate + mlock the page when removing the guard.
Right yeah that'd be super weird. And I don't want to add that logic.
Also not sure what happens if one does an mlock()/mlockall() after already installing PTE markers.
The existing logic already handles non-present cases by skipping them, in mlock_pte_range():
	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
		ptent = ptep_get(pte);
		if (!pte_present(ptent))
			continue;

		...
	}
Which covers off guard regions. Removing the guard regions after this will leave you in a weird situation where these entries will be zapped... maybe we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case also populate?
Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE if you since locked the range? The code currently allows this on the proviso that 'you aren't zapping locked mappings' but leaves the VMA in a state such that some entries would not be locked.
It'd be pretty weird to lock guard regions like this.
Having said all that, given what you say below, maybe it's not an issue after all?...
__mm_populate() would skip whole VMAs in case populate_vma_page_range() fails. And I would assume populate_vma_page_range() fails on the first guard when it triggers a page fault.
OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where MADV_DONTNEED_LOCKED originates from:
commit 9457056ac426e5ed0671356509c8dcce69f8dee0
Author: Johannes Weiner hannes@cmpxchg.org
Date:   Thu Mar 24 18:14:12 2022 -0700

    mm: madvise: MADV_DONTNEED_LOCKED

    MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
    and MCL_ONFAULT allowing to mlock without populating, there are valid use
    cases for depopulating locked ranges as well.
...Hm this seems to imply the current guard remove stuff isn't quite so bad, so maybe the assumption that VM_LOCKED implies 'everything is populated' isn't quite as stringent then.
The restriction is as simple as:
	if (behavior != MADV_DONTNEED_LOCKED)
		forbidden |= VM_LOCKED;
Adding support for that would be indeed nice.
I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED locked ranges, but I really don't understand why you'd want to add guard regions to mlock()'ed regions?
Then again we're currently asymmetric as you can add them _before_ mlock()'ing...
-- Cheers,
David / dhildenb
Right yeah that'd be super weird. And I don't want to add that logic.
Also not sure what happens if one does an mlock()/mlockall() after already installing PTE markers.
The existing logic already handles non-present cases by skipping them, in mlock_pte_range():
	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
		ptent = ptep_get(pte);
		if (!pte_present(ptent))
			continue;

		...
	}
I *think* that code only updates already-mapped folios, to properly call mlock_folio()/munlock_folio().
It is not the code that populates pages on mlock()/mlockall(). I think all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should apply.
See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
Which covers off guard regions. Removing the guard regions after this will leave you in a weird situation where these entries will be zapped... maybe we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case also populate?
Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE if you have since locked the range? The code currently allows this on the proviso that 'you aren't zapping locked mappings', but leaves the VMA in a state such that some entries would not be locked.
It'd be pretty weird to lock guard regions like this.
Having said all that, given what you say below, maybe it's not an issue after all?...
__mm_populate() would skip whole VMAs in case populate_vma_page_range() fails. And I would assume populate_vma_page_range() fails on the first guard when it triggers a page fault.
OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where MADV_DONTNEED_LOCKED originates from:
commit 9457056ac426e5ed0671356509c8dcce69f8dee0
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Mar 24 18:14:12 2022 -0700

    mm: madvise: MADV_DONTNEED_LOCKED

    MADV_DONTNEED historically rejects mlocked ranges, but with
    MLOCK_ONFAULT and MCL_ONFAULT allowing to mlock without populating,
    there are valid use cases for depopulating locked ranges as well.
...Hm this seems to imply the current guard remove stuff isn't quite so bad, so maybe the assumption that VM_LOCKED implies 'everything is populated' isn't quite as stringent then.
Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is that everything is populated (unless, apparently one uses MADV_DONTNEED_LOCKED or population failed, maybe).
VM_LOCKONFAULT seems to be the sane case. I wonder why MADV_DONTNEED_LOCKED didn't explicitly check for that one ... maybe there is a history to that.
The restriction is as simple as:
if (behavior != MADV_DONTNEED_LOCKED)
	forbidden |= VM_LOCKED;
Adding support for that would be indeed nice.
I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED locked ranges, but I really don't understand why you'd want to add guard regions to mlock()'ed regions?
Some apps use mlockall(), and it might be nice to just be able to use guard pages as if "Nothing happened".
E.g., QEMU has the option to use mlockall().
Then again we're currently asymmetric as you can add them _before_ mlock()'ing...
Right.
On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
Right yeah that'd be super weird. And I don't want to add that logic.
Also not sure what happens if one does an mlock()/mlockall() after already installing PTE markers.
The existing logic already handles non-present cases by skipping them, in mlock_pte_range():
for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
	ptent = ptep_get(pte);
	if (!pte_present(ptent))
		continue;
	...
}
I *think* that code only updates already-mapped folios, to properly call mlock_folio()/munlock_folio().
Guard regions _are_ 'already mapped' :) so it leaves them in place.
do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range() implies this will be invoked.
It is not the code that populates pages on mlock()/mlockall(). I think all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should apply.
OK I want to correct what I said earlier.
Installing a guard region then attempting mlock() will result in an error. The populate will -EFAULT and stop at the guard region, which causes mlock() to error out.
This is a partial failure, so the VMA is split and has VM_LOCKED applied, but the populate halts at the guard region.
This is ok as per previous discussion on aggregate operation failure, there can be no expectation of 'unwinding' of partially successful operations that form part of a requested aggregate one.
However, given there's stuff to clean up, and on error a user _may_ wish to then remove guard regions and try again, I guess there's no harm in keeping the code as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly' yeah wrongly, very wrongly...
Which covers off guard regions. Removing the guard regions after this will leave you in a weird situation where these entries will be zapped... maybe we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case also populate?
Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
See above, no we should not :P this is only good for cleanup after mlock() failure, although no sane program should really be trying to do this, a sane program would give up here (and it's a _programmatic error_ to try to mlock() a range with guard regions).
Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE if you have since locked the range? The code currently allows this on the proviso that 'you aren't zapping locked mappings', but leaves the VMA in a state such that some entries would not be locked.
It'd be pretty weird to lock guard regions like this.
Having said all that, given what you say below, maybe it's not an issue after all?...
__mm_populate() would skip whole VMAs in case populate_vma_page_range() fails. And I would assume populate_vma_page_range() fails on the first guard when it triggers a page fault.
OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where MADV_DONTNEED_LOCKED originates from:
commit 9457056ac426e5ed0671356509c8dcce69f8dee0
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Mar 24 18:14:12 2022 -0700

    mm: madvise: MADV_DONTNEED_LOCKED

    MADV_DONTNEED historically rejects mlocked ranges, but with
    MLOCK_ONFAULT and MCL_ONFAULT allowing to mlock without populating,
    there are valid use cases for depopulating locked ranges as well.
...Hm this seems to imply the current guard remove stuff isn't quite so bad, so maybe the assumption that VM_LOCKED implies 'everything is populated' isn't quite as stringent then.
Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is that everything is populated (unless, apparently one uses MADV_DONTNEED_LOCKED or population failed, maybe).
VM_LOCKONFAULT seems to be the sane case. I wonder why MADV_DONTNEED_LOCKED didn't explicitly check for that one ... maybe there is a history to that.
Yeah weird.
The restriction is as simple as:
if (behavior != MADV_DONTNEED_LOCKED)
	forbidden |= VM_LOCKED;
Adding support for that would be indeed nice.
I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED locked ranges, but I really don't understand why you'd want to add guard regions to mlock()'ed regions?
Some apps use mlockall(), and it might be nice to just be able to use guard pages as if "Nothing happened".
Sadly I think not given above :P
E.g., QEMU has the option to use mlockall().
Then again we're currently asymmetric as you can add them _before_ mlock()'ing...
Right.
-- Cheers,
David / dhildenb
I think the _LOCKED idea is therefore kaput, because it just won't work properly: populating guard regions fails.
It fails because it tries to 'touch' the memory, but 'touching' guard region memory causes a segfault. This kind of breaks the idea of mlock()'ing guard regions.
I think adding workarounds to make this possible in any way is not really worth it (and would probably be pretty gross).
We already document that 'mlock()ing lightweight guard regions will fail' as per man page so this is all in line with that.
On 18.02.25 17:43, Lorenzo Stoakes wrote:
On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
Right yeah that'd be super weird. And I don't want to add that logic.
Also not sure what happens if one does an mlock()/mlockall() after already installing PTE markers.
The existing logic already handles non-present cases by skipping them, in mlock_pte_range():
for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
	ptent = ptep_get(pte);
	if (!pte_present(ptent))
		continue;
	...
}
I *think* that code only updates already-mapped folios, to properly call mlock_folio()/munlock_folio().
Guard regions _are_ 'already mapped' :) so it leaves them in place.
"mapped folios" -- there is no folio mapped. Yes, the VMA is in place.
do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range() implies this will be invoked.
Yes, to process any already mapped folios, to then continue population later.
It is not the code that populates pages on mlock()/mlockall(). I think all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should apply.
OK I want to correct what I said earlier.
Installing a guard region then attempting mlock() will result in an error. The populate will -EFAULT and stop at the guard region, which causes mlock() to error out.
Right, that's my expectation.
This is a partial failure, so the VMA is split and has VM_LOCKED applied, but the populate halts at the guard region.
This is ok as per previous discussion on aggregate operation failure, there can be no expectation of 'unwinding' of partially successful operations that form part of a requested aggregate one.
However, given there's stuff to clean up, and on error a user _may_ wish to then remove guard regions and try again, I guess there's no harm in keeping the code as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
Likely yes; it's all weird code.
See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly' yeah wrongly, very wrongly...
Which covers off guard regions. Removing the guard regions after this will leave you in a weird situation where these entries will be zapped... maybe we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case also populate?
Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
See above, no we should not :P this is only good for cleanup after mlock() failure, although no sane program should really be trying to do this, a sane program would give up here (and it's a _programmatic error_ to try to mlock() a range with guard regions).
Some apps use mlockall(), and it might be nice to just be able to use guard pages as if "Nothing happened".
Sadly I think not given above :P
QEMU, for example, will, when requested, issue an mlockall(MCL_CURRENT | MCL_FUTURE), and then exit() if it fails.
Assume glibc or any lib uses it; QEMU would have no real way of figuring that out or instructing offending libraries to disable that, at least for now ...
... turning RT VMs less usable if any library uses guard regions. :(
There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).
[1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
E.g., QEMU has the option to use mlockall().
Then again we're currently asymmetric as you can add them _before_ mlock()'ing...
Right.
-- Cheers,
David / dhildenb
I think the _LOCKED idea is therefore kaput, because it just won't work properly: populating guard regions fails.
Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might be interesting, because we are skipping the population stage.
It fails because it tries to 'touch' the memory, but 'touching' guard region memory causes a segfault. This kind of breaks the idea of mlock()'ing guard regions.
I think adding workarounds to make this possible in any way is not really worth it (and would probably be pretty gross).
We already document that 'mlock()ing lightweight guard regions will fail' as per man page so this is all in line with that.
Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as allowing install/remove of guard regions when that flag is set.
On Tue, Feb 18, 2025 at 06:14:00PM +0100, David Hildenbrand wrote:
On 18.02.25 17:43, Lorenzo Stoakes wrote:
On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
Right yeah that'd be super weird. And I don't want to add that logic.
Also not sure what happens if one does an mlock()/mlockall() after already installing PTE markers.
The existing logic already handles non-present cases by skipping them, in mlock_pte_range():
for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
	ptent = ptep_get(pte);
	if (!pte_present(ptent))
		continue;
	...
}
I *think* that code only updates already-mapped folios, to properly call mlock_folio()/munlock_folio().
Guard regions _are_ 'already mapped' :) so it leaves them in place.
"mapped folios" -- there is no folio mapped. Yes, the VMA is in place.
We're engaging in a moot discussion on this I think, but I mean it appears to operate by walking page tables if they are populated (which they will be for guard regions), and skipping any entries it finds to be non-present.
This amounts to the same thing as not doing anything, obviously.
do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range() implies this will be invoked.
Yes, to process any already mapped folios, to then continue population later.
It is not the code that populates pages on mlock()/mlockall(). I think all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should apply.
OK I want to correct what I said earlier.
Installing a guard region then attempting mlock() will result in an error. The populate will -EFAULT and stop at the guard region, which causes mlock() to error out.
Right, that's my expectation.
OK good!
This is a partial failure, so the VMA is split and has VM_LOCKED applied, but the populate halts at the guard region.
This is ok as per previous discussion on aggregate operation failure, there can be no expectation of 'unwinding' of partially successful operations that form part of a requested aggregate one.
However, given there's stuff to clean up, and on error a user _may_ wish to then remove guard regions and try again, I guess there's no harm in keeping the code as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
Likely yes; it's all weird code.
See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly' yeah wrongly, very wrongly...
Which covers off guard regions. Removing the guard regions after this will leave you in a weird situation where these entries will be zapped... maybe we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case also populate?
Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
See above, no we should not :P this is only good for cleanup after mlock() failure, although no sane program should really be trying to do this, a sane program would give up here (and it's a _programmatic error_ to try to mlock() a range with guard regions).
Some apps use mlockall(), and it might be nice to just be able to use guard pages as if "Nothing happened".
Sadly I think not given above :P
QEMU, for example, will, when requested, issue an mlockall(MCL_CURRENT | MCL_FUTURE), and then exit() if it fails.
Hm under what circumstances? I use qemu extensively to test this stuff with no issues. Unless you mean it's using it in the 'host' code somehow.
Assume glibc or any lib uses it; QEMU would have no real way of figuring that out or instructing offending libraries to disable that, at least for now ...
... turning RT VMs less usable if any library uses guard regions. :(
This seems really stupid, to be honest. Unfortunately there's no way around this, if software does stupid things then they get stupid prizes. There are other ways mlock() and faulting in can fail too.
There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).
Good.
[1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
E.g., QEMU has the option to use mlockall().
Then again we're currently asymmetric as you can add them _before_ mlock()'ing...
Right.
-- Cheers,
David / dhildenb
I think the _LOCKED idea is therefore kaput, because it just won't work properly: populating guard regions fails.
Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might be interesting, because we are skipping the population stage.
It fails because it tries to 'touch' the memory, but 'touching' guard region memory causes a segfault. This kind of breaks the idea of mlock()'ing guard regions.
I think adding workarounds to make this possible in any way is not really worth it (and would probably be pretty gross).
We already document that 'mlock()ing lightweight guard regions will fail' as per man page so this is all in line with that.
Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as allowing install/remove of guard regions when that flag is set.
We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we disallow.
-- Cheers,
David / dhildenb
QEMU, for example, will, when requested, issue an mlockall(MCL_CURRENT | MCL_FUTURE), and then exit() if it fails.
Hm under what circumstances? I use qemu extensively to test this stuff with no issues. Unless you mean it's using it in the 'host' code somehow.
-overcommit mem-lock=on
or (legacy)
-realtime mlock=on
I think.
[...]
It fails because it tries to 'touch' the memory, but 'touching' guard region memory causes a segfault. This kind of breaks the idea of mlock()'ing guard regions.
I think adding workarounds to make this possible in any way is not really worth it (and would probably be pretty gross).
We already document that 'mlock()ing lightweight guard regions will fail' as per man page so this is all in line with that.
Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as allowing install/remove of guard regions when that flag is set.
We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we disallow.
See mlock2();
SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
	vm_flags_t vm_flags = VM_LOCKED;

	if (flags & ~MLOCK_ONFAULT)
		return -EINVAL;

	if (flags & MLOCK_ONFAULT)
		vm_flags |= VM_LOCKONFAULT;

	return do_mlock(start, len, vm_flags);
}
VM_LOCKONFAULT always has VM_LOCKED set as well.
On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote:
QEMU, for example, will, when requested, issue an mlockall(MCL_CURRENT | MCL_FUTURE), and then exit() if it fails.
Hm under what circumstances? I use qemu extensively to test this stuff with no issues. Unless you mean it's using it in the 'host' code somehow.
-overcommit mem-lock=on
or (legacy)
-realtime mlock=on
I think.
[...]
Thanks
It fails because it tries to 'touch' the memory, but 'touching' guard region memory causes a segfault. This kind of breaks the idea of mlock()'ing guard regions.
I think adding workarounds to make this possible in any way is not really worth it (and would probably be pretty gross).
We already document that 'mlock()ing lightweight guard regions will fail' as per man page so this is all in line with that.
Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as allowing install/remove of guard regions when that flag is set.
We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we disallow.
See mlock2();
SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
	vm_flags_t vm_flags = VM_LOCKED;

	if (flags & ~MLOCK_ONFAULT)
		return -EINVAL;

	if (flags & MLOCK_ONFAULT)
		vm_flags |= VM_LOCKONFAULT;

	return do_mlock(start, len, vm_flags);
}
VM_LOCKONFAULT always has VM_LOCKED set as well.
OK cool, that makes sense.
As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew again, then... :P if only somebody would write it down in a book...
Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT) in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively excluded right now.
-- Cheers,
David / dhildenb
See mlock2();
SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
	vm_flags_t vm_flags = VM_LOCKED;

	if (flags & ~MLOCK_ONFAULT)
		return -EINVAL;

	if (flags & MLOCK_ONFAULT)
		vm_flags |= VM_LOCKONFAULT;

	return do_mlock(start, len, vm_flags);
}
VM_LOCKONFAULT always has VM_LOCKED set as well.
OK cool, that makes sense.
As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew again, then... :P if only somebody would write it down in a book...
And if there would be such a book in the makings, I would consider hurrying up and preordering it ;)
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem.
This was done so as to introduce the feature carefully and incrementally, and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones.
Now that this feature has landed in 6.13, it is time to revisit it and extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping.
Hi Lorenzo,
Thank you for your work on this.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
-- Kalesh
It is safe to permit the establishment of guard regions in read-only mappings because the guard regions only reduce access to the mapping, and when removed simply reinstate the existing attributes of the underlying VMA, meaning no access violations can occur.
While the change in kernel code introduced in this series is small, the majority of the effort here is spent in extending the testing to assert that the feature works correctly across numerous file-backed mapping scenarios.
Every single guard region self-test previously performed against anonymous memory (that is, every test which applies to file-backed memory and is not anon-only) has now been updated to also be performed against shmem and a mapping of a file in the working directory.
This confirms that all cases also function correctly for file-backed guard regions.
In addition a number of other tests are added for specific file-backed mapping scenarios.
There are a number of other concerns that one might have with regard to guard regions, addressed below:
Readahead
~~~~~~~~~

Readahead is a process through which the page cache is populated on the assumption that sequential reads will occur, thus amortising I/O and, through a clever use of the PG_readahead folio flag established during major fault and checked upon minor fault, provides for asynchronous I/O to occur as data is processed, reducing I/O stalls as data is faulted in.

Guard regions do not alter this mechanism, which operates at the folio and fault level, but do of course prevent the faulting of folios that would otherwise be mapped.

In the instance of a major fault prior to a guard region, synchronous readahead will occur, including the population of folios in the page cache which the guard regions will, in the case of the mapping in question, prevent access to.

In addition, if PG_readahead is placed in a folio that is now inaccessible, this will prevent asynchronous readahead from occurring as it otherwise would. However, there are mechanisms for heuristically resetting this within readahead regardless, which will 'recover' correct readahead behaviour.

Readahead presumes sequential data access; the presence of a guard region clearly indicates that, at least within the guard region, no such sequential access will occur, as it cannot occur there. So this should have very little impact on any real workload.

The far more important point is whether readahead causes incorrect or inappropriate mapping of ranges disallowed by the presence of guard regions - it does not, as readahead does not 'pre-fault' memory in this fashion. At any rate, any mechanism which attempted to do so would hit the usual page fault paths, which correctly handle PTE markers as with anonymous mappings.

Fault-Around
~~~~~~~~~~~~
The fault-around logic, in a similar vein to readahead, attempts to improve efficiency with regard to file-backed memory mappings, however it differs in that it does not try to fetch folios into the page cache that are about to be accessed, but rather pre-maps a range of folios around the faulting address.
Guard regions making use of PTE markers makes this relatively trivial, as this case is already handled - see filemap_map_folio_range() and filemap_map_order0_folio() - in both instances, the solution is to simply keep the established page table mappings and let the fault handler take care of PTE markers, as per the comment:
/*
 * NOTE: If there're PTE markers, we'll leave them to be
 * handled in the specific fault path, and it'll prohibit
 * the fault-around logic.
 */
This works, as establishing guard regions results in page table mappings with PTE markers, and clearing them removes them.
Truncation
~~~~~~~~~~

File truncation will not eliminate existing guard regions, as the truncation operation will ultimately zap the range via unmap_mapping_range(), which specifically excludes PTE markers.

Zapping
~~~~~~~

Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), which specifically deals with guard entries, leaving them intact except in instances such as process teardown or munmap() where they need to be removed.

Reclaim
~~~~~~~

When reclaim is performed on file-backed folios, it ultimately invokes try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() will ultimately abort the operation for the guard region mapping. If large, then check_pte() will determine that this is a non-device private entry/device-exclusive entry 'swap' PTE and thus abort the operation in that instance.

Therefore, no odd things happen in the instance of reclaim being attempted upon a file-backed guard region.

Hole Punching
~~~~~~~~~~~~~
This updates the page cache and ultimately invokes unmap_mapping_range(), which explicitly leaves PTE markers in place.
Because the establishment of guard regions zapped any existing mappings to file-backed folios, once the guard regions are removed then the hole-punched region will be faulted in as usual and everything will behave as expected.
Lorenzo Stoakes (4):
  mm: allow guard regions in file-backed and read-only mappings
  selftests/mm: rename guard-pages to guard-regions
  tools/selftests: expand all guard region tests to file-backed
  tools/selftests: add file/shmem-backed mapping guard region tests

 mm/madvise.c                                  |   8 +-
 tools/testing/selftests/mm/.gitignore         |   2 +-
 tools/testing/selftests/mm/Makefile           |   2 +-
 .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
 4 files changed, 821 insertions(+), 112 deletions(-)
 rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
-- 2.48.1
On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh kaleshsingh@google.com wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem.
This was done so as to introduce the feature carefully and incrementally, and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones.
Now that this feature has landed in 6.13, it is time to revisit it and extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping.
Hi Lorenzo,
Thank you for your work on this.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
To clarify why the applications may not be aware of their guard regions -- in the case of the ELF mappings these PROT_NONE (guard regions) would be installed by the dynamic loader; or may be inherited from the parent (zygote in Android's case).
-- Kalesh
It is permissible to permit the establishment of guard regions in read-only mappings because the guard regions only reduce access to the mapping, and when removed simply reinstate the existing attributes of the underlying VMA, meaning no access violations can occur.
While the change in kernel code introduced in this series is small, the majority of the effort here is spent in extending the testing to assert that the feature works correctly across numerous file-backed mapping scenarios.
Every single guard region self-test performed against anonymous memory (which is relevant and not anon-only) has now been updated to also be performed against shmem and a mapping of a file in the working directory.
This confirms that all cases also function correctly for file-backed guard regions.
In addition a number of other tests are added for specific file-backed mapping scenarios.
There are a number of other concerns that one might have with regard to guard regions, addressed below:
Readahead
Readahead is a process through which the page cache is populated on the assumption that sequential reads will occur, thus amortising I/O and, through a clever use of the PG_readahead folio flag establishing during major fault and checked upon minor fault, provides for asynchronous I/O to occur as dat is processed, reducing I/O stalls as data is faulted in. Guard regions do not alter this mechanism which operations at the folio and fault level, but do of course prevent the faulting of folios that would otherwise be mapped. In the instance of a major fault prior to a guard region, synchronous readahead will occur including populating folios in the page cache which the guard regions will, in the case of the mapping in question, prevent access to. In addition, if PG_readahead is placed in a folio that is now inaccessible, this will prevent asynchronous readahead from occurring as it would otherwise do. However, there are mechanisms for heuristically resetting this within readahead regardless, which will 'recover' correct readahead behaviour. Readahead presumes sequential data access, the presence of a guard region clearly indicates that, at least in the guard region, no such sequential access will occur, as it cannot occur there. So this should have very little impact on any real workload. The far more important point is as to whether readahead causes incorrect or inappropriate mapping of ranges disallowed by the presence of guard regions - this is not the case, as readahead does not 'pre-fault' memory in this fashion. At any rate, any mechanism which would attempt to do so would hit the usual page fault paths, which correctly handle PTE markers as with anonymous mappings. Fault-Around
The fault-around logic, in a similar vein to readahead, attempts to improve efficiency with regard to file-backed memory mappings, however it differs in that it does not try to fetch folios into the page cache that are about to be accessed, but rather pre-maps a range of folios around the faulting address.
Guard regions making use of PTE markers makes this relatively trivial, as this case is already handled - see filemap_map_folio_range() and filemap_map_order0_folio() - in both instances, the solution is to simply keep the established page table mappings and let the fault handler take care of PTE markers, as per the comment:
	/*
	 * NOTE: If there're PTE markers, we'll leave them to be
	 * handled in the specific fault path, and it'll prohibit
	 * the fault-around logic.
	 */
This works, as establishing guard regions results in page table mappings with PTE markers, and clearing them removes them.
Truncation
~~~~~~~~~~
File truncation will not eliminate existing guard regions, as the truncation operation will ultimately zap the range via unmap_mapping_range(), which specifically excludes PTE markers.

Zapping
~~~~~~~

Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), which specifically deals with guard entries, leaving them intact except in instances such as process teardown or munmap() where they need to be removed.

Reclaim
~~~~~~~

When reclaim is performed on file-backed folios, it ultimately invokes try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() will ultimately abort the operation for the guard region mapping. If large, then check_pte() will determine that this is a non-device private entry/device-exclusive entry 'swap' PTE and thus abort the operation in that instance.

Therefore, no odd things happen in the instance of reclaim being attempted upon a file-backed guard region.

Hole Punching
~~~~~~~~~~~~~
This updates the page cache and ultimately invokes unmap_mapping_range(), which explicitly leaves PTE markers in place.
Because the establishment of guard regions zapped any existing mappings to file-backed folios, once the guard regions are removed then the hole-punched region will be faulted in as usual and everything will behave as expected.
Lorenzo Stoakes (4):
  mm: allow guard regions in file-backed and read-only mappings
  selftests/mm: rename guard-pages to guard-regions
  tools/selftests: expand all guard region tests to file-backed
  tools/selftests: add file/shmem-backed mapping guard region tests
 mm/madvise.c                                  |   8 +-
 tools/testing/selftests/mm/.gitignore         |   2 +-
 tools/testing/selftests/mm/Makefile           |   2 +-
 .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
 4 files changed, 821 insertions(+), 112 deletions(-)
 rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
-- 2.48.1
On Wed, Feb 19, 2025 at 12:35:12AM -0800, Kalesh Singh wrote:
On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh kaleshsingh@google.com wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Hi Lorenzo,
Thank you for your work on this.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
To clarify why the applications may not be aware of their guard regions -- in the case of the ELF mappings these PROT_NONE (guard regions) would be installed by the dynamic loader; or may be inherited from the parent (zygote in Android's case).
I'm still really confused as to how you propose to use guard regions in this scenario. If it's trying to 'clone' some set of mappings somehow then guard regions are just not compatible with your use case at all.
I'm not sure what would be honestly... you maybe could find a way to modify the loader to interpret PROT_NONE mappings as a cue to use guard regions instead if that worked?
You can't both not have a VMA (something that describes a mapping with common attributes) and have page table state along with somehow having something that describes a mapping with said page table state :)
It's a sort of non-such-beast.
Everything is a trade-off; guard regions provide the option to retain state regarding faulting behaviour in page tables rather than the VMAs, entirely opt-in and at the behest of the user.
Now extended to file-backed and shmem!
While I hoped it would help your scenario based on your talk at LPC, this is very clearly an important change for guard regions (for shmem most notably) and removes an important limitation and exposes the possibility of using it for things such as ELF mappings for those who don't also require specific [s]maps state (which explicitly NEEDS VMA metadata state).
I suspect you can find a way of making this work in your scenario though.
An alternative if you're willing to wait for page table state to be acquired is to expose a new /proc/$pid/... endpoint that explicitly outputs guard regions.
Would this solve your problem...? I had always considered implementing this if there was a demand.
Maybe /proc/$pid/gmaps? :P 'g' could mean guard regions or... ;)
* Kalesh Singh kaleshsingh@google.com [250219 03:35]:
On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh kaleshsingh@google.com wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Hi Lorenzo,
Thank you for your work on this.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
To clarify why the applications may not be aware of their guard regions -- in the case of the ELF mappings these PROT_NONE (guard regions) would be installed by the dynamic loader; or may be inherited from the parent (zygote in Android's case).
The guard regions are a method to reduce vma count and speed up guard installation and removal. This very much will change the proc interface by its design, since it's removing information from the maps for the advantage of speed and less resources. I thought that was pretty clear from the start.
The proc interface is also not a very good way to programmatically get this information, as I'm sure you are aware. I'm also sure you know that reading that file takes the mmap read lock, and tearing can occur.
I think this implies you are taking a lot of time to get the information you want in the way you are getting it (parsing a text file, and not doing any mmap write work)?
If this is a common pattern, I think we need a better interface to query this type of information. We have an ioctl going in for getting vma information, but that will lack these new guard regions as well.
David mentioned /proc/self/pagemap - that's certainly worth exploring, and doesn't involve text parsing.
If it's not that common, then maybe your zygote/loader can communicate the information in a way that does not involve read locking the programs vm area?
Are you really parsing the same library information for each application launched to find the guards instead of asking or being told what they are? I must be missing something..
Thanks, Liam
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Hi Lorenzo,
Thank you for your work on this.
You're welcome.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
This is off-topic here but... Yes, extensively. No they do not appear there.
I thought you had attended LPC and my talk where I mentioned this purposefully as a drawback?
I went out of my way to advertise this limitation at the LPC talk, in the original series, etc. so it's a little disappointing that this is being brought up so late, but nobody else has raised objections to this issue so I think in general it's not a limitation that matters in practice.
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
No, we cannot have these, sorry.
Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this feature is to avoid having to accumulate VMAs for regions which are not intended to be accessible.
Secondly, there is no practical means for this to be accomplished in /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates they have guard regions.
This is intentional, because setting such metadata is simply not practical - why? Because when you try to split the VMA, how do you know which bit gets the metadata and which doesn't? You can't without _reading page tables_.
/proc/$pid/smaps _does_ read page tables, but we can't start pretending VMAs exist when they don't, this would be completely inaccurate, would break assumptions for things like mremap (which require a single VMA) and would be unworkable.
The best that _could_ be achieved is to have a marker in /proc/$pid/smaps saying 'hey this region has guard regions somewhere'.
But I haven't seen any demand for this and presumably this wouldn't help your imagined program?
I don't really understand your use case though, what programs would read /proc/maps, then... try to use /proc/$pid/mem or whatnot to arbitrarily read regions? Such applications would be in danger of SIGBUS in any case if they were to read invalid portions of file-backed mappings, and have no way of knowing this, so they seem fundamentally broken as it is?
On 19.02.25 10:03, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Hi Lorenzo,
Thank you for your work on this.
You're welcome.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
This is off-topic here but... Yes, extensively. No they do not appear there.
I thought you had attended LPC and my talk where I mentioned this purposefully as a drawback?
I went out of my way to advertise this limitation at the LPC talk, in the original series, etc. so it's a little disappointing that this is being brought up so late, but nobody else has raised objections to this issue so I think in general it's not a limitation that matters in practice.
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
No, we cannot have these, sorry.
Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this feature is to avoid having to accumulate VMAs for regions which are not intended to be accessible.
Secondly, there is no practical means for this to be accomplished in /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates they have guard regions.
This is intentional, because setting such metadata is simply not practical - why? Because when you try to split the VMA, how do you know which bit gets the metadata and which doesn't? You can't without _reading page tables_.
/proc/$pid/smaps _does_ read page tables, but we can't start pretending VMAs exist when they don't, this would be completely inaccurate, would break assumptions for things like mremap (which require a single VMA) and would be unworkable.
The best that _could_ be achieved is to have a marker in /proc/$pid/smaps saying 'hey this region has guard regions somewhere'.
And then simply expose it in /proc/$pid/pagemap, which is a better interface for this pte-level information inside of VMAs. We should still have a spare bit for that purpose in the pagemap entries.
On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
On 19.02.25 10:03, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Hi Lorenzo,
Thank you for your work on this.
You're welcome.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
This is off-topic here but... Yes, extensively. No they do not appear there.
I thought you had attended LPC and my talk where I mentioned this purposefully as a drawback?
I went out of my way to advertise this limitation at the LPC talk, in the original series, etc. so it's a little disappointing that this is being brought up so late, but nobody else has raised objections to this issue so I think in general it's not a limitation that matters in practice.
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
No, we cannot have these, sorry.
Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this feature is to avoid having to accumulate VMAs for regions which are not intended to be accessible.
Secondly, there is no practical means for this to be accomplished in /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates they have guard regions.
This is intentional, because setting such metadata is simply not practical - why? Because when you try to split the VMA, how do you know which bit gets the metadata and which doesn't? You can't without _reading page tables_.
/proc/$pid/smaps _does_ read page tables, but we can't start pretending VMAs exist when they don't, this would be completely inaccurate, would break assumptions for things like mremap (which require a single VMA) and would be unworkable.
The best that _could_ be achieved is to have a marker in /proc/$pid/smaps saying 'hey this region has guard regions somewhere'.
And then simply expose it in /proc/$pid/pagemap, which is a better interface for this pte-level information inside of VMAs. We should still have a spare bit for that purpose in the pagemap entries.
Ah yeah thanks David forgot about that!
This is also a possibility if that'd solve your problems Kalesh?
This bit will be fought over haha
On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
On 19.02.25 10:03, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Hi Lorenzo,
Thank you for your work on this.
You're welcome.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
This is off-topic here but... Yes, extensively. No they do not appear there.
I thought you had attended LPC and my talk where I mentioned this purposefully as a drawback?
I went out of my way to advertise this limitation at the LPC talk, in the original series, etc. so it's a little disappointing that this is being brought up so late, but nobody else has raised objections to this issue so I think in general it's not a limitation that matters in practice.
Sorry for raising this now, yes at the time I believe we discussed reducing the vma slab memory usage for the PROT_NONE mappings. I didn't imagine that apps could have dependencies on the mapped ELF ranges in /proc/self/[s]maps until recent breakages from a similar feature. Android itself doesn't depend on this but what I've seen is banking apps and apps that have obfuscation to prevent reverse engineering (the particulars of such obfuscation are a black box).
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
No, we cannot have these, sorry.
Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this feature is to avoid having to accumulate VMAs for regions which are not intended to be accessible.
Secondly, there is no practical means for this to be accomplished in /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates that it has guard regions.
This is intentional, because setting such metadata is simply not practical - why? Because when you try to split the VMA, how do you know which bit gets the metadata and which doesn't? You can't without _reading page tables_.
Yeah the splitting becomes complicated with any vm flags for this... meaning any attempt to expose this in /proc/*/maps has to unconditionally walk the page tables :(
/proc/$pid/smaps _does_ read page tables, but we can't start pretending VMAs exist when they don't, this would be completely inaccurate, would break assumptions for things like mremap (which require a single VMA) and would be unworkable.
The best that _could_ be achieved is to have a marker in /proc/$pid/smaps saying 'hey this region has guard regions somewhere'.
And then simply expose it in /proc/$pid/pagemap, which is a better interface for this pte-level information inside of VMAs. We should still have a spare bit for that purpose in the pagemap entries.
Ah yeah thanks David forgot about that!
This is also a possibility if that'd solve your problems Kalesh?
I'm not sure what the correct interface to advertise these is. Maybe smaps as you suggested, since we already walk the page tables there? And a pagemap bit for the exact pages as well? It won't solve this particular issue, as 1000s of in-field apps do look at this through /proc/*/maps. But maybe we have to live with that...
We can argue that such apps are broken since they may trip on the SIGBUS off the end of the file -- usually this isn't the case for the ELF segment mappings.
This is still useful for other cases, I just wanted to get some ideas if this can be extended to further use cases.
Thanks, Kalesh
This bit will be fought over haha
-- Cheers,
David / dhildenb
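(Editorial note: the applications described above read address ranges out of /proc/self/maps. Below is a minimal sketch of that kind of parsing - the sample line is invented, though the field layout follows the documented maps format. As discussed in this thread, guard regions leave these lines entirely untouched, which is precisely why such parsers cannot see them:)

```python
# Parse a single /proc/<pid>/maps line, of the form:
#   address                    perms offset   dev   inode pathname
#   7f0d2c200000-7f0d2c221000  r--p  00000000 08:01 1234  /usr/lib/libc.so.6
# The pathname (which may contain spaces) is the remainder after the
# fifth field; anonymous mappings have no pathname at all.

def parse_maps_line(line):
    """Return (start, end, perms, pathname) for one maps-format line."""
    fields = line.split(maxsplit=5)
    addr_range, perms = fields[0], fields[1]
    start_s, end_s = addr_range.split("-")
    pathname = fields[5].rstrip("\n") if len(fields) > 5 else ""
    return int(start_s, 16), int(end_s, 16), perms, pathname

if __name__ == "__main__":
    sample = "7f0d2c200000-7f0d2c221000 r--p 00000000 08:01 1234 /usr/lib/libc.so.6\n"
    print(parse_maps_line(sample))
```

A real consumer applies this to each line of /proc/self/maps; nothing in the output changes when guard regions are installed inside one of these ranges.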
On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote:
On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
On 19.02.25 10:03, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
[...]
Sorry for raising this now, yes at the time I believe we discussed reducing the vma slab memory usage for the PROT_NONE mappings. I didn't imagine that apps could have dependencies on the mapped ELF ranges in /proc/self/[s]maps until recent breakages from a similar feature. Android itself doesn't depend on this but what I've seen is banking apps and apps that have obfuscation to prevent reverse engineering (the particulars of such obfuscation are a black box).
Ack ok fair enough, sorry, but obviously you can understand it's frustrating when I went to great lengths to advertise this not only at the talk but in the original series.
Really important to have these discussions early. Not that really we can do much about this, as inherently this feature cannot give you what you need.
Is it _only_ banking apps that do this? And do they exclusively read /proc/$pid/maps? I mean there's nothing we can do about that, sorry. If that's immutable, then unless you do your own very, very, very slow custom android maps implementation (that will absolutely break the /proc/$pid/maps scalability efforts atm) this is just a no-go.
[...]
Yeah the splitting becomes complicated with any vm flags for this... meaning any attempt to expose this in /proc/*/maps has to unconditionally walk the page tables :(
It's not really complicated, it's _impossible_ unless you made literally all VMA code walk page tables for every single operation. Which we are emphatically not going to do :)
And no, /proc/$pid/maps is _never_ going to walk page tables. For obvious performance reasons.
[...]
I'm not sure what the correct interface to advertise these is. Maybe smaps as you suggested, since we already walk the page tables there? And a pagemap bit for the exact pages as well? It won't solve this particular issue, as 1000s of in-field apps do look at this through /proc/*/maps. But maybe we have to live with that...
I mean why are we even considering this if you can't change this anywhere? Confused by that.
I'm afraid upstream can't radically change interfaces to suit this scenario.
We also can't change smaps in the way you want, it _has_ to still give output per VMA information.
The proposed change that would be there would be a flag or something indicating that the VMA has guard regions _SOMEWHERE_ in it.
Since this doesn't solve your problem, adds complexity, and nobody else seems to need it, I would suggest this is not worthwhile and I'd rather not do this.
Therefore for your needs there are literally only two choices here:
1. Add a bit to /proc/$pid/pagemap, OR
2. a new interface.
I am not in favour of a new interface here, if we can just extend pagemap.
What you'd have to do is:
1. Find virtual ranges via /proc/$pid/maps.
2. Iterate through /proc/$pid/pagemap to retrieve state for all ranges.
Since anything that would retrieve guard region state would need to walk page tables, any approach would be slow and I don't think this would be any less slow than any other interface.
This way you'd be able to find all guard regions all the time.
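(Editorial note: a sketch of what step 2 involves. Each /proc/<pid>/pagemap entry is one little-endian u64 per virtual page; bits 63 (present), 62 (swapped), 61 (file-backed/shared), 55 (soft-dirty) and 0-54 (PFN, when present) are documented. No guard-region bit exists as of this discussion - PM_GUARD_HYPOTHETICAL below is a placeholder for whichever spare bit might eventually be assigned:)

```python
import struct

# /proc/<pid>/pagemap: one little-endian u64 per virtual page.
PM_PRESENT   = 1 << 63
PM_SWAPPED   = 1 << 62
PM_FILE      = 1 << 61          # file-backed or shared-anon
PM_SOFTDIRTY = 1 << 55
PM_PFN_MASK  = (1 << 55) - 1    # bits 0-54: PFN (when present; reads as 0 unprivileged)

# HYPOTHETICAL: no guard-region bit exists in pagemap as of this thread;
# bit 58 here merely stands in for whichever spare bit might be chosen.
PM_GUARD_HYPOTHETICAL = 1 << 58

def decode_pagemap_entry(raw8):
    """Decode one 8-byte pagemap entry into a dict of flags."""
    (val,) = struct.unpack("<Q", raw8)
    return {
        "present": bool(val & PM_PRESENT),
        "swapped": bool(val & PM_SWAPPED),
        "file_or_shared": bool(val & PM_FILE),
        "soft_dirty": bool(val & PM_SOFTDIRTY),
        "guard_hypothetical": bool(val & PM_GUARD_HYPOTHETICAL),
        "pfn": (val & PM_PFN_MASK) if val & PM_PRESENT else None,
    }

if __name__ == "__main__":
    # Synthetic entry: present, file-backed, PFN 0x1234.
    entry = struct.pack("<Q", PM_PRESENT | PM_FILE | 0x1234)
    print(decode_pagemap_entry(entry))
```

Real use would seek to (vaddr // page_size) * 8 in /proc/<pid>/pagemap and read one 8-byte entry per page of each range taken from /proc/<pid>/maps.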
This is just the trade-off for this feature unfortunately - its whole design ethos is to allow modification of -faulting- behaviour without having to modify -VMA- behaviour.
But if it's banking apps whose code you can't control (surprised you don't lock down these interfaces), I mean is this even useful to you?
If your requirement is 'you have to change /proc/$pid/maps to show guard regions' I mean the answer is that we can't.
We can argue that such apps are broken since they may trip on the SIGBUS off the end of the file -- usually this isn't the case for the ELF segment mappings.
Or tearing of the maps interface, or things getting unmapped or or or... It's really not a sane thing to do.
This is still useful for other cases, I just wanted to get some ideas if this can be extended to further use cases.
Well I'm glad that you guys find it useful for _something_ ;)
Again this wasn't written only for you (it is broadly a good feature for upstream), but I did have your use case in mind, so I'm a little disappointed that it doesn't help, as I like to solve problems.
But I'm glad it solves at least some for you...
On Wed, Feb 19, 2025 at 11:20 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote:
On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
On 19.02.25 10:03, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
[...]
[...]
Is it _only_ banking apps that do this? And do they exclusively read /proc/$pid/maps? I mean there's nothing we can do about that, sorry.
Not only banking apps but that's a common category.
If that's immutable, then unless you do your own very, very, very slow custom android maps implementation (that will absolutely break the /proc/$pid/maps scalability efforts atm) this is just a no-go.
Yeah unfortunately that's immutable as app versions are mostly independent from the OS version.
We do have something that handles this by encoding the guard regions in the vm_flags, but as you can imagine it's not generic enough for upstream.
[...]
We also can't change smaps in the way you want, it _has_ to still give output per VMA information.
Sorry I wasn't suggesting to change the entries in smaps, rather agreeing to your marker suggestion. Maybe a set of ranges for each smaps entry that has guards? It doesn't solve the use case, but does make these regions visible to userspace.
[...]
Therefore for your needs there are literally only two choices here:
- Add a bit to /proc/$pid/pagemap OR
- a new interface.
I am not in favour of a new interface here, if we can just extend pagemap.
What you'd have to do is:
- Find virtual ranges via /proc/$pid/maps
- iterate through /proc/$pid/pagemap to retrieve state for all ranges.
Could we also consider an smaps field like:
VmGuards: [AAA, BBB), [CCC, DDD), ...
or something of that sort?
[...]
But I'm glad it solves at least some for you...
I recall Liam had a proposal to store the guard ranges in the maple tree?
I wonder if that can be used in combination with this approach to have a better representation of this?
Thanks, Kalesh
On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
We also can't change smaps in the way you want, it _has_ to still give output per VMA information.
Sorry I wasn't suggesting to change the entries in smaps, rather agreeing to your marker suggestion. Maybe a set of ranges for each smaps entry that has guards? It doesn't solve the use case, but does make these regions visible to userspace.
No, you are not providing a use case for this. /proc/$pid/pagemap does not contaminate the smaps output, mess with efforts to make it RCU-readable, require updating the ioctl interface, etc., so it is clearly the better choice.
[...]
Could we also consider an smaps field like:
VmGuards: [AAA, BBB), [CCC, DDD), ...
or something of that sort?
No, absolutely, categorically not. You realise these could be thousands of characters long right?
/proc/$pid/pagemap resolves this without contaminating this output.
[...]
I recall Liam had a proposal to store the guard ranges in the maple tree?
I wonder if that can be used in combination with this approach to have a better representation of this?
This was an alternative proposal made prior to the feature being implemented (and you and others at Google were welcome to comment and many were cc'd, etc.).
There is no 'in combination with'. This feature would take weeks/months to implement, fundamentally impact the maple tree VMA implementation and... not actually achieve anything + immediately be redundant.
Plus it'd likely be slower, have locking implications, would have kernel memory allocation implications, a lot more complexity and probably other problems besides (we discussed this at length at the time and a number of issues came up, I can't recall all of them).
To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to lie about VMAs regardless of underlying implementation, nor adding thousands of characters to /proc/$pid/smaps entries.
This is independent of implementation and would have been the case had we gone with a maple node version.
So in no world is your problem solved here, unfortunately you have inextricably tied yourself to a VMA representation here.
I still wonder if you could find some means of abstracting this, but /proc/$pid/pagemap is where I am likely to expose this information for anybody who needs it, and will likely send a series for this relatively soon.
If you _can_ abstract this in some way, then if we provide this information _anywhere_ you can get it.
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is guarded. But I won't do that unless somebody has an -actual use case- for it.
David's /proc/$pid/pagemap suggestion is excellent, avoids all the pitfalls, exposes guard regions to anybody who really really wants to know and doesn't interfere with anything else, so this is what we'll go with.
Regards, Lorenzo
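(Editorial note: the 'Grd'-style marker discussed above would most naturally surface in the existing VmFlags line of /proc/<pid>/smaps. The sketch below scans smaps-format text for a given mnemonic; the "gd" flag in the sample is invented for illustration - no such flag exists as of this thread - while the other mnemonics and the VmFlags line format are real:)

```python
# Scan smaps-format text for entries whose "VmFlags:" line carries a
# given two-letter mnemonic (e.g. "rd", "wr", "mr" are real flags).
# A hypothetical guard marker - "gd" here - would be reported the same
# way, telling the reader only that guard regions exist SOMEWHERE in
# the VMA, not where.

def vmas_with_flag(smaps_text, flag):
    """Yield the address range of each smaps entry carrying `flag`."""
    header = None
    for line in smaps_text.splitlines():
        first = line.split(" ", 1)[0]
        if "-" in first and not line.startswith("VmFlags"):
            header = first          # entry header, e.g. "7f0d...-7f0d..."
        elif line.startswith("VmFlags:"):
            if flag in line.split()[1:] and header:
                yield header

if __name__ == "__main__":
    sample = (
        "7f0d2c200000-7f0d2c221000 r--p 00000000 08:01 1234 /usr/lib/libc.so.6\n"
        "Size:                132 kB\n"
        "VmFlags: rd mr mw me gd\n"   # "gd" is the hypothetical guard flag
        "7f0d2c221000-7f0d2c300000 rw-p 00000000 00:00 0\n"
        "VmFlags: rd wr mr mw me\n"
    )
    print(list(vmas_with_flag(sample, "gd")))
```

As the thread notes, even with such a flag a consumer would still have to walk /proc/<pid>/pagemap to locate the individual guard-region pages.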
On 20.02.25 09:51, Lorenzo Stoakes wrote:
[...]
To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to lie about VMAs regardless of underlying implementation, nor adding thousands of characters to /proc/$pid/smaps entries.
Yes. Calling it a "guard region" might be part of the problem (/"misunderstanding"), because it reminds people of "virtual memory regions".
"Guard markers" or similar might have been clearer that these operate on individual PTEs, require page table scanning etc ... which makes them a lot more scalable and fine-grained and provides all these benefits, with the downside being that we don't end up with that many "virtual memory regions" that maps/smaps operate on.
[...]
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is guarded. But I won't do that unless somebody has an -actual use case- for it.
Right, and that would limit where you have to manually scan. Something similar is being done with uffd-wp markers IIRC.
On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
On 20.02.25 09:51, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
We also can't change smaps in the way you want, it _has_ to still give output per VMA information.
Sorry I wasn't suggesting to change the entries in smaps, rather agreeing to your marker suggestion. Maybe a set of ranges for each smaps entry that has guards? It doesn't solve the use case, but does make these regions visible to userspace.
No, you are not providing a usecase for this. /proc/$pid/pagemaps does not contaminate the smaps output, mess with efforts to make it RCU readable, require updating the ioctl interface, etc. so it is clearly the better choice.
The proposed change that would be there would be a flag or something indicating that the VMA has guard regions _SOMEWHERE_ in it.
Since this doesn't solve your problem, adds complexity, and nobody else seems to need it, I would suggest this is not worthwhile and I'd rather not do this.
Therefore for your needs there are literally only two choices here:
- Add a bit to /proc/$pid/pagemap OR
- a new interface.
I am not in favour of a new interface here, if we can just extend pagemap.
What you'd have to do is:
- Find virtual ranges via /proc/$pid/maps
- iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
Could we also consider an smaps field like:
VmGuards: [AAA, BBB), [CCC, DDD), ...
or something of that sort?
No, absolutely, categorically not. You realise these could be thousands of characters long right?
/proc/$pid/pagemaps resolves this without contaminating this output.
Well I'm glad that you guys find it useful for _something_ ;)
Again this wasn't written only for you (it is broadly a good feature for upstream), but I did have your use case in mind, so I'm a little disappointed that it doesn't help, as I like to solve problems.
But I'm glad it solves at least some for you...
I recall Liam had a proposal to store the guard ranges in the maple tree?
I wonder if that can be used in combination with this approach to have a better representation of this?
This was an alternative proposal made prior to the feature being implemented (and you and others at Google were welcome to comment and many were cc'd, etc.).
There is no 'in combination with'. This feature would take weeks/months to implement, fundamentally impact the maple tree VMA implementation and... not actually achieve anything + immediately be redundant.
Plus it'd likely be slower, have locking implications, would have kernel memory allocation implications, a lot more complexity and probably other problems besides (we discussed this at length at the time and a number of issues came up, I can't recall all of them).
To be crystal clear - we are empathically NOT changing /proc/$pid/maps to lie about VMAs regardless of underlying implementation, nor adding thousands of characters to /proc/$pid/smaps entries.
Yes. Calling it a "guard region" might be part of the problem (/"misunderstanding"), because it reminds people of "virtual memory regions".
"Guard markers" or similar might have been clearer that these operate on individual PTEs, require page table scanning etc ... which makes them a lot more scalable and fine-grained and provides all these benefits, with the downside being that we don't end up with that many "virtual memory regions" that maps/smaps operate on.
Honestly David you and the naming... :P
I disagree, sorry. Saying 'guard' anything might make people think one thing or another. We can't account for that. I mean don't get me started on 'pinning' or any of the million other overloaded terms we use...
I _hugely_ publicly went out of my way to express the limitations, I gave a talk, we had meetings, I mentioned it in the series.
Honestly if at that point you still don't realise, that's not a naming problem. It's a 'did not participate with upstream' problem.
I like guard regions, as they're not pages as we previously referred to them. People have no idea what a marker is, it doesn't sound like it spans ranges, no don't like it sorry.
And sorry but this naming topic is closed :) I already let you change the naming of the MADV_'s, which broke my heart, there will not be a second heart breaking...
[...]
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is guarded. But I won't do that unless somebody has an -actual use case- for it.
Right, and that would limit where you have to manually scan. Something similar is being done with uffd-wp markers IIRC.
Yeah that's a good point, but honestly if you're reading smaps that reads the page tables, then reading /proc/$pid/pagemap and reading page tables TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemap and reading page tables once.
On 20.02.25 10:04, Lorenzo Stoakes wrote:
On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
On 20.02.25 09:51, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
We also can't change smaps in the way you want, it _has_ to still give output per VMA information.
Sorry I wasn't suggesting to change the entries in smaps, rather agreeing to your marker suggestion. Maybe a set of ranges for each smaps entry that has guards? It doesn't solve the use case, but does make these regions visible to userspace.
No, you are not providing a usecase for this. /proc/$pid/pagemaps does not contaminate the smaps output, mess with efforts to make it RCU readable, require updating the ioctl interface, etc. so it is clearly the better choice.
The proposed change that would be there would be a flag or something indicating that the VMA has guard regions _SOMEWHERE_ in it.
Since this doesn't solve your problem, adds complexity, and nobody else seems to need it, I would suggest this is not worthwhile and I'd rather not do this.
Therefore for your needs there are literally only two choices here:
- Add a bit to /proc/$pid/pagemap OR
- a new interface.
I am not in favour of a new interface here, if we can just extend pagemap.
What you'd have to do is:
- Find virtual ranges via /proc/$pid/maps
- iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
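The two-step scan just described (virtual ranges from maps, per-page state from pagemap) can be sketched in a few lines of Python. Note the guard bit itself is precisely what is being proposed in this thread, so `PM_GUARD_REGION_BIT` below is an assumption (later kernels adopted bit 58 for it); pagemap entries are 64-bit little-endian words, one per page.

```python
import struct

PAGE_SIZE = 4096
PM_GUARD_REGION_BIT = 58  # assumption: the proposed guard bit (bit 58 in later kernels)

def parse_maps_line(line):
    """Return the (start, end) virtual range from one /proc/<pid>/maps line."""
    start, end = line.split(None, 1)[0].split("-")
    return int(start, 16), int(end, 16)

def guarded_pages(pid="self"):
    """Yield virtual addresses of pages whose pagemap entry has the guard bit set."""
    with open(f"/proc/{pid}/maps") as maps, \
         open(f"/proc/{pid}/pagemap", "rb") as pagemap:
        for line in maps:
            start, end = parse_maps_line(line)
            # Each page has one 8-byte entry at offset (vaddr / PAGE_SIZE) * 8.
            pagemap.seek((start // PAGE_SIZE) * 8)
            data = pagemap.read(((end - start) // PAGE_SIZE) * 8)
            for i in range(len(data) // 8):
                entry, = struct.unpack_from("<Q", data, i * 8)
                if entry & (1 << PM_GUARD_REGION_BIT):
                    yield start + i * PAGE_SIZE
```

This is the efficiency point being made in the thread: maps is cheap to read, and the page tables are consulted only once, via a single pass over pagemap.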
Could we also consider an smaps field like:
VmGuards: [AAA, BBB), [CCC, DDD), ...
or something of that sort?
No, absolutely, categorically not. You realise these could be thousands of characters long right?
/proc/$pid/pagemaps resolves this without contaminating this output.
Well I'm glad that you guys find it useful for _something_ ;)
Again this wasn't written only for you (it is broadly a good feature for upstream), but I did have your use case in mind, so I'm a little disappointed that it doesn't help, as I like to solve problems.
But I'm glad it solves at least some for you...
I recall Liam had a proposal to store the guard ranges in the maple tree?
I wonder if that can be used in combination with this approach to have a better representation of this?
This was an alternative proposal made prior to the feature being implemented (and you and others at Google were welcome to comment and many were cc'd, etc.).
There is no 'in combination with'. This feature would take weeks/months to implement, fundamentally impact the maple tree VMA implementation and... not actually achieve anything + immediately be redundant.
Plus it'd likely be slower, have locking implications, would have kernel memory allocation implications, a lot more complexity and probably other problems besides (we discussed this at length at the time and a number of issues came up, I can't recall all of them).
To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to lie about VMAs regardless of underlying implementation, nor adding thousands of characters to /proc/$pid/smaps entries.
Yes. Calling it a "guard region" might be part of the problem (/"misunderstanding"), because it reminds people of "virtual memory regions".
"Guard markers" or similar might have been clearer that these operate on individual PTEs, require page table scanning etc ... which makes them a lot more scalable and fine-grained and provides all these benefits, with the downside being that we don't end up with that many "virtual memory regions" that maps/smaps operate on.
Honestly David you and the naming... :P
I disagree, sorry. Saying 'guard' anything might make people think one thing or another. We can't account for that. I mean don't get me started on 'pinning' or any of the million other overloaded terms we use...
I _hugely_ publicly went out of my way to express the limitations, I gave a talk, we had meetings, I mentioned it in the series.
Honestly if at that point you still don't realise, that's not a naming problem. It's a 'did not participate with upstream' problem.
I like guard regions, as they're not pages as we previously referred to them. People have no idea what a marker is, it doesn't sound like it spans ranges; no, I don't like it, sorry.
And sorry but this naming topic is closed :) I already let you change the naming of the MADV_'s, which broke my heart, there will not be a second heart breaking...
Lorenzo, I was not pushing for it to be changed *now*, that ship has sailed, and personally, I *don't* find it confusing because I know how it works under the hood.
I was trying to find a reason *why* people would think that it would show up in smaps in the first place. For example, just when reading the man page *today*.
Doesn't really matter now, it is named the way it is, and all we can do is try making documentation clearer if it keeps confusing people.
Your conclusion is 'did not participate with upstream'; I don't agree with that. But maybe you and Kalesh have a history on that that lets you react to his questions, IMHO, more emotionally than you should have.
[...]
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is guarded. But I won't do that unless somebody has an -actual use case- for it.
Right, and that would limit where you have to manually scan. Something similar is being done with uffd-wp markers IIRC.
Yeah that's a good point, but honestly if you're reading smaps that reads the page tables, then reading /proc/$pid/pagemaps and reading page tables TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemaps and reading page tables once.
Right; I recently wished that we would have an interface to obtain more VMA flags without having to go through smaps
On Thu, Feb 20, 2025 at 10:23:57AM +0100, David Hildenbrand wrote:
On 20.02.25 10:04, Lorenzo Stoakes wrote:
On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
On 20.02.25 09:51, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
We also can't change smaps in the way you want, it _has_ to still give output per VMA information.
Sorry I wasn't suggesting to change the entries in smaps, rather agreeing to your marker suggestion. Maybe a set of ranges for each smaps entry that has guards? It doesn't solve the use case, but does make these regions visible to userspace.
No, you are not providing a usecase for this. /proc/$pid/pagemaps does not contaminate the smaps output, mess with efforts to make it RCU readable, require updating the ioctl interface, etc. so it is clearly the better choice.
The proposed change that would be there would be a flag or something indicating that the VMA has guard regions _SOMEWHERE_ in it.
Since this doesn't solve your problem, adds complexity, and nobody else seems to need it, I would suggest this is not worthwhile and I'd rather not do this.
Therefore for your needs there are literally only two choices here:
- Add a bit to /proc/$pid/pagemap OR
- a new interface.
I am not in favour of a new interface here, if we can just extend pagemap.
What you'd have to do is:
- Find virtual ranges via /proc/$pid/maps
- iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
Could we also consider an smaps field like:
VmGuards: [AAA, BBB), [CCC, DDD), ...
or something of that sort?
No, absolutely, categorically not. You realise these could be thousands of characters long right?
/proc/$pid/pagemaps resolves this without contaminating this output.
Well I'm glad that you guys find it useful for _something_ ;)
Again this wasn't written only for you (it is broadly a good feature for upstream), but I did have your use case in mind, so I'm a little disappointed that it doesn't help, as I like to solve problems.
But I'm glad it solves at least some for you...
I recall Liam had a proposal to store the guard ranges in the maple tree?
I wonder if that can be used in combination with this approach to have a better representation of this?
This was an alternative proposal made prior to the feature being implemented (and you and others at Google were welcome to comment and many were cc'd, etc.).
There is no 'in combination with'. This feature would take weeks/months to implement, fundamentally impact the maple tree VMA implementation and... not actually achieve anything + immediately be redundant.
Plus it'd likely be slower, have locking implications, would have kernel memory allocation implications, a lot more complexity and probably other problems besides (we discussed this at length at the time and a number of issues came up, I can't recall all of them).
To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to lie about VMAs regardless of underlying implementation, nor adding thousands of characters to /proc/$pid/smaps entries.
Yes. Calling it a "guard region" might be part of the problem (/"misunderstanding"), because it reminds people of "virtual memory regions".
"Guard markers" or similar might have been clearer that these operate on individual PTEs, require page table scanning etc ... which makes them a lot more scalable and fine-grained and provides all these benefits, with the downside being that we don't end up with that many "virtual memory regions" that maps/smaps operate on.
Honestly David you and the naming... :P
I disagree, sorry. Saying 'guard' anything might make people think one thing or another. We can't account for that. I mean don't get me started on 'pinning' or any of the million other overloaded terms we use...
I _hugely_ publicly went out of my way to express the limitations, I gave a talk, we had meetings, I mentioned it in the series.
Honestly if at that point you still don't realise, that's not a naming problem. It's a 'did not participate with upstream' problem.
I like guard regions, as they're not pages as we previously referred to them. People have no idea what a marker is, it doesn't sound like it spans ranges; no, I don't like it, sorry.
And sorry but this naming topic is closed :) I already let you change the naming of the MADV_'s, which broke my heart, there will not be a second heart breaking...
Lorenzo, I was not pushing for it to be changed *now*, that ship has sailed, and personally, I *don't* find it confusing because I know how it works under the hood.
I was trying to find a reason *why* people would think that it would show up in smaps in the first place. For example, just when reading the man page *today*.
Doesn't really matter now, it is named the way it is, and all we can do is try making documentation clearer if it keeps confusing people.
Right, but I disagree with your analysis. In any case, as you say, it doesn't really matter at this stage.
Your conclusion is 'did not participate with upstream'; I don't agree with that. But maybe you and Kalesh have a history on that that lets you react to his questions, IMHO, more emotionally than you should have.
This is wholly unfair, I have been very reasonable in response to this thread. I have offered to find solutions, I have tried to understand the problem in spite of having gone to great lengths to try to discuss the limitations of the proposed approach in every venue I possibly could.
I go out of my way to deal professionally and objectively with what is presented. Nothing here is emotional. So I'd ask that you please abstain from making commentary like this which has no basis.
My point about not participating with upstream is that Kalesh is now asking for an -entirely different- approach to guard regions than was implemented, months after it was merged.
This is after many discussions were had, including with people at Google among other organisations, an RFC which clearly delineated the limitations, and a talk at LPC also.
I feel I communicate these things better than many people, actually; I go out of my way to document, add extensive self-documenting tests, etc. etc., and to participate, engage and take onboard feedback from others.
So I'm suggesting that clearly - something broke down here. There was a miscommunication, or there was a lack of awareness of a key requirement.
I mean a large motivator for file-backed support here came from the LPC talk given by Kalesh and Juan re: ELF guard regions. So it is obviously very unfortunate that this breakdown in communication with upstream occurred.
I am not blaming anybody or being 'emotional'. I am simply stating what seems to me to be a clear fact.
I genuinely don't understand how it could be seen any other way? How can requesting an entirely different alternative approach months after the fact be anything other than some failure to engage with upstream?
I am emphatically _not_ blaming Kalesh by the way, whom I respect and with whom I have no problem, in any way whatsoever. I apologised when I realised that he simply was not aware of this limitation at LPC if you look through the thread, having politely suggested disappointment at this not having been brought up then.
I am suggesting that within Google clearly there has been some form of miscommunication or failure for one aspect of things (this limitation) to be expressed at the right time to engage with upstream.
Sorry if you misinterpreted that as something else, this is the _only_ point I am making here.
And again, I am going out of my way to find practical and helpful ways forward - though I cannot see how we can possibly fulfil Kalesh's needs here.
[...]
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is guarded. But I won't do that unless somebody has an -actual use case- for it.
Right, and that would limit where you have to manually scan. Something similar is being done with uffd-wp markers IIRC.
Yeah that's a good point, but honestly if you're reading smaps that reads the page tables, then reading /proc/$pid/pagemaps and reading page tables TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemaps and reading page tables once.
Right; I recently wished that we would have an interface to obtain more VMA flags without having to go through smaps
Well maybe that lends itself to the idea of adding a whole new interface in general...
Your conclusion is 'did not participate with upstream'; I don't agree with that. But maybe you and Kalesh have a history on that that lets you react to his questions, IMHO, more emotionally than you should have.
This is wholly unfair, I have been very reasonable in response to this thread. I have offered to find solutions, I have tried to understand the problem in spite of having gone to great lengths to try to discuss the limitations of the proposed approach in every venue I possibly could.
I go out of my way to deal professionally and objectively with what is presented. Nothing here is emotional. So I'd ask that you please abstain from making commentary like this which has no basis.
I appreciate everything you write below. But this request is just impossible. I will keep raising my opinion and misunderstandings will happen.
Note that the whole "Honestly David you and the naming..." thing could have been written as "I don't think it's a naming problem."
[...]
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is guarded. But I won't do that unless somebody has an -actual use case- for it.
Right, and that would limit where you have to manually scan. Something similar is being done with uffd-wp markers IIRC.
Yeah that's a good point, but honestly if you're reading smaps that reads the page tables, then reading /proc/$pid/pagemaps and reading page tables TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemaps and reading page tables once.
Right; I recently wished that we would have an interface to obtain more VMA flags without having to go through smaps
Well maybe that lends itself to the idea of adding a whole new interface in general...
An extended "maps" interface might be reasonable, that allows for exposing more things without walking the page tables. (e.g., flags)
Maybe one could have an indicator that says "ever had guard regions in this mapping" without actually walking the page tables.
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
Your conclusion is 'did not participate with upstream'; I don't agree with that. But maybe you and Kalesh have a history on that that lets you react to his questions, IMHO, more emotionally than you should have.
This is wholly unfair, I have been very reasonable in response to this thread. I have offered to find solutions, I have tried to understand the problem in spite of having gone to great lengths to try to discuss the limitations of the proposed approach in every venue I possibly could.
I go out of my way to deal professionally and objectively with what is presented. Nothing here is emotional. So I'd ask that you please abstain from making commentary like this which has no basis.
I appreciate everything you write below. But this request is just impossible. I will keep raising my opinion and misunderstandings will happen.
Well I wouldn't ask you not to express your opinion David, you know I respect and like you, and by all means push back hard or call out what you think is bad behaviour :)
I just meant to say, in my view, that there was no basis, but I appreciate miscommunications happen.
So apologies if I came off as being difficult or rude, it actually wasn't intended. And to re-emphasise - I have zero personal issue with anybody in this thread whatsoever!
I just want to find the best way forward, technically and am willing to do whatever work is required to make the guard region implementation as good as it possibly can be.
Note that the whole "Honestly David you and the naming..." thing could have been written as "I don't think it's a naming problem."
I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek' style, which is what this was meant to be... so I think herein lies the basis of the miscommunication :)
I apologise, the household is ill, which maybe affects my judgment in how I write these, but in general text is a very poor medium. It was meant to be said in a jolly tone with a wink...
I think maybe I should learn my lesson with these things, I thought the ':p' would make this clear but yeah, text, poor medium.
Anyway apologies if this seemed disrespectful.
[...]
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is guarded. But I won't do that unless somebody has an -actual use case- for it.
Right, and that would limit where you have to manually scan. Something similar is being done with uffd-wp markers IIRC.
Yeah that's a good point, but honestly if you're reading smaps that reads the page tables, then reading /proc/$pid/pagemaps and reading page tables TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemaps and reading page tables once.
Right; I recently wished that we would have an interface to obtain more VMA flags without having to go through smaps
Well maybe that lends itself to the idea of adding a whole new interface in general...
An extended "maps" interface might be reasonable, that allows for exposing more things without walking the page tables. (e.g., flags)
Maybe one could have an indicator that says "ever had guard regions in this mapping" without actually walking the page tables.
Yeah this is something we've discussed before, but it's a little fraught. Let's say it was a VMA flag; in this case we'd have to make this flag 'sticky' and not impact merging (easy enough) to account for splits/merges.
The problem comes in that we would then need to acquire the VMA write lock to do it, something we don't currently require on application of guard regions.
We'd also have to make sure nothing else makes any assumptions about VMA flags implying differences in VMAs in this one instance (though we do already do this for VM_SOFTDIRTY).
I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
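To illustrate the 'sticky' semantics being discussed, here is a small userspace model (plain Python, not kernel code; VM_MAYBE_GUARD is a hypothetical flag standing in for the VM_MAYBE_GUARD_REGIONS idea): the flag is masked out when checking whether two adjacent VMAs may merge, but is OR-ed into the merged VMA so the hint is never lost, which is the same treatment VM_SOFTDIRTY gets.

```python
VM_READ = 0x1
VM_WRITE = 0x2
VM_MAYBE_GUARD = 0x4   # hypothetical sticky flag, per the discussion above
STICKY_FLAGS = VM_MAYBE_GUARD

def can_merge(a, b):
    """Adjacent VMAs may merge; sticky flags are masked out of the check,
    so having the flag set never blocks a merge."""
    return (a["end"] == b["start"] and
            (a["flags"] & ~STICKY_FLAGS) == (b["flags"] & ~STICKY_FLAGS))

def merge(a, b):
    """The merged VMA inherits sticky flags from both sides, so the
    'maybe has guard regions' hint survives splits and re-merges."""
    return {"start": a["start"], "end": b["end"],
            "flags": a["flags"] | b["flags"]}
```

The trade-off modelled here is the one raised later in the thread: the flag only ever says "somewhere in this VMA", so after splits it can be conservatively wrong, and clearing it would require rescanning page tables.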
On 20.02.25 11:15, Lorenzo Stoakes wrote:
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
Your conclusion is 'did not participate with upstream'; I don't agree with that. But maybe you and Kalesh have a history on that that lets you react to his questions, IMHO, more emotionally than you should have.
This is wholly unfair, I have been very reasonable in response to this thread. I have offered to find solutions, I have tried to understand the problem in spite of having gone to great lengths to try to discuss the limitations of the proposed approach in every venue I possibly could.
I go out of my way to deal professionally and objectively with what is presented. Nothing here is emotional. So I'd ask that you please abstain from making commentary like this which has no basis.
I appreciate everything you write below. But this request is just impossible. I will keep raising my opinion and misunderstandings will happen.
Well I wouldn't ask you not to express your opinion David, you know I respect and like you, and by all means push back hard or call out what you think is bad behaviour :)
I just meant to say, in my view, that there was no basis, but I appreciate miscommunications happen.
So apologies if I came off as being difficult or rude, it actually wasn't intended. And to re-emphasise - I have zero personal issue with anybody in this thread whatsoever!
It sounded to me like you were trying to defend your work (again, IMHO too emotionally, and I might have completely misinterpreted that) and slowly switching to "friendly fire" (towards me). Apologies from my side if I completely misunderstood/misinterpreted that.
To recap: what we have upstream is great; you did a great job. Yes, the mechanism has its drawbacks, but that's just part of the design.
Some people maybe have wrong expectations, maybe there were misunderstandings, or maybe there are requirements that only now pop up; it's sometimes unavoidable, and that's ok.
We can try to document it better (and I was trying to find clues why people might be misled), and see if/how we could sort out these requirements. But we can likely not make it perfect in every possible way (I'm sure there are plenty of use cases where what we currently have is more than sufficient).
I just want to find the best way forward, technically, and am willing to do whatever work is required to make the guard region implementation as good as it possibly can be.
Note that the whole "Honestly David you and the naming..." thing could have been written as "I don't think it's a naming problem."
I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek' style, which is what this was meant to be... so I think herein lies the basis of the miscommunication :)
I apologise, the household is ill, which maybe affects my judgment in how I write these, but in general text is a very poor medium. It was meant to be said in a jolly tone with a wink...
I think maybe I should learn my lesson with these things, I thought the ':p' would make this clear but yeah, text, poor medium.
Anyway apologies if this seemed disrespectful.
No worries, it's hard to really make me angry, and I appreciate your openness and your apology (well, and you and your work, obviously).
I'll note, though, if my memory serves me right, that nobody so far ever criticized the way I communicate upstream, or even told me to abstain from certain communication.
That probably hurt most, now that a couple of hours passed. Nothing that a couple of beers and a bit of self-reflection on my communication style can't fix ;)
[...]
Yeah that's a good point, but honestly if you're reading smaps that reads the page tables, then reading /proc/$pid/pagemaps and reading page tables TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemaps and reading page tables once.
Right; I recently wished that we would have an interface to obtain more VMA flags without having to go through smaps
Well maybe that lends itself to the idea of adding a whole new interface in general...
An extended "maps" interface might be reasonable, that allows for exposing more things without walking the page tables. (e.g., flags)
Maybe one could have an indicator that says "ever had guard regions in this mapping" without actually walking the page tables.
Yeah this is something we've discussed before, but it's a little fraught. Let's say it was a VMA flag; in this case we'd have to make this flag 'sticky' and not impact merging (easy enough) to account for splits/merges.
The problem comes in that we would then need to acquire the VMA write lock to do it, something we don't currently require on application of guard regions.
Right, and we shouldn't write-lock the mmap. We'd need some way to just atomically set such an indicator on a VMA.
I'll also note that it might be helpful for smallish regions, but for large ones especially (including when they are split and the indicator is wrong) it's less helpful. I don't have to tell you about the VMA merging implications; probably it would be like VM_SOFTDIRTY handling :)
We'd also have to make sure nothing else makes any assumptions about VMA flags implying differences in VMAs in this one instance (though we do already do this for VM_SOFTDIRTY).
I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
Yes.
On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
On 20.02.25 11:15, Lorenzo Stoakes wrote:
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
Your conclusion is 'did not participate with upstream'; I don't agree with that. But maybe you and Kalesh have a history on that that lets you react to his questions, IMHO, more emotionally than you should have.
This is wholly unfair, I have been very reasonable in response to this thread. I have offered to find solutions, I have tried to understand the problem in spite of having gone to great lengths to try to discuss the limitations of the proposed approach in every venue I possibly could.
I go out of my way to deal professionally and objectively with what is presented. Nothing here is emotional. So I'd ask that you please abstain from making commentary like this which has no basis.
I appreciate everything you write below. But this request is just impossible. I will keep raising my opinion and misunderstandings will happen.
Well I wouldn't ask you not to express your opinion David, you know I respect and like you, and by all means push back hard or call out what you think is bad behaviour :)
I just meant to say, in my view, that there was no basis, but I appreciate miscommunications happen.
So apologies if I came off as being difficult or rude, it actually wasn't intended. And to re-emphasise - I have zero personal issue with anybody in this thread whatsoever!
It sounded to me like you were trying to defend your work (again, IMHO too emotionally, and I might have completely misinterpreted that) and slowly switching to "friendly fire" (towards me). Apologies from my side if I completely misunderstood/misinterpreted that.
Right this was not at all my intent, sorry if it seemed that way. I may well have communicated terribly, so apologies on my side too.
To recap: what we have upstream is great; you did a great job. Yes, the mechanism has its drawbacks, but that's just part of the design.
Thanks :)
Some people maybe have wrong expectations, maybe there were misunderstandings, or maybe there are requirements that only now pop up; it's sometimes unavoidable, and that's ok.
We can try to document it better (and I was trying to find clues why people might be misled), and see if/how we could sort out these requirements. But we can likely not make it perfect in every possible way (I'm sure there are plenty of use cases where what we currently have is more than sufficient).
Sure, and I'm very open to adding a documentation page for guard regions; in fact I was considering this very thing recently. I already added man pages, but it would be good to be able to go into more depth.
I just want to find the best way forward, technically, and am willing to do whatever work is required to make the guard region implementation as good as it possibly can be.
Note that the whole "Honestly David you and the naming..." thing could have been written as "I don't think it's a naming problem."
I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek' style, which is what this was meant to be... so I think herein lies the basis of the miscommunication :)
I apologise, the household is ill, which maybe affects my judgment in how I write these, but in general text is a very poor medium. It was meant to be said in a jolly tone with a wink...
I think maybe I should learn my lesson with these things, I thought the ':p' would make this clear but yeah, text, poor medium.
Anyway apologies if this seemed disrespectful.
No worries, it's hard to really make me angry, and I appreciate your openness and your apology (well, and you and your work, obviously).
I'll note, though, if my memory serves me right, that nobody so far ever criticized the way I communicate upstream, or even told me to abstain from certain communication.
I wish I could say the same haha, so perhaps this was a problem on my side honestly. I do have a habit of being 'tongue in cheek' and failing to communicate that which I did say the last time I wouldn't repeat. It is not intended, I promise.
As for 'abstain', it was more a British turn of phrase, meaning to say: I dispute the claim that this is an emotional thing, and please don't say it is if it isn't so.
But I understand that of course, you may have interpreted it as so, due to my having failed to communicate it well.
Again, I must say, text remains replete with possibilities for miscommunication, misunderstanding and it can so often be difficult to communicate one's intent.
But again of course, I apologise if I overstepped the line in any way!
That probably hurt most, now that a couple of hours passed. Nothing that a couple of beers and a bit of self-reflection on my communication style can't fix ;)
Ugh sorry, man. Not my intent. And it seems I literally OWE YOU pints now. :) We will fix this at LSF...
Perhaps owe Kalesh some too should he be there... will budget accordingly... :P
[...]
Yeah that's a good point, but honestly if you're reading smaps that reads the page tables, then reading /proc/$pid/pagemaps and reading page tables TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemaps and reading page tables once.
Right; I recently wished that we would have an interface to obtain more VMA flags without having to go through smaps
Well maybe that lends itself to the idea of adding a whole new interface in general...
An extended "maps" interface might be reasonable, that allows for exposing more things without walking the page tables. (e.g., flags)
Maybe one could have an indicator that says "ever had guard regions in this mapping" without actually walking the page tables.
Yeah this is something we've discussed before, but it's a little fraught. Let's say it was a VMA flag; in this case we'd have to make this flag 'sticky' and not impact merging (easy enough) to account for splits/merges.
The problem comes in that we would then need to acquire the VMA write lock to do it, something we don't currently require on application of guard regions.
Right, and we shouldn't write-lock the mmap. We'd need some way to just atomically set such an indicator on a VMA.
Hm yeah, could be tricky; we definitely can't manage a new field in vm_area_struct. This is a very sensitive subject at the moment, what with Suren's work on VMAs allocated via SLAB_TYPESAFE_BY_RCU, putting the lock into the VMA, and the alignment requirements.
Not sure what precedent we'd have with atomic setting of a VMA flag for this... could be tricky.
I'll also note that it might be helpful for smallish regions, but for large ones especially (including when they are split and the indicator is wrong) it's less helpful. I don't have to tell you about the VMA merging implications; probably it would be like VM_SOFTDIRTY handling :)
Yeah indeed, now we've simplified merging a lot of possibilities emerge, and this is one!
We'd also have to make sure nothing else makes any assumptions about VMA flags implying differences in VMAs in this one instance (though we do already do this for VM_SOFTDIRTY).
I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
Yes.
Best, Lorenzo
On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
On 20.02.25 11:15, Lorenzo Stoakes wrote:
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
Your conclusion is 'did not participate with upstream'; I don't agree with that. But maybe you and Kalesh have a history on that that lets you react to his questions, IMHO, more emotionally than you should have.
This is wholly unfair, I have been very reasonable in response to this thread. I have offered to find solutions, I have tried to understand the problem in spite of having gone to great lengths to try to discuss the limitations of the proposed approach in every venue I possibly could.
I go out of my way to deal professionally and objectively with what is presented. Nothing here is emotional. So I'd ask that you please abstain from making commentary like this which has no basis.
I appreciate everything you write below. But this request is just impossible. I will keep raising my opinion and misunderstandings will happen.
Well I wouldn't ask you not to express your opinion David, you know I respect and like you, and by all means push back hard or call out what you think is bad behaviour :)
I just meant to say, in my view, that there was no basis, but I appreciate miscommunications happen.
So apologies if I came off as being difficult or rude, it actually wasn't intended. And to re-emphasise - I have zero personal issue with anybody in this thread whatsoever!
It sounded to me like you were trying to defend your work (again, IMHO too emotionally, and I might have completely misinterpreted that) and slowly switching to "friendly fire" (towards me). Apologies from my side if I completely misunderstood/misinterpreted that.
Right this was not at all my intent, sorry if it seemed that way. I may well have communicated terribly, so apologies on my side too.
Sorry for being late to the party. Was sick for a couple of days. Lorenzo is right, there was a breakdown in communication at Google and he has every right to be upset. The issue with obfuscators should have been communicated once it was discovered. I was in regular discussions with Lorenzo but wasn't directly involved with this particular project and wasn't aware or did not realize that the obfuscator issue renders guards unusable for this usecase. My apologies, I should have asked more questions about it. I suspect Lorenzo would have implemented this anyway...
To make guard regions work for this usecase, first we (Android) need to abstract /proc/pid/maps accesses. Only then we can use additional interfaces like /proc/pid/pagemap to obtain guard region information. I'll start figuring out what it takes to insert such an abstraction. Thanks, Suren.
To recap: what we have upstream is great; you did a great job. Yes, the mechanism has its drawbacks, but that's just part of the design.
Thanks :)
Some people maybe have wrong expectations, maybe there were misunderstandings, or maybe there are requirements that only now pop up; it's sometimes unavoidable, and that's ok.
We can try to document it better (and I was trying to find clues why people might be misled), and see if/how we could sort out these requirements. But we can likely not make it perfect in any possible way (I'm sure there are plenty of use cases where what we currently have is more than sufficient).
Sure, and I'm very open to adding a documentation page for guard regions, in fact was considering this very thing recently. I already added man pages, but it'd be good to be able to go into more depth.
I just want to find the best way forward, technically, and am willing to do whatever work is required to make the guard region implementation as good as it possibly can be.
Note that the whole "Honestly David you and the naming..." thing could have been written as "I don't think it's a naming problem."
I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek' style, which is what this was meant to be... so I think herein lies the basis of the miscommunication :)
I apologise, the household is ill, which maybe affects my judgment in how I write these, but in general text is a very poor medium. It was meant to be said in a jolly tone with a wink...
I think maybe I should learn my lesson with these things, I thought the ':p' would make this clear but yeah, text, poor medium.
Anyway apologies if this seemed disrespectful.
No worries, it's hard to really make me angry, and I appreciate your openness and your apology (well, and you and your work, obviously).
I'll note, though, if my memory serves me right, that nobody so far ever criticized the way I communicate upstream, or even told me to abstain from certain communication.
I wish I could say the same haha, so perhaps this was a problem on my side honestly. I do have a habit of being 'tongue in cheek' and failing to communicate that - something I said last time I wouldn't repeat. It is not intended, I promise.
As for 'abstain', it was more a British turn of phrase, meaning to say - I dispute the claim that this is an emotional thing and please don't say this if it isn't so.
But I understand that of course, you may have interpreted it as so, due to my having failed to communicate it well.
Again, I must say, text remains replete with possibilities for miscommunication, misunderstanding and it can so often be difficult to communicate one's intent.
But again of course, I apologise if I overstepped the line in any way!
That probably hurt most, now that a couple of hours passed. Nothing that a couple of beers and a bit of self-reflection on my communication style can't fix ;)
Ugh sorry, man. Not my intent. And it seems - I literally OWE YOU pints now. :) we will fix this at lsf...
Perhaps owe Kalesh some too should he be there... will budget accordingly... :P
[...]
Yeah that's a good point, but honestly if you're reading smaps, which reads the page tables, then reading /proc/$pid/pagemap and reading page tables TWICE, that seems inefficient vs. just reading /proc/$pid/maps, then reading /proc/$pid/pagemap and reading page tables once.
Right; I recently wished that we would have an interface to obtain more VMA flags without having to go through smaps
Well maybe that lends itself to the idea of adding a whole new interface in general...
An extended "maps" interface might be reasonable, that allows for exposing more things without walking the page tables. (e.g., flags)
On Thu, Feb 20, 2025 at 8:22 AM Suren Baghdasaryan surenb@google.com wrote:
On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
On 20.02.25 11:15, Lorenzo Stoakes wrote:
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
Hi everyone,
Thank you for all the discussion.
I don't find any personal issues with the communication in this thread, but I appreciate David being the objective voice of reason.
I understand it can be frustrating since you have made many efforts to communicate these tradeoffs. Unfortunately, these issues with file-backed ELF guard regions were not known for my particular use case.
Suren's use case is different from mine and this design fits perfectly for anon guard regions from the allocator. :)
So I think in conclusion, these aren't VMAs and shouldn't be treated as such; we will advertise them from pagemap for those who need to know.
-- Kalesh
On Thu, Feb 20, 2025 at 10:08:36AM -0800, Kalesh Singh wrote:
On Thu, Feb 20, 2025 at 8:22 AM Suren Baghdasaryan surenb@google.com wrote:
On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
On 20.02.25 11:15, Lorenzo Stoakes wrote:
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
Thanks Kalesh, glad there were no issues here and we have found constructive common ground! :)
It turns out implementing the pagemap side of things is _really_ straightforward, so I'll be sending a series for that shortly. Hopefully this provides some basis for whichever use cases need this information, as it is the best and least invasive place for this information at this stage.
Cheers, Lorenzo
On Fri, Feb 21, 2025 at 3:04 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Feb 20, 2025 at 10:08:36AM -0800, Kalesh Singh wrote:
On Thu, Feb 20, 2025 at 8:22 AM Suren Baghdasaryan surenb@google.com wrote:
On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
On 20.02.25 11:15, Lorenzo Stoakes wrote:
Hi Lorenzo,
Reviewed your patches, agreed that is the cleanest way to advertise this information.
Thanks, Kalesh
On 2/20/25 09:51, Lorenzo Stoakes wrote:
On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
As I said to you earlier, the _best_ we could do in smaps would be to add a flag like 'Grd' or something to indicate some part of the VMA is
In smaps we could say how many kB are covered by guard ptes; it would be in line with the current output, and given it's already scanning page tables, virtually no new overhead. But it wouldn't tell you the address ranges, of course.
guarded. But I won't do that unless somebody has an -actual use case- for it.
Right, we need to hear that first. Also I'm a bit confused about what's the issue with the existing Android apps and the proposals. Do the existing apps expect to see particular PROT_NONE regions, which would be gone (become part of the surrounding vma's) when using guards instead? Then clearly there's no way to use guards for them without breaking their assumptions.
But assuming future versions of these apps for future android versions would have to adapt to guards instead of PROT_NONE, why do we have to promise them to represent the guards, if the point is just so they adapt to the new state of smaps in their dubious security checking code, and not break anymore?
(but geez, if android apps are to use the android apis, which are java based (IIRC?), why were they ever allowed to read /proc in the first place? such a mistake, sigh)
On Thu, Feb 20, 2025 at 10:22:29AM +0100, Vlastimil Babka wrote:
Yes right. Sorry, I didn't think to suggest this, the fundamental point being that there is essentially a 'flag' or 'entry' saying 'there are guard regions here'. And indeed that could include ranges.
But it doesn't tell you the actual address ranges which is what Kalesh appears to require.
You could equally well find these, and the sizes (albeit less conveniently) with a combination of /proc/$pid/maps and adding a guard bit to /proc/$pid/pagemap.
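As a sketch of that maps-plus-pagemap consumer side (the bit position below is an assumption - whatever bit the pagemap series settles on, verify against your kernel's Documentation/admin-guide/mm/pagemap.rst - and the helper name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in a 64-bit /proc/<pid>/pagemap entry. */
#define PM_GUARD_REGION_BIT 58	/* assumed: proposed guard-region marker bit */
#define PM_PRESENT_BIT      63	/* page present (documented) */

/* Return 1 if this pagemap entry marks a guard-region PTE. */
static int pagemap_entry_is_guard(uint64_t entry)
{
	return (int)((entry >> PM_GUARD_REGION_BIT) & 1);
}

/*
 * Intended use: for each [start, end) range parsed from /proc/<pid>/maps,
 * pread() 8-byte entries from /proc/<pid>/pagemap at file offset
 * (vaddr / page_size) * 8 and test each with pagemap_entry_is_guard() -
 * one maps read plus one pagemap read, no smaps walk needed.
 */
```

This recovers both the presence and the exact page-granular address ranges of guard regions, which is the information a flag in maps/smaps alone could not convey.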
David's /proc/$pid/pagemap suggestion is excellent, avoids all the pitfalls, exposes guard regions to anybody who really really wants to know and doesn't interfere with anything else, so this is what we'll go with.
Regards, Lorenzo
linux-kselftest-mirror@lists.linaro.org