On Thu, Oct 30, 2025 at 07:31:26PM +0100, Vlastimil Babka wrote:
On 10/30/25 17:43, Lorenzo Stoakes wrote:
On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).
Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.
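For illustration, detecting guard regions this way looks roughly like the following (an untested sketch against the PAGEMAP_SCAN UAPI in <linux/fs.h>, using the PAGE_IS_GUARD category added by the commit above):

	#include <fcntl.h>
	#include <linux/fs.h>	/* PAGEMAP_SCAN, pm_scan_arg, PAGE_IS_GUARD */
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	/* Print guard region ranges within [start, end) of our own mm. */
	static int scan_guards(unsigned long start, unsigned long end)
	{
		struct page_region regions[32];
		struct pm_scan_arg arg = {
			.size = sizeof(arg),
			.start = start,
			.end = end,
			.vec = (unsigned long)regions,
			.vec_len = 32,
			.category_mask = PAGE_IS_GUARD,
			.return_mask = PAGE_IS_GUARD,
		};
		int fd = open("/proc/self/pagemap", O_RDONLY);
		int i, nr;

		if (fd < 0)
			return -1;
		/* Returns the number of page_region entries filled. */
		nr = ioctl(fd, PAGEMAP_SCAN, &arg);
		for (i = 0; i < nr; i++)
			printf("guard: %llx-%llx\n", regions[i].start,
			       regions[i].end);
		close(fd);
		return nr;
	}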
This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.
This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).
We utilise 0x800 for this flag, a bit previously used by VM_DENYWRITE (removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE")) and not reused since, which keeps the flag available to 32-bit architectures as well.
The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.
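Abbreviated, the shape of the change is (sketch only, the install path elided to the locking-relevant part):

	/* Reuse the old VM_DENYWRITE bit; available on 32-bit too. */
	#define VM_MAYBE_GUARD	0x00000800

	/*
	 * Sketch of the MADV_GUARD_INSTALL path: now entered under the mmap
	 * write lock, and vm_flags_set() additionally takes the VMA write
	 * lock via vma_start_write() before updating the flags.
	 */
	static long madvise_guard_install(struct vm_area_struct *vma,
					  unsigned long start, unsigned long end)
	{
		vm_flags_set(vma, VM_MAYBE_GUARD);
		/* ... install guard PTE markers across [start, end) ... */
		return 0;
	}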
Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly
VMA flags are not accessed atomically, so no, I don't think we can do that in any workable way.
FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).
It's not benign as we need to ensure that page tables are correctly propagated on fork.
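To spell it out: fork skips copying page tables wholesale for VMAs that look empty, roughly per vma_needs_copy() in mm/memory.c:

	/*
	 * Paraphrased from mm/memory.c: fork only descends into a VMA's page
	 * tables when something in them needs copying. Guard regions in an
	 * otherwise untouched anonymous VMA live only in the page tables, so
	 * something must force the copy here or the child silently loses the
	 * guard markers; checking the new flag here is the natural fit.
	 */
	static bool
	vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
	{
		if (userfaultfd_wp(dst_vma))
			return true;
		if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
			return true;
		if (src_vma->anon_vma)
			return true;
		return false;
	}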
Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma
I mean, I just spoke about why I don't think an entirely new (afaik) one-sided atomic VMA flag write is workable, so maybe deal with that first before proposing something new...
We probably can use the VMA read lock (actually introduced _after_ I introduced guard regions, I believe) if we do not need the write lock.
But I need to hear a compelling non-hand waving reason for introducing an entirely new mechanism like that which is going to violate an assumption we have about VMA flags everywhere...
write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
I mean, the entire point is fork behaviour, so that makes this a non-starter...
could think of something to deal with this special case...
If we implemented that I guess we'd have to atomically read the VMA flag, but a race between the two would be very problematic.
Note that we'd also have to have a new mechanism for writing a VMA flag (this is entirely novel now) because vm_flags_set() itself _literally_ has vma_start_write() in it.
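Roughly, from include/linux/mm.h (abridged):

	/* Every flag update write-locks the VMA, by design. */
	static inline void vm_flags_set(struct vm_area_struct *vma,
					vm_flags_t flags)
	{
		vma_start_write(vma);
		ACCESS_PRIVATE(vma, __vm_flags) |= flags;
	}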
We'd also have to audit any and all cases where VMA flags are accessed to ensure that this new novel method does not cause any issues.
We're also inviting others to do this kind of thing (e.g. drivers).
Seems dangerous in itself.
In any case, since you're now asking for something entirely novel, you really have to provide a compelling argument as to why we can't write lock.
We can't allow races there or we break forking.
I also don't think it's at all necessary, see below.
define what "may have a guard page"), but it sounds much better than introducing lock contention. I don't think it is reasonable to add a write lock to a feature that may be used by such things as thread stack allocation, malloc, etc.
What lock contention? It's per-VMA, so the contention is limited to the VMA in question, and only over the span of time you are setting the guard region.
Don't we always need to take the mmap write lock when grabbing a VMA write lock as well?
Yup. But at the same time, for the kind of operations that'd use this, you'd already be taking the lock anyway.
You don't hold it for long, and you won't be doing this any more often than you'd be doing other write operations. You're also not going to be holding up faults on other VMAs (those can proceed despite the mmap write lock being held), so I don't think there's any issue here.
When allocating thread stacks you'll be mapping things into memory which... take the write lock. malloc() if it goes to the kernel will also take the write lock.
But yes, good point, you're already serializing anyway. I don't think this is a big deal.
Indeed
Besides thread stacks, I'm thinking of the userspace allocators usecase
Thread stacks require the mmap/VMA write lock to establish, no?
(jemalloc etc) where guard regions were supposed to allow a cheap use-after-free protection by replacing their existing MADV_DONTNEED/MADV_FREE use with MADV_GUARD_INSTALL.
First I'm hearing of this as far as I can recall... maybe worth mentioning at the time?
Now MADV_DONTNEED/MADV_FREE use MADVISE_VMA_READ_LOCK, and guard region installation moves from the (worse, but still reasonable) MADVISE_MMAP_READ_LOCK to the heavy
I think you're hand-waving a lot here; do you have any data to back this up?
Be good to get some timings with the different locks and some indication that the contention is meaningful or that this installation is in a hot-path.
Are these users not in any way manipulating VMAs (seems odd for a malloc but anyway)? Because then you're already having to take the write lock to do so.
process_madvise() can be used to hold the lock over the whole operation more efficiently for multiple guard region installations.
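Something like this (untested sketch; MADV_GUARD_INSTALL value per <asm-generic/mman-common.h>):

	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef MADV_GUARD_INSTALL
	#define MADV_GUARD_INSTALL 102	/* <asm-generic/mman-common.h> */
	#endif

	/*
	 * Batch guard installation over several ranges in a single syscall,
	 * amortising the locking over the whole vector. The pidfd can refer
	 * to the caller's own process via pidfd_open(getpid(), 0).
	 */
	static long install_guards(int pidfd, const struct iovec *ranges,
				   size_t nr_ranges)
	{
		return syscall(SYS_process_madvise, pidfd, ranges, nr_ranges,
			       MADV_GUARD_INSTALL, 0);
	}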
MADVISE_MMAP_WRITE_LOCK. I'm afraid this might be too heavy a price for that usecase :(
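(For reference, the lock modes being compared, roughly as in mm/madvise.c:)

	/* The locking strategies the various madvise behaviours select. */
	enum madvise_lock_mode {
		MADVISE_NO_LOCK,
		MADVISE_MMAP_READ_LOCK,
		MADVISE_MMAP_WRITE_LOCK,
		MADVISE_VMA_READ_LOCK,
	};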
Based on what? Provide some data or an argument to back this up, please. Since you're essentially asking me to either throw away my series or implement a risky novel means of manipulating VMA flags, I think you're going to have to do better than hand waving.
Thanks.