On Wed, Aug 06, 2025 at 02:46:00PM -0700, Suren Baghdasaryan wrote:
On Wed, Aug 6, 2025 at 12:03 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
On Wed, Aug 06, 2025 at 08:59:04AM -0700, Suren Baghdasaryan wrote:
Utilize per-vma locks to stabilize the vma after lookup without taking mmap_lock during PROCMAP_QUERY ioctl execution. If the vma lock is contended, we fall back to mmap_lock but take it only momentarily to lock the vma, then release the mmap_lock. In the very unlikely case of vm_refcnt overflow, this fallback path will fail and the ioctl is done under mmap_lock protection.
This change is designed to reduce mmap_lock contention and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
A lot of nits but nothing's really standing out as broken, AFAICT...
fs/proc/task_mmu.c | 84 +++++++++++++++++++++++++++++++++++++--------- 1 file changed, 68 insertions(+), 16 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 45134335e086..0396315dfaee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -517,28 +517,81 @@ static int pid_maps_open(struct inode *inode, struct file *file)
		 PROCMAP_QUERY_VMA_FLAGS \
)

-static int query_vma_setup(struct mm_struct *mm)
+#ifdef CONFIG_PER_VMA_LOCK
+static int query_vma_setup(struct proc_maps_locking_ctx *lock_ctx)
{
return mmap_read_lock_killable(mm);
lock_ctx->locked_vma = NULL;
lock_ctx->mmap_locked = false;
We also do this in lock_vma_range(), seems sensible to factor out? E.g.:
static void ctx_clear_vma(struct proc_maps_locking_ctx *lock_ctx)
That name really confused me :) Maybe lock_vma_ctx_init() or something along these lines. If we can't think of a good name I would prefer to keep it as is, given it's only two lines and used only in two places.
Yeah you're right, that name isn't great (it's hard to get naming right isn't it? :P)
I think it's worth separating out just because I find this:
helper_struct->field1 = X;
helper_struct->field2 = Y;

Open-coding is fiddly and prone to error; what if you add a new field later, etc.

It's also semantically useful to know that updating one field implies the update of another.
{
	lock_ctx->locked_vma = NULL;
	lock_ctx->mmap_locked = false;
}
return 0;
}
-static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+static void query_vma_teardown(struct proc_maps_locking_ctx *lock_ctx)
{
mmap_read_unlock(mm);
if (lock_ctx->mmap_locked)
mmap_read_unlock(lock_ctx->mm);
Maybe worth a comment as to why we leave lock_ctx->mmap_locked set here?
Sure. The reason is that this is the teardown stage and lock_ctx won't be used anymore. I guess I could reset it just to keep lock_ctx consistent instead of adding a comment. Will do that.
Thanks makes sense.
else
unlock_vma(lock_ctx);
Should have said on 2/3, but I wonder if we should prefix with ctx_, as 'unlock_vma()' and 'lock_vma()' seem like core functions... esp. since we have vma_start_read/write() rather than functions that reference locking.
So - ctx_unlock_vma() and ctx_lock_vma() or unlock_ctx_vma() / lock_ctx_vma()?
unlock_ctx_vma() / lock_ctx_vma() sounds good to me.
+}
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_locking_ctx *lock_ctx,
unsigned long addr)
+{
struct vm_area_struct *vma;
struct vma_iterator vmi;
if (lock_ctx->mmap_locked)
return find_vma(lock_ctx->mm, addr);
unlock_vma(lock_ctx);
rcu_read_lock();
vma_iter_init(&vmi, lock_ctx->mm, addr);
vma = lock_next_vma(lock_ctx->mm, &vmi, addr);
rcu_read_unlock();
I think a comment at the top of this block would be useful, something like 'We unlock any previously locked VMA and find the next under RCU'.
Ack.
if (!IS_ERR_OR_NULL(vma)) {
Is the NULL bit here really necessary? presumably lock_ctx->locked_vma is expected to be NULL already, so we're not overwriting anything here.
In fact we could get rid of the horrid if/else here with a guard clause like:
if (!IS_ERR(vma) || PTR_ERR(vma) != -EAGAIN) return vma;
We still need to assign lock_ctx->locked_vma when !IS_ERR(vma) before we return the vma, so the lines about would not be correct. I can change it to:
if (!vma)
	return NULL;

if (!IS_ERR(vma)) {
	lock_ctx->locked_vma = vma;
	return vma;
}

if (PTR_ERR(vma) == -EAGAIN) {
	/* Fallback to mmap_lock on vma->vm_refcnt overflow */
	...
}
return vma;
I think that would be the equivalent of what I currently have. Would you prefer that?
Yeah sorry, sort of sketching this out quickly here.
Yeah what you suggest looks good thanks!
(the !IS_ERR() bit is probably a bit redundant but makes things clearer vs. just the PTR_ERR() thing)
Then do the rest below.
lock_ctx->locked_vma = vma;
} else if (PTR_ERR(vma) == -EAGAIN) {
/* Fallback to mmap_lock on vma->vm_refcnt overflow */
mmap_read_lock(lock_ctx->mm);
vma = find_vma(lock_ctx->mm, addr);
lock_ctx->mmap_locked = true;
Kinda sucks we have two separate ways of doing fallback now, this open-coded thing and fallback_to_mmap_lock().
Sort of hard to combine since we have subtly different semantics - the RCU read lock is being held in the /proc/$pid/maps case, but here we've released it already.
Yeah, plus that one uses iterators and this one doesn't... I don't think it's worth trying to shoehorn them together given that the code is quite short.
Yeah right.
I sort of wish we could have things be a little more consistent across the two, but I think that would need to be part of a refactoring of this code in general, so is not really relevant here.
So leave it as-is for now that's fine!
}
return vma;
+}
+#else /* CONFIG_PER_VMA_LOCK */
+static int query_vma_setup(struct proc_maps_locking_ctx *lock_ctx)
+{
return mmap_read_lock_killable(lock_ctx->mm);
}
-static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+static void query_vma_teardown(struct proc_maps_locking_ctx *lock_ctx)
{
return find_vma(mm, addr);
mmap_read_unlock(lock_ctx->mm);
}
-static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_locking_ctx *lock_ctx,
unsigned long addr)
+{
return find_vma(lock_ctx->mm, addr);
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+static struct vm_area_struct *query_matching_vma(struct proc_maps_locking_ctx *lock_ctx,
						 unsigned long addr, u32 flags)
{
	struct vm_area_struct *vma;
next_vma:
vma = query_vma_find_by_addr(mm, addr);
vma = query_vma_find_by_addr(lock_ctx, addr);
if (IS_ERR(vma))
return vma;
if (!vma)
	goto no_vma;
@@ -579,11 +632,11 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
	return ERR_PTR(-ENOENT);
}
-static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
+static int do_procmap_query(struct mm_struct *mm, void __user *uarg)
{
struct proc_maps_locking_ctx lock_ctx = { .mm = mm };
struct procmap_query karg;
struct vm_area_struct *vma;
struct mm_struct *mm;
const char *name = NULL;
char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
__u64 usize;
@@ -610,17 +663,16 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
	if (!!karg.build_id_size != !!karg.build_id_addr)
		return -EINVAL;
mm = priv->lock_ctx.mm;
if (!mm || !mmget_not_zero(mm))
	return -ESRCH;
err = query_vma_setup(mm);
err = query_vma_setup(&lock_ctx);
if (err) {
	mmput(mm);
	return err;
}
vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
vma = query_matching_vma(&lock_ctx, karg.query_addr, karg.query_flags);
if (IS_ERR(vma)) {
	err = PTR_ERR(vma);
	vma = NULL;
@@ -705,7 +757,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
}
/* unlock vma or mmap_lock, and put mm_struct before copying data to user */
query_vma_teardown(mm, vma);
query_vma_teardown(&lock_ctx);
mmput(mm);

if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr),
@@ -725,7 +777,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
	return 0;
out:
query_vma_teardown(mm, vma);
query_vma_teardown(&lock_ctx);
mmput(mm);
kfree(name_buf);
return err;
@@ -738,7 +790,7 @@ static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned l
switch (cmd) {
case PROCMAP_QUERY:
return do_procmap_query(priv, (void __user *)arg);
return do_procmap_query(priv->lock_ctx.mm, (void __user *)arg);
OK this confused me until I worked it through.
We set priv->lock_ctx.mm in:
pid_maps_open() -> do_maps_open() -> proc_maps_open()
Which it gets from proc_mem_open() which figures out the mm.
Maybe one for 2/3, but it'd be nice to have a comment saying something about how this is set, since it being part of lock_ctx makes it seem like it's something that would be set elsewhere.
Since we have fallback stuff and want to thread through this new lock context type I guess it makes sense to put it here, but given that's the case, let's maybe just add a comment here to clarify.
Ok, something like "lock_ctx.mm is set during file open operation" ?
Yeah that's fine thanks!
default:
	return -ENOIOCTLCMD;
}
-- 2.50.1.565.gc32cd1483b-goog
Cheers, Lorenzo