The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
---
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);
 			goto retry;
 		}
On 11/11/25 22:56, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
The commit is in 6.18-rc1 so we don't need Cc: stable, but it's mm-hotfixes material that must go to Linus before 6.18.
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);
 			goto retry;
 		}
On Tue, Nov 11, 2025 at 2:18 PM Vlastimil Babka vbabka@suse.cz wrote:
On 11/11/25 22:56, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
Ah, good catch. I didn't realize the state is RCU protected.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Hmm. The above paragraph does not sound right to me, unless I completely misunderstood it. Before 0b16f8bed19c we would keep RCU lock up until the end of lock_vma_under_rcu(), so retries could still happen but we were not dropping the RCU lock while doing that. After 0b16f8bed19c we drop RCU lock if vma_start_read() fails, so retrying after such failure becomes unsafe. So, if you agree with my assessment then I suggest changing it to:
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the retry after vma_start_read() failure was happening under the same RCU lock. However, now that the read lock is dropped on failure, we may use a freed maple tree node cached in the maple state on retry.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
The commit is in 6.18-rc1 so we don't need Cc: stable, but it's mm-hotfixes material that must go to Linus before 6.18.
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
With the changelog text sorted out.
Reviewed-by: Suren Baghdasaryan surenb@google.com
Thanks!
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);
 			goto retry;
 		}
* Suren Baghdasaryan surenb@google.com [251111 19:11]:
On Tue, Nov 11, 2025 at 2:18 PM Vlastimil Babka vbabka@suse.cz wrote:
On 11/11/25 22:56, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
Ah, good catch. I didn't realize the state is RCU protected.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Hmm. The above paragraph does not sound right to me, unless I completely misunderstood it. Before 0b16f8bed19c we would keep RCU lock up until the end of lock_vma_under_rcu(),
Ah.. usually, yes? But.. if (unlikely(vma->vm_mm != mm)), then we'd drop and reacquire the rcu read lock, but return NULL. This was fine because we wouldn't return -EAGAIN and so the read lock was toggled.. but it didn't matter since we wouldn't reuse the maple state.
I wanted to make it clear that the dropping/reacquiring of the rcu lock prior to 0b16f8bed19c does not mean we have to backport the fix further.. which I failed to do, I guess.
so retries could still happen but we were not dropping the RCU lock while doing that. After 0b16f8bed19c we drop RCU lock if vma_start_read() fails, so retrying after such failure becomes unsafe. So, if you agree with my assessment then I suggest changing it to:
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the retry after vma_start_read() failure was happening under the same RCU lock. However, now that the read lock is dropped on failure, we may use a freed maple tree node cached in the maple state on retry.
This is also true, but fails to capture the fact that returning NULL after toggling the lock prior to 0b16f8bed19c is okay.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
The commit is in 6.18-rc1 so we don't need Cc: stable, but it's mm-hotfixes material that must go to Linus before 6.18.
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
With the changelog text sorted out.
Reviewed-by: Suren Baghdasaryan surenb@google.com
Thanks!
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);
 			goto retry;
 		}
On Tue, Nov 11, 2025 at 4:20 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Suren Baghdasaryan surenb@google.com [251111 19:11]:
On Tue, Nov 11, 2025 at 2:18 PM Vlastimil Babka vbabka@suse.cz wrote:
On 11/11/25 22:56, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
Ah, good catch. I didn't realize the state is RCU protected.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Hmm. The above paragraph does not sound right to me, unless I completely misunderstood it. Before 0b16f8bed19c we would keep RCU lock up until the end of lock_vma_under_rcu(),
Ah.. usually, yes? But.. if (unlikely(vma->vm_mm != mm)), then we'd drop and reacquire the rcu read lock, but return NULL. This was fine because we wouldn't return -EAGAIN and so the read lock was toggled.. but it didn't matter since we wouldn't reuse the maple state.
I wanted to make it clear that the dropping/reacquiring of the rcu lock prior to 0b16f8bed19c does not mean we have to backport the fix further.. which I failed to do, I guess.
Ah, ok, now I get it. You were talking about vma_start_read() and RCU being dropped there... Would this explanation be a bit better?
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), vma_start_read() would drop rcu read lock and return NULL, so the retry would not have happened. However, now that vma_start_read() drops rcu read lock on failure followed by a retry, we may end up using a freed maple tree node cached in the maple state.
so retries could still happen but we were not dropping the RCU lock while doing that. After 0b16f8bed19c we drop RCU lock if vma_start_read() fails, so retrying after such failure becomes unsafe. So, if you agree with my assessment then I suggest changing it to:
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the retry after vma_start_read() failure was happening under the same RCU lock. However, now that the read lock is dropped on failure, we may use a freed maple tree node cached in the maple state on retry.
This is also true, but fails to capture the fact that returning NULL after toggling the lock prior to 0b16f8bed19c is okay.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
The commit is in 6.18-rc1 so we don't need Cc: stable, but it's mm-hotfixes material that must go to Linus before 6.18.
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
With the changelog text sorted out.
Reviewed-by: Suren Baghdasaryan surenb@google.com
Thanks!
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);
 			goto retry;
 		}
* Suren Baghdasaryan surenb@google.com [251111 19:45]:
On Tue, Nov 11, 2025 at 4:20 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Suren Baghdasaryan surenb@google.com [251111 19:11]:
On Tue, Nov 11, 2025 at 2:18 PM Vlastimil Babka vbabka@suse.cz wrote:
On 11/11/25 22:56, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
Ah, good catch. I didn't realize the state is RCU protected.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Hmm. The above paragraph does not sound right to me, unless I completely misunderstood it. Before 0b16f8bed19c we would keep RCU lock up until the end of lock_vma_under_rcu(),
Ah.. usually, yes? But.. if (unlikely(vma->vm_mm != mm)), then we'd drop and reacquire the rcu read lock, but return NULL. This was fine because we wouldn't return -EAGAIN and so the read lock was toggled.. but it didn't matter since we wouldn't reuse the maple state.
I wanted to make it clear that the dropping/reacquiring of the rcu lock prior to 0b16f8bed19c does not mean we have to backport the fix further.. which I failed to do, I guess.
Ah, ok, now I get it. You were talking about vma_start_read() and RCU being dropped there... Would this explanation be a bit better?
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), vma_start_read() would drop rcu read lock and return NULL, so the retry would not have happened. However, now that vma_start_read() drops rcu read lock on failure followed by a retry, we may end up using a freed maple tree node cached in the maple state.
Yes, sounds good.
Andrew, can you make this change and also drop Cc stable tag?
This needs to be a hot fix, as Vlastimil said earlier.
so retries could still happen but we were not dropping the RCU lock while doing that. After 0b16f8bed19c we drop RCU lock if vma_start_read() fails, so retrying after such failure becomes unsafe. So, if you agree with my assessment then I suggest changing it to:
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the retry after vma_start_read() failure was happening under the same RCU lock. However, now that the read lock is dropped on failure, we may use a freed maple tree node cached in the maple state on retry.
This is also true, but fails to capture the fact that returning NULL after toggling the lock prior to 0b16f8bed19c is okay.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
The commit is in 6.18-rc1 so we don't need Cc: stable, but it's mm-hotfixes material that must go to Linus before 6.18.
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz
With the changelog text sorted out.
Reviewed-by: Suren Baghdasaryan surenb@google.com
Thanks!
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);
 			goto retry;
 		}
On Tue, 11 Nov 2025 21:18:19 -0500 "Liam R. Howlett" Liam.Howlett@oracle.com wrote:
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), vma_start_read() would drop rcu read lock and return NULL, so the retry would not have happened. However, now that vma_start_read() drops rcu read lock on failure followed by a retry, we may end up using a freed maple tree node cached in the maple state.
Yes, sounds good.
Andrew, can you make this change and also drop Cc stable tag?
Done.
This needs to be a hot fix, as Vlastimil said earlier.
Yup.
+cc Paul for hare-brained idea
On Tue, Nov 11, 2025 at 04:56:05PM -0500, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I guess a mas_reset() would keep mas->index, last where they were, which also wouldn't be right, would it?
In which case a mas_reset() is _also_ not a valid operation if invoked after dropping/reacquiring the RCU lock right?
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
The reasoning seems sensible & LGTM, so:
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);
I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Not sure if that's feasible, maybe Paul can comment? :)
I think Vlastimil made a similar kind of comment possibly off-list.
Would there be much overhead if we just did this:
retry:
	rcu_read_lock();
	mas_set(&mas, address);
	vma = mas_walk(&mas);
The retry path will be super rare, and I think the compiler should be smart enough to not assign index, last twice and this would protect us.
Then we could have some function like:
mas_walk_from(&mas, address);
That did this.
Or, since we _literally_ only use mas for this one walk, have a simple function like:
/**
 * ...
 * Performs a single walk of a maple tree to the specified address,
 * returning a pointer to an entry if found, or NULL if not.
 * ...
 */
static void *mt_walk(struct maple_tree *mt, unsigned long address)
{
	MA_STATE(mas, mt, address, address);

	lockdep_assert_in_rcu_read_lock();
	return mas_walk(&mas);
}
That literally just does the walk as a one-off?
 			goto retry;
 		}
--
2.47.2
Cheers, Lorenzo
* Lorenzo Stoakes lorenzo.stoakes@oracle.com [251112 10:06]:
+cc Paul for hare-brained idea
On Tue, Nov 11, 2025 at 04:56:05PM -0500, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I guess a mas_reset() would keep mas->index, last where they were, which also wouldn't be right, would it?
mas->index and mas->last are updated to the values of the entry you found. So if you ran a mas_find(), the operation will continue until the limit is hit, or if you did a next/prev the address would be lost.
This is why I say mas_set() is safer, because it will ensure we return to the same situation we started from, regardless of the operation.
In which case a mas_reset() is _also_ not a valid operation if invoked after dropping/reacquiring the RCU lock right?
In this case we did a mas_walk(), so a mas_reset() would be fine here.
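To make that concrete, a minimal sketch of the two options after a plain walk (illustrative only, assuming the mm's tree and an address as in lock_vma_under_rcu()):

	MA_STATE(mas, &mm->mm_mt, address, address);

	rcu_read_lock();
	vma = mas_walk(&mas);	/* mas now caches the node it landed in */
	rcu_read_unlock();	/* the cached node may be freed from here on */

	rcu_read_lock();
	mas_set(&mas, address);	/* node back to MA_START, range back to address */
	vma = mas_walk(&mas);	/* next walk starts over from the root */
	rcu_read_unlock();

A mas_reset() instead of the mas_set() would also discard the stale node but keep mas->index/mas->last, which after a plain walk still contain the address, so the re-walk finds the same range; after a find/next/prev the range has moved on, which is why mas_set() is the safer habit.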
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
The reasoning seems sensible & LGTM, so:
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Not sure if that's feasible, maybe Paul can comment? :)
I think Vlastimil made a similar kind of comment possibly off-list.
Would there be much overhead if we just did this:
retry:
	rcu_read_lock();
	mas_set(&mas, address);
	vma = mas_walk(&mas);
The retry path will be super rare, and I think the compiler should be smart enough to not assign index, last twice and this would protect us.
This is what existed before the 0b16f8bed19c change, which was introduced to try and avoid exactly these issues.
I think there's no real way to avoid the complications of an rcu data structure. We've tried to make the interface as simple as possible, and in doing so, have hidden the implementation details of what happens in the 'state' - which is where all these troubles arise.
I can add more documentation around the locking and maple state, hopefully people will find it useful and not just exist to go out of date.
Then we could have some function like:
mas_walk_from(&mas, address);
That did this.
Or, since we _literally_ only use mas for this one walk, have a simple function like:
/**
 * ...
 * Performs a single walk of a maple tree to the specified address,
 * returning a pointer to an entry if found, or NULL if not.
 * ...
 */
static void *mt_walk(struct maple_tree *mt, unsigned long address)
{
	MA_STATE(mas, mt, address, address);

	lockdep_assert_in_rcu_read_lock();
	return mas_walk(&mas);
}
That literally just does the walk as a one-off?
You have described mtree_load(). The mas_ interfaces are designed for people to handle the locks themselves. The mtree_ interface handles the locking for people.
I don't think we are using the simple interface because we are using the rcu read lock for the vma as well.
If you want to use the simple mtree_load() interface here, I think we'll need two rcu_read_lock()/unlock() calls (nesting is supported fwiu). I don't think we want to nest the locks in this call path though.
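For comparison, the simple interface reduces this to a one-liner, since mtree_load() wraps the walk in its own rcu_read_lock()/unlock() pair (sketch of usage only, not a proposed change here):

	struct vm_area_struct *vma;

	/* mtree_load() takes and drops the rcu read lock around the walk. */
	vma = mtree_load(&mm->mm_mt, address);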
...
Thanks, Liam
On Wed, Nov 12, 2025 at 11:10:30AM -0500, Liam R. Howlett wrote:
- Lorenzo Stoakes lorenzo.stoakes@oracle.com [251112 10:06]:
+cc Paul for hare-brained idea
On Tue, Nov 11, 2025 at 04:56:05PM -0500, Liam R. Howlett wrote:
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed.
The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I guess a mas_reset() would keep mas->index, last where they were, which also wouldn't be right, would it?
mas->index and mas->last are updated to the values of the entry you found. So if you ran a mas_find(), the operation will continue until the limit is hit, or if you did a next/prev the address would be lost.
I guess in _this_ specific case since we're specifically looking for a VMA at the address encoded in the index we'd be ok?
This is why I say mas_set() is safer, because it will ensure we return to the same situation we started from, regardless of the operation.
Right yes.
In which case a mas_reset() is _also_ not a valid operation if invoked after dropping/reacquiring the RCU lock right?
In this case we did a mas_walk(), so a mas_reset() would be fine here.
Right yeah.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), the rcu read lock was dropped and NULL was returned, so the retry would not have happened. However, now that the read lock is dropped regardless of the return, we may use a freed maple tree node cached in the maple state on retry.
Cc: Suren Baghdasaryan surenb@google.com
Cc: stable@vger.kernel.org
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
The reasoning seems sensible & LGTM, so:
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
 mm/mmap_lock.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 39f341caf32c0..f2532af6208c0 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Not sure if that's feasible, maybe Paul can comment? :)
I think Vlastimil made a similar kind of comment possibly off-list.
Would there be much overhead if we just did this:
retry:
	rcu_read_lock();
	mas_set(&mas, address);
	vma = mas_walk(&mas);
The retry path will be super rare, and I think the compiler should be smart enough to not assign index, last twice and this would protect us.
This is what existed before the 0b16f8bed19c change, which was introduced to try and avoid exactly these issues.
I think there's no real way to avoid the complications of an rcu data structure. We've tried to make the interface as simple as possible, and in doing so, have hidden the implementation details of what happens in the 'state' - which is where all these troubles arise.
I can add more documentation around the locking and maple state, hopefully people will find it useful and not just exist to go out of date.
Discussed more in reply to Matthew.
Yeah I agree that we're doing a trade-off here between abstraction and performance overall.
Then we could have some function like:
mas_walk_from(&mas, address);
That did this.
Or, since we _literally_ only use mas for this one walk, have a simple function like:
/**
 * ...
 * Performs a single walk of a maple tree to the specified address,
 * returning a pointer to an entry if found, or NULL if not.
 * ...
 */
static void *mt_walk(struct maple_tree *mt, unsigned long address)
{
	MA_STATE(mas, mt, address, address);

	lockdep_assert_in_rcu_read_lock();
	return mas_walk(&mas);
}
That literally just does the walk as a one-off?
You have described mtree_load(). The mas_ interfaces are designed for people to handle the locks themselves. The mtree_ interface handles the locking for people.
I don't think we are using the simple interface because we are using the rcu read lock for the vma as well.
Yup I don't think we can use this as-is.
What I'm proposing I guess would be __mtree_load() or something with the locking taken out.
I have suggested an alternative approach in reply to Matthew anyway.
If you want to use the simple mtree_load() interface here, I think we'll need two rcu_read_lock()/unlock() calls (nesting is supported fwiu). I don't think we want to nest the locks in this call path though.
And we don't want to just unconditionally release RCU read lock after we're done either so I think that's out.
...
Thanks, Liam
Cheers, Lorenzo
On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I mean, this really isn't an RCU thing. This is also bad:
	spin_lock(a);
	p = *q;
	spin_unlock(a);

	spin_lock(a);
	b = *p;
p could have been freed while you didn't hold lock a. Detecting this kind of thing needs compiler assistance (ie Rust) to let you know that you don't have the right to do that any more.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I think that's a separate problem.
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Dropping and reacquiring the RCU read lock should have been a big red flag. I didn't have time to review the patches, but if I had, I would have suggested passing the mas down to the routine that drops the rcu read lock so it can be invalidated before dropping the readlock.
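Something like this, i.e. invalidate the state while the lock protecting it is still held (hypothetical helper, name and shape made up for illustration):

	static void mas_drop_rcu(struct ma_state *mas, unsigned long address)
	{
		mas_set(mas, address);	/* state no longer references tree nodes */
		rcu_read_unlock();	/* only now is it safe to let nodes go */
	}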
On Thu, Nov 13, 2025 at 12:04:19AM +0000, Matthew Wilcox wrote:
On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I mean, this really isn't an RCU thing. This is also bad:
	spin_lock(a);
	p = *q;
	spin_unlock(a);

	spin_lock(a);
	b = *p;
p could have been freed while you didn't hold lock a. Detecting this kind of thing needs compiler assistance (ie Rust) to let you know that you don't have the right to do that any more.
While in no way denigrating Rust's compile-time detection of this sort of thing, use of KASAN combined with CONFIG_RCU_STRICT_GRACE_PERIOD=y (which restricts you to four CPUs) can sometimes help.
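For reference, that debug combination is roughly the following config fragment (CONFIG_RCU_STRICT_GRACE_PERIOD sits behind RCU_EXPERT and, as noted, limits the build to four CPUs):

	CONFIG_KASAN=y
	CONFIG_RCU_EXPERT=y
	CONFIG_RCU_STRICT_GRACE_PERIOD=y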
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I think that's a separate problem.
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Dropping and reacquiring the RCU read lock should have been a big red flag. I didn't have time to review the patches, but if I had, I would have suggested passing the mas down to the routine that drops the rcu read lock so it can be invalidated before dropping the readlock.
There has been some academic efforts to check for RCU-protected pointers leaking from one RCU read-side critical section to another, but nothing useful has come from this. :-/
But rcu_pointer_handoff() and unrcu_pointer() are intended not only for documentation, but also to suppress the inevitable false positives should anyone figure out how to detect leaking of RCU-protected pointers.
Thanx, Paul
On Wed, Nov 12, 2025 at 05:27:22PM -0800, Paul E. McKenney wrote:
On Thu, Nov 13, 2025 at 12:04:19AM +0000, Matthew Wilcox wrote:
On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I mean, this really isn't an RCU thing. This is also bad:
	spin_lock(a);
	p = *q;
	spin_unlock(a);

	spin_lock(a);
	b = *p;
p could have been freed while you didn't hold lock a. Detecting this kind of thing needs compiler assistance (ie Rust) to let you know that you don't have the right to do that any more.
While in no way denigrating Rust's compile-time detection of this sort of thing, use of KASAN combined with CONFIG_RCU_STRICT_GRACE_PERIOD=y (which restricts you to four CPUs) can sometimes help.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I think that's a separate problem.
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Dropping and reacquiring the RCU read lock should have been a big red flag. I didn't have time to review the patches, but if I had, I would have suggested passing the mas down to the routine that drops the rcu read lock so it can be invalidated before dropping the readlock.
There has been some academic efforts to check for RCU-protected pointers leaking from one RCU read-side critical section to another, but nothing useful has come from this. :-/
Ugh a pity. I was hoping we could do (in debug mode only obv) something absolutely roughly like:
On init:
mas->rcu_critical_section = rcu_get_critical_section_blah();
...
On walk:
VM_WARN_ON(rcu_critical_section_blah() != mas->rcu_critical_section);
But sounds like that isn't feasible.
I always like the idea of us having debug stuff that helps highlight dumb mistakes very quickly, no matter how silly they might be :)
But rcu_pointer_handoff() and unrcu_pointer() are intended not only for documentation, but also to suppress the inevitable false positives should anyone figure out how to detect leaking of RCU-protected pointers.
Thanx, Paul
Cheers, Lorenzo
On Thu, Nov 13, 2025 at 12:04:19AM +0000, Matthew Wilcox wrote:
On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I mean, this really isn't an RCU thing. This is also bad:
	spin_lock(a);
	p = *q;
	spin_unlock(a);

	spin_lock(a);
	b = *p;
p could have been freed while you didn't hold lock a. Detecting this kind of thing needs compiler assistance (ie Rust) to let you know that you don't have the right to do that any more.
Right but in your example the use of the pointers is _really clear_. In the mas situation, the pointers are embedded in the helper struct, there's a state machine, etc. so it's harder to catch this.
There's already a state machine embedded in it, and I think the confusing bit, at least for me, was a line of thinking like - 'oh there's all this logic that figures out what's going on and if there's an error rewalks and etc. - so it'll handle this case too'.
Obviously, very much wrong.
Generally I wonder if, when dealing with VMAs, we shouldn't just use the VMA iterator anyway? Whenever I see 'naked' mas stuff I'm always a little confused as to why.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I think that's a separate problem.
Sure but I think there's a broader issue around confusion arising around mas state and when we need to do one thing or another, there were a number of issues that arose in the past where people got confused about what to do with vma iterator state.
I think it's a difficult problem - we're both trying to abstract stuff here but also retain performance, which is a trade-off.
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Dropping and reacquiring the RCU read lock should have been a big red flag. I didn't have time to review the patches, but if I had, I would
I think if you have 3 mm developers who all work with VMAs all the time missing this, that's a signal that something is confusing here :)
So the issue is we all thought dropping the RCU lock would be OK, and mas_walk(...) would 'somehow' do the right thing. See above for why I think perhaps that happened.
have suggested passing the mas down to the routine that drops the rcu read lock so it can be invalidated before dropping the readlock.
This would require changing vma_start_read(), which is called by both lock_vma_under_rcu() and lock_next_vma().
We could make them consistent and have lock_vma_under_rcu() do something like:
VMA_ITERATOR(vmi, mm, address);
...
	rcu_read_lock();
	vma = vma_start_read(&vmi);
And have vma_start_read() handle the:
	if (!vma) {
		rcu_read_unlock();
		goto inval;
	}
Case we have in lock_vma_under_rcu() now.
We'd need to keep:
	vma = vma_next(vmi);
	if (!vma)
		return NULL;
In lock_next_vma().
Then you could have:
err:
	/* Reset so state is valid if reused. */
	vma_iter_reset(vmi);
	rcu_read_unlock();
In vma_start_read().
Assuming any/all of this is correct :)
I _think_ based on what Liam said in other sub-thread the reset should work here (perhaps not quite maximally efficient).
If we risk perhaps relying on the optimiser to help us or hope no real perf impact perhaps we could do both by also having the 'set address' bit happen in lock_vma_under_rcu() also e.g.:
VMA_ITERATOR(vmi, mm, address);
...
retry:
	rcu_read_lock();
	vma_iter_set(&vmi, address);
	vma = vma_start_read(&vmi);
Let me know if any of this is sane... :)
Cheers, Lorenzo
* Lorenzo Stoakes lorenzo.stoakes@oracle.com [251113 05:45]:
On Thu, Nov 13, 2025 at 12:04:19AM +0000, Matthew Wilcox wrote:
On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I mean, this really isn't an RCU thing. This is also bad:
	spin_lock(a);
	p = *q;
	spin_unlock(a);

	spin_lock(a);
	b = *p;
p could have been freed while you didn't hold lock a. Detecting this kind of thing needs compiler assistance (ie Rust) to let you know that you don't have the right to do that any more.
Right but in your example the use of the pointers is _really clear_. In the mas situation, the pointers are embedded in the helper struct, there's a state machine, etc. so it's harder to catch this.
We could modify the above example to use a helper struct and the same problem would arise...
There's already a state machine embedded in it, and I think the confusing bit, at least for me, was a line of thinking like - 'oh there's all this logic that figures out what's going on and if there's an error rewalks and etc. - so it'll handle this case too'.
Obviously, very much wrong.
Generally I wonder if, when dealing with VMAs, we shouldn't just use the VMA iterator anyway? Whenever I see 'naked' mas stuff I'm always a little confused as to why.
I am not sure why this was left as maple state either. But translating it to the vma iterator would result in the same bug. The locking story would be the same. There isn't much to the vma iterator, it will just call the mas_ functions for you.
In other code, the maple state is used when we need to do special operations that would be the single user of a vma iterator function. I suspect this was the case here at some point.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I think that's a separate problem.
Sure but I think there's a broader issue around confusion arising around mas state and when we need to do one thing or another, there were a number of issues that arose in the past where people got confused about what to do with vma iterator state.
I think it's a difficult problem - we're both trying to abstract stuff here but also retain performance, which is a trade-off.
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Dropping and reacquiring the RCU read lock should have been a big red flag. I didn't have time to review the patches, but if I had, I would
I think if you have 3 mm developers who all work with VMAs all the time missing this, that's a signal that something is confusing here :)
So the issue is we all thought dropping the RCU lock would be OK, and mas_walk(...) would 'somehow' do the right thing. See above for why I think perhaps that happened.
But again, I feel like we could replace the maple state with any helper struct and this could also be missed.
I'm not sure there's an easy way to remove this class of errors without changing the basic tooling to be rust or the like...
vma_start_read() is inherently complicated because of what it does without taking the mmap lock. Dealing with a potential failure/retry is equally messy.
The locking is impossible to do in a clean way since one caller does not take the rcu read lock itself, but may return without it held in many scenarios.
have suggested passing the mas down to the routine that drops the rcu read lock so it can be invalidated before dropping the readlock.
This would require changing vma_start_read(), which is called by both lock_vma_under_rcu() and lock_next_vma().
We could make them consistent and have lock_vma_under_rcu() do something like:
VMA_ITERATOR(vmi, mm, address);
...
	rcu_read_lock();
	vma = vma_start_read(&vmi);
And have vma_start_read() handle the:
	if (!vma) {
		rcu_read_unlock();
		goto inval;
	}
Case we have in lock_vma_under_rcu() now.
We'd need to keep:
	vma = vma_next(vmi);
	if (!vma)
		return NULL;
In lock_next_vma().
Then you could have:
err:
	/* Reset so state is valid if reused. */
	vma_iter_reset(vmi);
	rcu_read_unlock();
In vma_start_read().
Assuming any/all of this is correct :)
I _think_ based on what Liam said in other sub-thread the reset should work here (perhaps not quite maximally efficient).
No, don't do that. If you want to go this route, use vma_iter_set() in the error label to set the address. Which means that we'll need to pass the vma iterator and the address into vma_start_read() from both callers.
And may as well add this in vma_start_read()..
err_unstable:
	vma_iter_set(&vmi, address);
If we risk perhaps relying on the optimiser to help us or hope no real perf impact perhaps we could do both by also having the 'set address' bit happen in lock_vma_under_rcu() also e.g.:
VMA_ITERATOR(vmi, mm, address);
...
retry:
	rcu_read_lock();
	vma_iter_set(&vmi, address);
	vma = vma_start_read(&vmi);
lock_next_vma() also calls vma_iter_set() in the -EAGAIN case, so passing both through might make more sense.
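Taken together, the caller side would then look roughly like this (sketch only; the vma_start_read() signature taking the iterator and address is the hypothetical part):

	VMA_ITERATOR(vmi, mm, address);
	struct vm_area_struct *vma;

retry:
	rcu_read_lock();
	vma = vma_start_read(&vmi, address);	/* hypothetical signature */
	if (PTR_ERR(vma) == -EAGAIN)
		goto retry;	/* iterator was re-set before the unlock */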
Let me know if any of this is sane... :)
The locking on this function makes it virtually impossible to reuse for anything beyond the two users it has today. Passing the iterator down might remind people of what to do if the function itself changes. It does seem like the right way of handling this, since we can't clean up the locking.
Thanks, Liam
On Thu, Nov 13, 2025 at 12:28:58PM -0500, Liam R. Howlett wrote:
- Lorenzo Stoakes lorenzo.stoakes@oracle.com [251113 05:45]:
On Thu, Nov 13, 2025 at 12:04:19AM +0000, Matthew Wilcox wrote:
On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I mean, this really isn't an RCU thing. This is also bad:
	spin_lock(a);
	p = *q;
	spin_unlock(a);

	spin_lock(a);
	b = *p;
p could have been freed while you didn't hold lock a. Detecting this kind of thing needs compiler assistance (ie Rust) to let you know that you don't have the right to do that any more.
Right but in your example the use of the pointers is _really clear_. In the mas situation, the pointers are embedded in the helper struct, there's a state machine, etc. so it's harder to catch this.
We could modify the above example to use a helper struct and the same problem would arise...
I disagree.
It's a helper struct with a state machine, manipulated by API functions. In fact it turns out we _can_ recover this state even after dropping/reacquiring the lock by calling the appropriate API functions to do so.
You manipulate this state via mas_xxx() commands, and in fact we resolve the lock issue by issuing the correct one.
So it's a problem of abstraction I think.
HOWEVER, clearly the crux of the problem as you say elsewhere is that we are using the 'advanced' API and handling our own lock, which leaves us open to mistakes like this.
My thought process here is around 'can we avoid a bunch of mm developers all making the same mistake again'.
In this case I mean - it's a unique situation, in some already _very_ hairy VMA lock code, that used to be much simpler (*grumble grumble*). We're paying the price for rolling our own mechanism here in general.
But I think more broadly, perhaps there's things we can do here to help. You need to be able to go on vacation without having to worry about what mistakes we might make with this stuff :P
There's already a state machine embedded in it, and I think the confusing bit, at least for me, was a line of thinking like - 'oh there's all this logic that figures out what's going on and if there's an error rewalks and etc. - so it'll handle this case too'.
Obviously, very much wrong.
Generally I wonder if, when dealing with VMAs, we shouldn't just use the VMA iterator anyway? Whenever I see 'naked' mas stuff I'm always a little confused as to why.
I am not sure why this was left as maple state either. But translating it to the vma iterator would result in the same bug. The locking story would be the same. There isn't much to the vma iterator, it will just call the mas_ functions for you.
Yes I understand it wouldn't fix the bug :) I'm saying this as an aside, and it leads into the suggestion I make below.
In other code, the maple state is used when we need to do special operations that would be the single user of a vma iterator function. I suspect this was the case here at some point.
Right yes. And perhaps so.
I think one source of confusion for me with maple tree operations is - what to do if we are in a position where some kind of reset is needed?
So even if I'd realised 'aha we need to reset this' it wouldn't be obvious to me that we ought to set to the address.
I think that's a separate problem.
Sure but I think there's a broader issue around confusion arising around mas state and when we need to do one thing or another, there were a number of issues that arose in the past where people got confused about what to do with vma iterator state.
I think it's a difficult problem - we're both trying to abstract stuff here but also retain performance, which is a trade-off.
+++ b/mm/mmap_lock.c
@@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		if (PTR_ERR(vma) == -EAGAIN) {
 			count_vm_vma_lock_event(VMA_LOCK_MISS);
 			/* The area was replaced with another one */
+			mas_set(&mas, address);

I wonder if we could detect that the RCU lock was released (+ reacquired) in mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
Dropping and reacquiring the RCU read lock should have been a big red flag. I didn't have time to review the patches, but if I had, I would
I think if you have 3 mm developers who all work with VMAs all the time missing this, that's a signal that something is confusing here :)
So the issue is we all thought dropping the RCU lock would be OK, and mas_walk(...) would 'somehow' do the right thing. See above for why I think perhaps that happened.
But again, I feel like we could replace the maple state with any helper struct and this could also be missed.
I disagree for the reasons stated above.
I'm not sure there's an easy way to remove this class of errors without changing the basic tooling to be rust or the like...
Well I like to be optimistic that we can find ways forward without that.
vma_start_read() is inherently complicated because of what it does without taking the mmap lock. Dealing with a potential failure/retry is equally messy.
Yes I agree.
The locking is impossible to do in a clean way since one caller does not take the rcu read lock itself, but may return without it held in many scenarios.
Yes absolutely. I am not necessarily in love with how complicated we've made all of this and I am not sure it was justified, but unfortunately I didn't pay enough attention to the VMA lock seqcount rework.
have suggested passing the mas down to the routine that drops the rcu read lock so it can be invalidated before dropping the readlock.
This would require changing vma_start_read(), which is called by both lock_vma_under_rcu() and lock_next_vma().
We could make them consistent and have lock_vma_under_rcu() do something like:
VMA_ITERATOR(vmi, mm, address);
...
	rcu_read_lock();
	vma = vma_start_read(&vmi);
And have vma_start_read() handle the:
	if (!vma) {
		rcu_read_unlock();
		goto inval;
	}
Case we have in lock_vma_under_rcu() now.
We'd need to keep:
	vma = vma_next(vmi);
	if (!vma)
		return NULL;
In lock_next_vma().
Then you could have:
err:
	/* Reset so state is valid if reused. */
	vma_iter_reset(vmi);
	rcu_read_unlock();
In vma_start_read().
Assuming any/all of this is correct :)
I _think_ based on what Liam said in other sub-thread the reset should work here (perhaps not quite maximally efficient).
No, don't do that. If you want to go this route, use vma_iter_set() in the error label to set the address. Which means that we'll need to pass the vma iterator and the address into vma_start_read() from both callers.
Well that's what I'm proposing we do re: passing in the vma iterator, so it seems we're generally aligned on this, but sure we should use vma_iter_set(), ack on that.
And may as well add this in vma_start_read()..
err_unstable:
	vma_iter_set(&vmi, address);
Ack.
If we risk perhaps relying on the optimiser to help us or hope no real perf impact perhaps we could do both by also having the 'set address' bit happen in lock_vma_under_rcu() also e.g.:
VMA_ITERATOR(vmi, mm, address);
...
retry:
	rcu_read_lock();
	vma_iter_set(&vmi, address);
	vma = vma_start_read(&vmi);
lock_next_vma() also calls vma_iter_set() in the -EAGAIN case, so passing both through might make more sense.
Yes.
Let me know if any of this is sane... :)
The locking on this function makes it virtually impossible to reuse for anything beyond the two users it has today. Passing the iterator down might remind people of what to do if the function itself changes. It does seem like the right way of handling this, since we can't clean up the locking.
OK, I can put forward a patch for this if that'd be helpful!
Thanks, Liam
Cheers, Lorenzo
* Lorenzo Stoakes lorenzo.stoakes@oracle.com [251114 06:51]:
On Thu, Nov 13, 2025 at 12:28:58PM -0500, Liam R. Howlett wrote:
- Lorenzo Stoakes lorenzo.stoakes@oracle.com [251113 05:45]:
On Thu, Nov 13, 2025 at 12:04:19AM +0000, Matthew Wilcox wrote:
On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree.
Since we all missed it I do wonder if we need some super clear comment saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state by doing 'blah'.
I mean, this really isn't an RCU thing. This is also bad:
	spin_lock(a);
	p = *q;
	spin_unlock(a);

	spin_lock(a);
	b = *p;
p could have been freed while you didn't hold lock a. Detecting this kind of thing needs compiler assistance (ie Rust) to let you know that you don't have the right to do that any more.
Right but in your example the use of the pointers is _really clear_. In the mas situation, the pointers are embedded in the helper struct, there's a state machine, etc. so it's harder to catch this.
We could modify the above example to use a helper struct and the same problem would arise...
I disagree.
It's a helper struct with a state machine, manipulated by API functions. In fact it turns out we _can_ recover this state even after dropping/reacquiring the lock by calling the appropriate API functions to do so.
You manipulate this state via mas_xxx() commands, and in fact we resolve the lock issue by issuing the correct one.
The state is never recovered.. it's re-established entirely.
What is happening is, we are walking a tree data structure and keeping track of where we are by keeping a pointer to the node. This node remains stable as long as the rcu or write lock is held for the tree.
If you are not unlocking, you could see how keeping the node for prev/next operations would be much faster.. it's just one pointer away.
When you drop whatever lock you are holding, that node may disappear, which is what happened in this bug.
When you mas_reset() or mas_set() or mas_set_range(), then you are setting the node in the maple state to MA_START. Any operation you call from then on will start over (get the root node and re-walk the tree).
So, calling the reset removes any potentially stale pointers but does not restore any state.
mas_set()/mas_set_range() sets the index and last to what you are looking for, which is part of the state. The operations will set the index/last to the range it finds on a search. In the vma case, this isn't very useful since we have vm_start/vm_end.
The state is re-established once you call the api to find something again.
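Concretely, after a successful walk the state mirrors the range of the entry it found (sketch; VMAs occupy vm_start through vm_end - 1 in the tree):

	MA_STATE(mas, &mm->mm_mt, address, address);

	vma = mas_walk(&mas);
	/* on success: mas.index == vma->vm_start, mas.last == vma->vm_end - 1 */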
This is, imo, very close to having a vma in a helper struct, then calling a function that drops the mmap lock, reacquires the lock, and continues to use the vma. The only way to restore the vma helper struct to a safe state is to do the vma lookup again and replace the (potentially) stale vma pointer.
If, say, for some reason, during copy_vma() we needed to drop the lock after vma_merge_new_range(), we'd have to restore vmg->target to whatever it was pointed to by vmg->start.. but vmg->start might not be right if vmg->target was set to the previous vma. We'd have to set vmg->target = vma_lookup(vmg->orig_start) or such, then re-evaluate the merging scenario.
I don't really see a difference in mas->node being invalid after a locking dance vs vmg->target being invalid if there was a locking dance. I also think it's fair to say that vma_merge_new_range() is an api that copy_vma() is using.
I do see that hiding it in an API could be missed, but the API exists because the mas struct is used in a lot of places that are in and around locking like this.
I'll add to the documentation, but I suspect it won't really help.
...
Thanks, Liam