On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Suren Baghdasaryan surenb@google.com [250709 11:06]:
On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka vbabka@suse.cz wrote:
On 7/9/25 16:43, Suren Baghdasaryan wrote:
On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka vbabka@suse.cz wrote:
On 7/8/25 01:10, Suren Baghdasaryan wrote:
>> + rcu_read_unlock(); >> + vma = lock_vma_under_mmap_lock(mm, iter, address); >> + rcu_read_lock(); > OK I guess we hold the RCU lock the whole time as we traverse except when > we lock under mmap lock. Correct.
I wonder if it's really necessary? Can't it be done just inside lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
Even if we later manage to extend this approach to smaps and employ rcu locking to traverse the page tables, I'd think it's best to separate and fine-grain the rcu lock usage for vma iterator and page tables, if only to avoid too long time under the lock.
I thought we would need to be in the same rcu read section while traversing the maple tree using vma_next() but now looking at it, maybe we can indeed enter only while finding and locking the next vma... Liam, would that work? I see struct ma_state containing a node field. Can it be freed from under us if we find a vma, exit rcu read section then re-enter rcu and use the same iterator to find the next vma?
If the rcu protection needs to be contigous, and patch 8 avoids the issue by always doing vma_iter_init() after rcu_read_lock() (but does it really avoid the issue or is it why we see the syzbot reports?) then I guess in the code quoted above we also need a vma_iter_init() after the rcu_read_lock(), because although the iterator was used briefly under mmap_lock protection, that was then unlocked and there can be a race before the rcu_read_lock().
Quite true. So, let's wait for Liam's confirmation and based on his answer I'll change the patch by either reducing the rcu read section or adding the missing vma_iter_init() after we switch to mmap_lock.
You need to either be under rcu or mmap lock to ensure the node in the maple state hasn't been freed (and potentially, reallocated).
So in this case, in the higher level, we can hold the rcu read lock for a series of walks and avoid re-walking the tree then the performance would be better.
Got it. Thanks for confirming!
When we return to userspace, then we should drop the rcu read lock and will need to vma_iter_set()/vma_iter_invalidate() on return. I thought this was being done (through vma_iter_init()), but syzbot seems to indicate a path that was missed?
We do that in m_start()/m_stop() by calling lock_vma_range()/unlock_vma_range() but I think I have two problems here: 1. As Vlastimil mentioned I do not reset the iterator when falling back to mmap_lock and exiting and then re-entering rcu read section; 2. I do not reset the iterator after exiting rcu read section in m_stop() and re-entering it in m_start(), so the later call to lock_next_vma() might be using an iterator with a node that was freed (and possibly reallocated).
This is the same thing that needed to be done previously with the mmap lock, but now under the rcu lock.
I'm not sure how to mitigate the issue with the page table, maybe we guess on the number of vmas that we were doing for 4k blocks of output and just drop/reacquire then. Probably a problem for another day anyways.
Also, I think you can also change the vma_iter_init() to vma_iter_set(), which is slightly less code under the hood. Vlastimil asked about this and it's probably a better choice.
Ack. I'll update my series with these fixes and all comments I received so far, will run the reproducers to confirm no issues and repost them later today. Thanks, Suren.
Thanks, Liam