On Tue, Jul 15, 2025 at 1:16 AM Vlastimil Babka vbabka@suse.cz wrote:
On 7/10/25 19:02, Suren Baghdasaryan wrote:
On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan surenb@google.com wrote:
On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan surenb@google.com wrote:
On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Suren Baghdasaryan surenb@google.com [250709 11:06]:
On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka vbabka@suse.cz wrote:
> On 7/9/25 16:43, Suren Baghdasaryan wrote:
> > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka vbabka@suse.cz wrote:
> >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> >> >>> +	rcu_read_unlock();
> >> >>> +	vma = lock_vma_under_mmap_lock(mm, iter, address);
> >> >>> +	rcu_read_lock();
> >> >> OK, I guess we hold the RCU lock the whole time as we traverse, except
> >> >> when we lock under the mmap lock.
> >> > Correct.
> >>
> >> I wonder if it's really necessary? Can't it be done just inside
> >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> >>
> >> Even if we later manage to extend this approach to smaps and employ rcu
> >> locking to traverse the page tables, I'd think it's best to separate and
> >> fine-grain the rcu lock usage for the vma iterator and the page tables,
> >> if only to avoid spending too long a time under the lock.
> >
> > I thought we would need to be in the same rcu read section while
> > traversing the maple tree using vma_next(), but now looking at it,
> > maybe we can indeed enter it only while finding and locking the next
> > vma... Liam, would that work? I see struct ma_state contains a node
> > field. Can it be freed from under us if we find a vma, exit the rcu
> > read section, then re-enter rcu and use the same iterator to find the
> > next vma?
>
> If the rcu protection needs to be contiguous, and patch 8 avoids the issue
> by always doing vma_iter_init() after rcu_read_lock() (but does it really
> avoid the issue, or is it why we see the syzbot reports?), then I guess in
> the code quoted above we also need a vma_iter_init() after the
> rcu_read_lock(), because although the iterator was used briefly under
> mmap_lock protection, that was then unlocked and there can be a race
> before the rcu_read_lock().
Quite true. So let's wait for Liam's confirmation, and based on his answer I'll change the patch by either reducing the rcu read section or adding the missing vma_iter_init() after we switch to mmap_lock.
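For concreteness, a minimal sketch of the second option, building on the hunk quoted above (mm, iter, address and lock_vma_under_mmap_lock() come from the quoted patch; the exact placement is only illustrative):

	rcu_read_unlock();
	vma = lock_vma_under_mmap_lock(mm, iter, address);
	rcu_read_lock();
	/*
	 * The rcu read section was left above, so any maple node cached
	 * in the iterator may have been freed (and reallocated) before
	 * rcu_read_lock() was re-taken; reset the iterator before reuse.
	 */
	vma_iter_init(iter, mm, address);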
You need to either be under rcu or mmap lock to ensure the node in the maple state hasn't been freed (and potentially, reallocated).
So in this case, at the higher level, we can hold the rcu read lock for a series of walks and avoid re-walking the tree; the performance will be better that way.
Got it. Thanks for confirming!
When we return to userspace, then we should drop the rcu read lock and will need to vma_iter_set()/vma_iter_invalidate() on return. I thought this was being done (through vma_iter_init()), but syzbot seems to indicate a path that was missed?
We do that in m_start()/m_stop() by calling lock_vma_range()/unlock_vma_range(), but I think I have two problems here:
1. As Vlastimil mentioned, I do not reset the iterator when falling back to mmap_lock and exiting and then re-entering the rcu read section;
2. I do not reset the iterator after exiting the rcu read section in m_stop() and re-entering it in m_start(), so the later call to lock_next_vma() might be using an iterator with a node that was freed (and possibly reallocated); a sketch of this second fix follows below.
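A rough sketch of what the second fix could look like; lock_vma_range(), lock_next_vma() and the priv->iter/priv->mm fields follow the names used in this thread, while the surrounding details (argument order, *ppos handling) are illustrative only:

static void *m_start(struct seq_file *m, loff_t *ppos)
{
	struct proc_maps_private *priv = m->private;
	unsigned long last_addr = *ppos;

	lock_vma_range(priv);	/* enters the rcu read section, per this thread */
	/*
	 * The previous rcu read section ended in m_stop(), so a maple
	 * node cached in priv->iter may have been freed (and possibly
	 * reallocated) since; reset the iterator before walking again.
	 */
	vma_iter_set(&priv->iter, last_addr);
	return lock_next_vma(priv->mm, &priv->iter, last_addr);
}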
This is the same thing that needed to be done previously with the mmap lock, but now under the rcu lock.
I'm not sure how to mitigate the issue with the page tables; maybe we guess at a number of vmas, like we were doing for 4k blocks of output, and just drop/reacquire the lock then. Probably a problem for another day anyway.
Also, I think you can change the vma_iter_init() to vma_iter_set(), which is slightly less code under the hood. Vlastimil asked about this and it's probably a better choice.
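For reference, the difference between the two helpers, paraphrased from the kernel headers (exact location and details may vary by version):

static inline void vma_iter_init(struct vma_iterator *vmi,
				 struct mm_struct *mm, unsigned long addr)
{
	mas_init(&vmi->mas, &mm->mm_mt, addr);	/* full re-init of the maple state */
}

static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
{
	mas_set(&vmi->mas, addr);	/* just reset the index and force a re-walk */
}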
Ack. I'll update my series with these fixes and all the comments I received so far, run the reproducers to confirm there are no issues, and repost later today.
I have the patchset ready but would like to test it some more. Will post it tomorrow.
Ok, I found a couple of issues using the syzbot reproducer [1] (which is awesome BTW!):
- rwsem_acquire_read() inside vma_start_read() at [2] should be moved after the last check, otherwise the lock is considered taken on vma->vm_refcnt overflow;
I think it's fine because if the last check fails there's a vma_refcount_put() that includes rwsem_release(), no?
Ah, yes, you are right. This is fine. Obviously trying to figure out the issue right before a flight is not a good idea :)
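For anyone following along, a condensed paraphrase of the failure paths in vma_start_read() at [2]; the refcount helper here is a simplified stand-in for the real limited-acquire primitive, so see the linked code for exact names and memory ordering:

	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
	if (!vma_refcount_inc_not_zero(vma)) {	/* simplified stand-in helper */
		/* refcount overflow or zero: pair the lockdep acquire directly */
		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
		return NULL;
	}
	...
	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
		/* the failing last check: rwsem_release() runs inside here */
		vma_refcount_put(vma);
		return NULL;
	}
	return vma;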
- query_matching_vma() is missing an unlock_vma() call when it does "goto next_vma;" and re-issues query_vma_find_by_addr(); the previous vma is left locked (a rough sketch of the fix follows).
[1] https://syzkaller.appspot.com/x/repro.c?x=101edf70580000
[2] https://elixir.bootlin.com/linux/v6.15.5/source/include/linux/mm.h#L747
After these fixes it's much harder to make it fail, but I still get one more error, copied below. I will continue the investigation and will hold off reposting until this is fixed. That will be next week, since I'll be out of town the rest of this week.
Andrew, could you please remove this patchset from mm-unstable for now until I fix the issue and re-post the new version?
Andrew, can you do that please? We keep getting new syzbot reports.
The error I got after these fixes is:
I suspect the root cause is that the ioctls are not serialized against each other (probably not even against read()), and yet we treat m->private as safe to work on. Now we have various fields that are dangerous to race on; for example, locked_vma and iter races would explain a lot of this.
I suspect that as long as we used a purely seq_file workflow, it did the right thing for us wrt serialization, but the ioctl addition violates that. We should also recheck the code from before this series to see whether dangerous ioctl vs read() races were possible even there. And the ioctl implementation should be refactored to use its own per-ioctl-call private context, not the seq_file's per-file-open context; an illustration of that direction follows.
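As a sketch of that refactoring direction only (the context struct and all names below are hypothetical, not from the series):

static long do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
{
	/*
	 * Per-call context on the stack: concurrent ioctl()/read()
	 * callers on the same fd each get their own iterator and
	 * locked-vma state instead of racing on m->private.
	 */
	struct procmap_query_ctx ctx = { .mm = priv->mm };	/* hypothetical */

	vma_iter_init(&ctx.iter, ctx.mm, 0);
	/* ... perform the query against ctx only, never touching m->private ... */
	return 0;
}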
Huh, I completely failed to consider this. In hindsight it is quite obvious... Thanks Vlastimil, I owe you a beer or two.