On Wed, Jul 5, 2023 at 5:33 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
* Suren Baghdasaryan <surenb@google.com> [230705 20:20]:
On Wed, Jul 5, 2023 at 4:07 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
* Suren Baghdasaryan <surenb@google.com> [230705 13:24]:
On Wed, Jul 5, 2023 at 10:14 AM David Hildenbrand <david@redhat.com> wrote:
On 05.07.23 19:12, Suren Baghdasaryan wrote:
When forking a child process, the parent write-protects an anonymous page and COW-shares it with the child being forked using copy_present_pte().

The parent's TLB is flushed right before we drop the parent's mmap_lock in dup_mmap(). If we get a write fault in the parent before that TLB flush, and we end up replacing that anonymous page in the parent process in do_wp_page() (because it is COW-shared with the child), this might leave stale writable TLB entries targeting the wrong (old) page. A similar issue happened in the past with userfaultfd (see the flush_tlb_page() call inside do_wp_page()).

Lock the VMAs of the parent process when forking a child, which prevents concurrent page faults during the fork operation and avoids this issue.

This fix can potentially regress some fork-heavy workloads. Kernel build time did not show a noticeable regression on a 56-core machine, while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows a ~5% regression. If such a fork-time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asy...
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
 kernel/fork.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..403bc2b72301 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -658,6 +658,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		retval = -EINTR;
 		goto fail_uprobe_end;
 	}
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Disallow any page faults before calling flush_cache_dup_mm */
+	for_each_vma(old_vmi, mpnt)
+		vma_start_write(mpnt);
+	vma_iter_init(&old_vmi, oldmm, 0);
vma_iter_set(&old_vmi, 0) is probably what you want here.
Ok, I'll send another version with that.
+#endif
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
 	/*
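(For reference, if the vma_iter_set() suggestion above is applied, the block would presumably become something like the sketch below; this is not the actual follow-up version, which may differ. vma_iter_set() merely repositions the existing iterator, while vma_iter_init() re-initializes it from scratch, so the former is the natural way to rewind before the main copy loop:

#ifdef CONFIG_PER_VMA_LOCK
	/* Disallow any page faults before calling flush_cache_dup_mm */
	for_each_vma(old_vmi, mpnt)
		vma_start_write(mpnt);
	/* Rewind the iterator for the copy loop below */
	vma_iter_set(&old_vmi, 0);
#endif
)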
The old version was most probably fine as well, but this certainly looks even safer.
Acked-by: David Hildenbrand <david@redhat.com>
I think this is overkill and believe that calling vma_start_write() will synchronize with any readers, since it takes the per-VMA rw semaphore in write mode. Anything faulting will need to finish before the fork continues, and faults during the fork will fall back to a read lock of the mmap_lock. Is there a possibility of a populate happening outside the mmap_write lock/vma_lock?
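(For context, vma_start_write() in kernels of that era looked roughly like the following; this is a paraphrase from memory of include/linux/mm.h, and exact details vary by version:

static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq;

	/* Nothing to do if this mm has already write-locked the VMA */
	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;

	/* Taking the rwsem in write mode drains all vma_start_read() holders */
	down_write(&vma->vm_lock->lock);
	/*
	 * Matching vm_lock_seq to the mm's sequence makes subsequent
	 * vma_start_read() attempts bail out, so new faults fall back to
	 * taking the mmap_lock in read mode.
	 */
	vma->vm_lock_seq = mm_lock_seq;
	up_write(&vma->vm_lock->lock);
}

Since dup_mmap() runs under mmap_write_lock(), any fault that falls back to the mmap_lock then blocks until the fork is done, which is the synchronization described above.)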
Yes, I think we understand the loss of concurrency in the parent's ability to fault pages while forking. Is that a real problem though?
No, I don't think that part is an issue at all. I wanted to be sure I didn't miss something.
Was your benchmarking done with this loop at the start?
No, it was done with the initial version where the lock was taken inside the existing loop. I just reran the benchmark, and while kernel compilation times did not change, the stress test now shows a ~7% regression, probably due to that additional tree walk. I'll update that number in the new patch.
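(The stress test itself isn't posted in the thread; a minimal userspace reconstruction of the described workload might look like the sketch below. NR_VMAS/NR_FORKS match the numbers quoted above, but the structure, including the merge-avoidance trick, is an assumption, not the actual harness:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_VMAS  10000
#define NR_FORKS 5000

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);

	/* Create many separate VMAs; alternating protections prevents merging */
	for (int i = 0; i < NR_VMAS; i++) {
		int prot = (i & 1) ? PROT_READ : (PROT_READ | PROT_WRITE);

		if (mmap(NULL, pagesz, prot,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	/* Fork in a tight loop; each dup_mmap() now write-locks every VMA */
	for (int i = 0; i < NR_FORKS; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
	return 0;
}
)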
Ah, that explains it; I expected a performance hit and didn't understand why you had updated the patch this way. It would probably only be felt on really big trees, and the largest trees I see are on the Android side. I'd wager the impact will be greater when larger trees meet smaller CPU caches.
My test maps 10000 VMAs, and even for Android that's a stretch (the highest number I've seen was ~4000). We can look into a less restrictive solution if this proves to be a problem for some workloads, but for now I would prefer to fix this in a safe way and possibly optimize it later. The alternative is to revert the feature completely, and then we get no more testing until the next release.
Thanks, Liam