On Wed, Jul 5, 2023 at 5:33 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
* Suren Baghdasaryan <surenb@google.com> [230705 20:20]:
On Wed, Jul 5, 2023 at 4:07 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
* Suren Baghdasaryan <surenb@google.com> [230705 13:24]:
On Wed, Jul 5, 2023 at 10:14 AM David Hildenbrand <david@redhat.com> wrote:
On 05.07.23 19:12, Suren Baghdasaryan wrote:
When forking a child process, the parent write-protects an anonymous page and COW-shares it with the child being forked using copy_present_pte().

The parent's TLB is flushed right before we drop the parent's mmap_lock in dup_mmap(). If we get a write fault in the parent before that TLB flush, and we end up replacing that anonymous page in the parent process in do_wp_page() (because it is COW-shared with the child), this might leave stale writable TLB entries targeting the wrong (old) page. A similar issue happened in the past with userfaultfd (see the flush_tlb_page() call inside do_wp_page()).

Lock the VMAs of the parent process when forking a child, which prevents concurrent page faults during the fork operation and avoids this issue.

This fix can potentially regress some fork-heavy workloads. Kernel build time did not show a noticeable regression on a 56-core machine, while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows a ~5% regression. If such a fork-time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asy...
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
 kernel/fork.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..403bc2b72301 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -658,6 +658,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		retval = -EINTR;
 		goto fail_uprobe_end;
 	}
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Disallow any page faults before calling flush_cache_dup_mm */
+	for_each_vma(old_vmi, mpnt)
+		vma_start_write(mpnt);
+	vma_iter_init(&old_vmi, oldmm, 0);
vma_iter_set(&old_vmi, 0) is probably what you want here.
Ok, I'll send another version with that.
+#endif
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
 	/*
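(For reference, if the vma_iter_set() suggestion above is applied, the block would presumably become something like the sketch below; this is not the actual follow-up version, which may differ. vma_iter_set() merely repositions the existing iterator, while vma_iter_init() re-initializes it from scratch, so the former is the natural way to rewind before the main copy loop:

#ifdef CONFIG_PER_VMA_LOCK
	/* Disallow any page faults before calling flush_cache_dup_mm */
	for_each_vma(old_vmi, mpnt)
		vma_start_write(mpnt);
	/* Rewind the iterator for the copy loop below */
	vma_iter_set(&old_vmi, 0);
#endif
)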
The old version was most probably fine as well, but this certainly looks even safer.
Acked-by: David Hildenbrand <david@redhat.com>
I think this is overkill and believe that calling vma_start_write() will synchronize with any readers, since it takes the per-VMA rw semaphore in write mode. Anything faulting will need to finish before the fork continues, and faults during the fork will fall back to a read lock of the mmap_lock. Is there a possibility of a populate happening outside the mmap_write lock/vma_lock?
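(For context, vma_start_write() in kernels of that era looked roughly like the following; this is a paraphrase from memory of include/linux/mm.h, and exact details vary by version:

static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq;

	/* Nothing to do if this mm has already write-locked the VMA */
	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;

	/* Taking the rwsem in write mode drains all vma_start_read() holders */
	down_write(&vma->vm_lock->lock);
	/*
	 * Matching vm_lock_seq to the mm's sequence makes subsequent
	 * vma_start_read() attempts bail out, so new faults fall back to
	 * taking the mmap_lock in read mode.
	 */
	vma->vm_lock_seq = mm_lock_seq;
	up_write(&vma->vm_lock->lock);
}

Since dup_mmap() runs under mmap_write_lock(), any fault that falls back to the mmap_lock then blocks until the fork is done, which is the synchronization described above.)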
Yes, I think we understand the loss of concurrency in the parent's ability to fault pages while forking. Is that a real problem though?
No, I don't think that part is an issue at all. I wanted to be sure I didn't miss something.
Was your benchmarking done with this loop at the start?
No, it was done with the initial version where the lock was taken inside the existing loop. I just reran the benchmark, and while kernel compilation times did not change, the stress test now shows a ~7% regression, probably due to that additional tree walk. I'll update that number in the new patch.
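(The stress test itself isn't posted in the thread; a minimal userspace reconstruction of the described workload might look like the sketch below. NR_VMAS/NR_FORKS match the numbers quoted above, but the structure, including the merge-avoidance trick, is an assumption, not the actual harness:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_VMAS  10000
#define NR_FORKS 5000

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);

	/* Create many separate VMAs; alternating protections prevents merging */
	for (int i = 0; i < NR_VMAS; i++) {
		int prot = (i & 1) ? PROT_READ : (PROT_READ | PROT_WRITE);

		if (mmap(NULL, pagesz, prot,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	/* Fork in a tight loop; each dup_mmap() now write-locks every VMA */
	for (int i = 0; i < NR_FORKS; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
	return 0;
}
)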
Ah, that explains it; I expected a performance hit and didn't understand why you had updated the patch this way. It would probably only be felt on really big trees, and the largest trees I see are on the Android side. I'd wager the impact will be greater when larger trees meet smaller CPU caches.
My test maps 10000 VMAs, and even for Android that's a stretch (the highest number I've seen was ~4000). We can look into a less restrictive solution if this proves to be a problem for some workloads, but for now I would prefer to fix this in a safe way and possibly optimize it later. The alternative is to revert the feature completely, and then we get no more testing until the next release.
Thanks, Liam