With recent changes necessitating mmap_lock to be held for write while expanding a stack, per-VMA locks should follow the same rules and be write-locked to prevent page faults into the VMA being expanded. Add the necessary locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 204ddcd52625..c66e4622a557 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1977,6 +1977,8 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		return -ENOMEM;
 	}
+	/* Lock the VMA before expanding to prevent concurrent page faults */
+	vma_start_write(vma);
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
 	 * is required to hold the mmap_lock in read mode. We need the
@@ -2064,6 +2066,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 		return -ENOMEM;
 	}
+	/* Lock the VMA before expanding to prevent concurrent page faults */
+	vma_start_write(vma);
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
 	 * is required to hold the mmap_lock in read mode. We need the
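For context, a page fault handled under per-VMA locks only proceeds if it can take the VMA's read lock; a write-locked VMA forces the fault back onto the mmap_lock path, which the expansion code now holds for write. A simplified sketch of the reader side, modeled on the 6.4-era vma_start_read() (details vary across kernel versions):

	static inline bool vma_start_read(struct vm_area_struct *vma)
	{
		/* Check before locking. A racing vma_start_write() makes this fail. */
		if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
			return false;

		if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
			return false;

		/*
		 * Recheck under the lock: vma_start_write() publishes mm_lock_seq
		 * into vm_lock_seq, and that stays set until mmap_write_unlock().
		 */
		if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
			up_read(&vma->vm_lock->lock);
			return false;
		}
		return true;
	}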
mmap_region() adds a newly created VMA into the VMA tree and might modify it afterwards before dropping the mmap_lock. This poses a problem for page faults handled under per-VMA locks because they don't take the mmap_lock and can stumble on this VMA while it's still being modified. Currently this does not pose a problem since post-addition modifications are done only for file-backed VMAs, which are not handled under per-VMA lock. However, once support for handling file-backed page faults with per-VMA locks is added, this will become a race.

Fix this by write-locking the VMA before inserting it into the VMA tree. Other places where a new VMA is added into the VMA tree do not modify it after the insertion, so they do not need the same locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index c66e4622a557..84c71431a527 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2812,6 +2812,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (vma->vm_file)
 		i_mmap_lock_write(vma->vm_file->f_mapping);
+	/* Lock the VMA since it is modified after insertion into VMA tree */
+	vma_start_write(vma);
 	vma_iter_store(&vmi, vma);
 	mm->map_count++;
 	if (vma->vm_file) {
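To make the window concrete, an annotated sketch of the ordering in question (illustrative only, not the exact mmap_region() code; the fault-side lookup is assumed to be the lock_vma_under_rcu() path):

	/* mmap_region(), without this patch: */
	vma_iter_store(&vmi, vma);	/* VMA becomes visible to lockless lookup */
	mm->map_count++;
	if (vma->vm_file) {
		/*
		 * Post-insertion setup of the file-backed VMA happens here.
		 * Once file-backed faults are handled under per-VMA locks,
		 * a racing fault that already found the VMA in the maple
		 * tree could observe these updates half-done.
		 */
	}

Write-locking the VMA before vma_iter_store() closes the window: readers cannot take the VMA read lock again until mmap_write_unlock().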
When forking a child process, the parent write-protects an anonymous page and COW-shares it with the child being forked using copy_present_pte(). The parent's TLB is flushed right before we drop the parent's mmap_lock in dup_mmap(). If we get a write fault before that TLB flush in the parent, and we end up replacing that anonymous page in the parent process in do_wp_page() (because it is COW-shared with the child), this might lead to some stale writable TLB entries targeting the wrong (old) page. A similar issue happened in the past with userfaultfd (see the flush_tlb_page() call inside do_wp_page()).

Lock VMAs of the parent process when forking a child, which prevents concurrent page faults during the fork operation and avoids this issue.

This fix can potentially regress some fork-heavy workloads. Kernel build time did not show a noticeable regression on a 56-core machine, while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows a ~5% regression. If such a fork-time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asy...
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..d2e12b6d2b18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;
+		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
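The flush/unlock ordering the message refers to sits at the tail of dup_mmap(); roughly (trimmed from kernel/fork.c of this era):

	/* end of dup_mmap(), after the VMA copy loop: */
	mmap_write_unlock(mm);		/* child mm */
	flush_tlb_mm(oldmm);		/* parent TLB flushed here ... */
	mmap_write_unlock(oldmm);	/* ... just before the parent mmap_lock drops */

A parent thread writing through a stale writable TLB entry before that flush_tlb_mm() is the window the scenario above depends on.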
On Sat, Jul 8, 2023 at 12:12 PM Suren Baghdasaryan surenb@google.com wrote:
Sending this earlier version of the patch per request from Linus and with his explanation here: https://lore.kernel.org/all/CAHk-=wi-99-DyMOGywTbjRnRRC+XfpPm=r=pei4A=MEL0QD...
--
2.41.0.390.g38632f3daf-goog
On Sat, 8 Jul 2023 at 12:12, Suren Baghdasaryan surenb@google.com wrote:
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)
I ended up editing your explanation a lot.
I'm not convinced that the bug has much to do with the delayed tlb flushing.
I think it's more fundamental than some tlb coherence issue: our VM copying simply expects to not have any unrelated concurrent page fault activity, and various random internal data structures simply rely on that.
I made up an example that I'm not sure is relevant to any of the particular failures, but that I think is a non-TLB case: the parent 'vma->anon_vma' chain is copied by dup_mmap() in anon_vma_fork(), and it's possible that the parent vma didn't have any anon_vma associated with it at that point.
But a concurrent page fault to the same vma - even *before* the page tables have been copied, and when the TLB is still entirely coherent - could then cause an anon_vma_prepare() on that parent vma, and associate one of the pages with that anon-vma.
Then the page table copy happens, and that page gets marked read-only again, and is added to both the parent and the child vma's, but the child vma never got associated with the parent's new anon_vma, because it didn't exist when anon_vma_fork() happened.
Does this ever happen? I have no idea. But it would seem to be an example that really has nothing to do with any TLB state, and is just simply "we cannot handle concurrent page faults while we're busy copying the mm".
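The early exit in question is visible at the top of anon_vma_fork() (trimmed excerpt from mm/rmap.c of this era):

	int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
	{
		...
		/* Don't bother if the parent process has no anon_vma here. */
		if (!pvma->anon_vma)
			return 0;
		...
	}

If a concurrent fault creates the parent's anon_vma after this early return, the child vma ends up sharing pages it has no anon_vma linkage for.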
Again - maybe I messed up, but it really feels like the missing vma_start_write() was more fundamental, and not some "TLB coherency" issue.
Linus
On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds torvalds@linux-foundation.org wrote:
Again - maybe I messed up, but it really feels like the missing vma_start_write() was more fundamental, and not some "TLB coherency" issue.
Sounds plausible. I'll try to use the reproducer to verify if that's indeed happening here. It's likely there are multiple problematic scenarios due to this missing lock though. Thanks, Suren.
On Sat, 8 Jul 2023 at 15:36, Suren Baghdasaryan surenb@google.com wrote:
Again - maybe I messed up, but it really feels like the missing vma_start_write() was more fundamental, and not some "TLB coherency" issue.
Sounds plausible. I'll try to use the reproducer to verify if that's indeed happening here.
I really don't think that's what people are reporting, I was just trying to make up a completely different case that has nothing to do with any TLB issues.
My real point was simply this one:
It's likely there are multiple problematic scenarios due to this missing lock though.
Right. That's my issue. I felt your explanation was *too* targeted at some TLB non-coherency thing, when I think the problem was actually a much larger "page faults simply must not happen while we're copying the page tables because data isn't coherent".
The anon_vma case was just meant as another random example of the other kinds of things I suspect can go wrong, because we're simply not able to do this whole "copy vma while it's being modified by page faults".
Now, I agree that the PTE problem is real, and probably the main thing, ie when we as part of fork() do this:
	/*
	 * If it's a COW mapping, write protect it both
	 * in the parent and the child
	 */
	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}
and the thing that can go wrong before the TLB flush happens is that - because the TLBs haven't been flushed yet - some threads in the parent happily continue to write to the page, never having seen the wrprotect happen.
And then you get into the situation where *some* threads see the page protections change (maybe they had a TLB flush event on that CPU for random reasons), and they will take a page fault and do the COW thing and create a new page.
And all the while *other* threads still see the old writeable TLB state, and continue to write to the old page.
So now you have a page that gets its data copied *while* somebody is still writing to it, and the end result is that some write easily gets lost, and so when that new copy is installed, you see it as data corruption.
And I agree completely that that is probably the thing that most people actually saw and reacted to as corruption.
But the reason I didn't like the explanation was that I think this is just one random example of the more fundamental issue of "we simply must not take page faults while copying".
Your explanation made me think "stale TLB is the problem", and *that* was what I objected to. The stale TLB was just one random sign of the much larger problem.
It might even have been the most common symptom, but I think it was just a *symptom*, not the *cause* of the problem.
And I must have been bad at explaining that, because David Hildenbrand also reacted negatively to my change.
So I'll happily take a patch that adds more commentary about this, and gives several examples of the things that go wrong.
Linus
On Sat, Jul 8, 2023 at 3:54 PM Linus Torvalds torvalds@linux-foundation.org wrote:
So I'll happily take a patch that adds more commentary about this, and gives several examples of the things that go wrong.
How about adding your example to the original description as yet another scenario which is broken without this change? I guess having both issues described would not hurt.
On Sat, Jul 08, 2023 at 12:12:12PM -0700, Suren Baghdasaryan wrote: [..]
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..d2e12b6d2b18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;

+		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
I don't see it mentioned in the discussion, so at a risk of ruffling feathers or looking really bad I'm going to ask: is the locking of any use if the forking process is single-threaded? The singular thread in this case is occupied executing this very code, so it can't do any op in parallel. Is there anyone else who could trigger a page fault? Are these shared with other processes? Cursory reading suggests a private copy is made here, so my guess is no. But then again, I landed here freshly from the interwebz.
Or in short: if nobody can mess up the state if the forking process is single-threaded, why not check for mm_users or whatever other indicator to elide the slowdown for the (arguably) most common case?
If the state can be messed up anyway, that's a shame, but short explanation how would be welcome.
to illustrate (totally untested):

diff --git a/kernel/fork.c b/kernel/fork.c
index d2e12b6d2b18..aac6b08a0b21 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -652,6 +652,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	LIST_HEAD(uf);
 	VMA_ITERATOR(old_vmi, oldmm, 0);
 	VMA_ITERATOR(vmi, mm, 0);
+	bool singlethreaded = READ_ONCE(oldmm->mm_users) == 1;
 
 	uprobe_start_dup_mmap();
 	if (mmap_write_lock_killable(oldmm)) {
@@ -686,7 +687,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;
 
-		vma_start_write(mpnt);
+		if (!singlethreaded)
+			vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
On Fri, 4 Aug 2023 at 14:46, Mateusz Guzik mjguzik@gmail.com wrote:
I don't see it mentioned in the discussion, so at a risk of ruffling feathers or looking really bad I'm going to ask: is the locking of any use if the forking process is single-threaded?
Sadly, we've always been able to access the mm from other processes, so the locking is - I think - unavoidable.
And some of those "access from other processes" aren't even uncommon or special. It's things like "ps" etc, that do it just to see the process name and arguments.
Linus
On 8/5/23, Linus Torvalds torvalds@linux-foundation.org wrote:
And some of those "access from other processes" aren't even uncommon or special. It's things like "ps" etc, that do it just to see the process name and arguments.
I know of these guys, I think they are excluded as is -- they go through access_remote_vm, starting with:

	if (mmap_read_lock_killable(mm))
		return 0;
while dup_mmap already write locks the parent's mm.
I don't see any surprise relocks of the semaphore.
Granted, should someone *bypass* this mechanism the above would be moot.
On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik mjguzik@gmail.com wrote:
I know of these guys, I think they are excluded as is -- they go through access_remote_vm, starting with:

	if (mmap_read_lock_killable(mm))
		return 0;
while dup_mmap already write locks the parent's mm.
Oh, you're only worried about vma_start_write()?
That's a non-issue. It doesn't take the lock normally, since it starts off with
	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;
which catches on the lock sequence number already being set.
So no extra locking there.
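For reference, the whole function is short; roughly (simplified from the 6.4-era include/linux/mm.h):

	static inline void vma_start_write(struct vm_area_struct *vma)
	{
		int mm_lock_seq;

		/* Fast path: already write-locked in this mmap_write_lock cycle. */
		if (__is_vma_write_locked(vma, &mm_lock_seq))
			return;

		down_write(&vma->vm_lock->lock);
		/* Publishing the sequence number is what makes readers bail out. */
		vma->vm_lock_seq = mm_lock_seq;
		up_write(&vma->vm_lock->lock);
	}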
Well, technically there's extra locking because the code stupidly doesn't initialize new vma allocations to the right sequence number, but that was talked about here:
https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZN...
and it's a separate issue.
Linus
On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds torvalds@linux-foundation.org wrote:
That's a non-issue. It doesn't take the lock normally, since it starts off with
	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;
which catches on the lock sequence number already being set.
That check will prevent re-locking but if vma is not already locked then the call will proceed with obtaining the lock and setting vma->vm_lock_seq to mm->mm_lock_seq.
On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan surenb@google.com wrote:
That check will prevent re-locking but if vma is not already locked then the call will proceed with obtaining the lock and setting vma->vm_lock_seq to mm->mm_lock_seq.
The optimization Mateusz describes looks valid to me. If there is nobody else to fault a page and mm_users is stable (which I think it is because we are holding mmap_lock for write) then we can skip vma locking, I think.
On 8/5/23, Suren Baghdasaryan surenb@google.com wrote:
The optimization Mateusz describes looks valid to me. If there is nobody else to fault a page and mm_users is stable (which I think it is because we are holding mmap_lock for write) then we can skip vma locking, I think.
mm_users is definitely *not* stable -- it can be bumped by get_task_mm, which is only synchronized with task lock.
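For reference, roughly what get_task_mm() does (trimmed from kernel/fork.c; the kernel-thread check is elided):

	struct mm_struct *get_task_mm(struct task_struct *task)
	{
		struct mm_struct *mm;

		task_lock(task);
		mm = task->mm;
		if (mm)
			mmget(mm);	/* bumps mm_users under task_lock only */
		task_unlock(task);
		return mm;
	}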
However, the other users (that I know of ) go through the mmap semaphore to mess with anything which means they will wait for dup_mmap to finish (or do their work first). I would be surprised if there were any cases which don't take the semaphore, given that it was a requirement prior to the vma patchset (unless you patched some to no longer need it?). I would guess worst case the semaphore can be added if missing.
What is guaranteed is that if the forking process is single-threaded, there will be no threads added out of nowhere -- the only thread which could do it is busy creating one in dup_mmap. If multithreaded operation of the forking process was the only problem, that's it.
On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik mjguzik@gmail.com wrote:
mm_users is definitely *not* stable -- it can be bumped by get_task_mm, which is only synchronized with task lock.
Ugh, you are of course correct. mm_users was a poor choice for saying that no new users (threads) can appear from under us.
However, the other users (that I know of ) go through the mmap semaphore to mess with anything which means they will wait for dup_mmap to finish (or do their work first). I would be surprised if there were any cases which don't take the semaphore, given that it was a requirement prior to the vma patchset (unless you patched some to no longer need it?). I would guess worst case the semaphore can be added if missing.
No, the only mmap_lock read-lock that is affected is during the page fault, which is expected.
On 8/5/23, Suren Baghdasaryan surenb@google.com wrote:
No, the only mmap_lock read-lock that is affected is during the page fault, which is expected.
I have difficulty parsing your statement.
I am saying that any 3rd parties which can trigger page faults already read lock mmap_lock or can be made to do it (and I don't know any case which does not already, but I'm not willing to spend time poking around to make sure). One can consider 3rd parties as not a problem, modulo the audit.
Past that, is there no known source of trouble? In my original e-mail I was worried about processes up the ancestry chain -- perhaps some of the state is shared(?) and the locking at hand neuters any problems. I'm guessing this is not necessary.
Bottom line though it looks like this will work fine?
That said, I'm not going to submit a patch I can't confidently defend. As I did not dig into any of the VMA code and can't be arsed to audit all places which mess with "foreign" mm, I'm definitely not submitting this myself. You are most welcome to write your own variant at your leisure. :)
On Fri, Aug 4, 2023 at 6:17 PM Mateusz Guzik mjguzik@gmail.com wrote:
I have difficulty parsing your statement.
I was just saying that the vma lock patchset did not touch any other mmap_locking paths except for the page fault one, where we try to skip read-locking mmap_lock.
That said, I'm not going to submit a patch I can't confidently defend. As I did not dig into any of the VMA code and can't be arsed to audit all places which mess with "foreign" mm, I'm definitely not submitting this myself. You are most welcome to write your own variant at your leisure. :)
Ok, I see. I'll need to double check locking when a 3rd party is involved. Will post a patch when I'm confident enough it's safe. Thanks!
On 8/5/23, Linus Torvalds torvalds@linux-foundation.org wrote:
Well, technically there's extra locking because the code stupidly doesn't initialize new vma allocations to the right sequence number, but that was talked about here:
https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZN...
and it's a separate issue.
I'm going to bet one beer this is the issue.
The patch I'm responding to only consists of adding the call to vma_start_write and claims the 5% slowdown from it, while fixing crashes if the forking process is multithreaded.
For the fix to work it has to lock something against the parent.
	VMA_ITERATOR(old_vmi, oldmm, 0);
	[..]
	for_each_vma(old_vmi, mpnt) {
	[..]
		vma_start_write(mpnt);
the added line locks an obj in the parent's vm space.
The problem you linked looks like a pessimization for freshly allocated vmas, but that's what is being operated on here.
On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik mjguzik@gmail.com wrote:
The problem you linked looks like a pessimization for freshly allocated vmas, but that's what is being operated on here.
Sorry, now I'm having trouble understanding the problem you are describing. We are locking the parent's vma before copying it and the newly created vma is locked before it's added into the vma tree. What is the problem then?
On 8/5/23, Suren Baghdasaryan surenb@google.com wrote:
Sorry, now I'm having trouble understanding the problem you are describing. We are locking the parent's vma before copying it and the newly created vma is locked before it's added into the vma tree. What is the problem then?
Sorry for the late reply!
Looks like there has been a bunch of weird talking past one another in this thread, and I don't think trying to straighten it all out is worth any time.
I think at least the two of us agree that if a single-threaded process enters dup_mmap and down_writes the mmap semaphore, then no new thread can pop up in said process, thus no surprise page faults from that angle. 3rd parties are supposed to use interfaces like access_remote_vm, which down_reads said semaphore, and are consequently also not a problem. The only worry here is that someone is messing with another process's memory without the semaphore, but that is very unlikely and patchable in the worst case -- but someone(tm) has to audit. With all these conditions satisfied, one can elide vma_start_write for a perf win.
Finally, I think we agreed you are going to do the audit ;)
Cheers,
On Wed, Aug 9, 2023 at 2:07 PM Mateusz Guzik mjguzik@gmail.com wrote:
Finally, I think we agreed you are going to do the audit ;)
Ack. I'll look into this once the dust settles. Thanks!