The first patch is a fix with an explanation of the issue, you should read that first. The second patch adds a comment to document the rules because figuring this out from scratch causes brain pain.
Accidentally hitting this issue and getting negative consequences from it would require several stars to line up just right; but if someone out there is using a malloc() implementation that uses lockless data structures across threads or such, this could actually be a problem.
In case someone wants a testcase, here's a very artificial one:
``` #include <pthread.h> #include <err.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #include <sys/uio.h> #include <sys/mman.h> #include <sys/wait.h> #include <linux/io_uring.h>
#define SYSCHK(x) ({ \ typeof(x) __res = (x); \ if (__res == (typeof(x))-1) \ err(1, "SYSCHK(" #x ")"); \ __res; \ })
#define NUM_SQ_PAGES 4 static int uring_init(struct io_uring_sqe **sqesp, void **cqesp) { struct io_uring_sqe *sqes = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0)); void *cqes = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0)); *(volatile unsigned int *)(cqes+4) = 64 * NUM_SQ_PAGES; struct io_uring_params params = { .flags = IORING_SETUP_NO_MMAP|IORING_SETUP_NO_SQARRAY, .sq_off = { .user_addr = (unsigned long)sqes }, .cq_off = { .user_addr = (unsigned long)cqes } }; int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, ¶ms)); if (sqesp) *sqesp = sqes; if (cqesp) *cqesp = cqes; return uring_fd; }
static char *bufmem[0x3000] __attribute__((aligned(0x1000)));
static void *thread_fn(void *dummy) { unsigned long i = 0; while (1) { *(volatile unsigned long *)(bufmem + 0x0000) = i; *(volatile unsigned long *)(bufmem + 0x0f00) = i; *(volatile unsigned long *)(bufmem + 0x1000) = i; *(volatile unsigned long *)(bufmem + 0x1f00) = i; *(volatile unsigned long *)(bufmem + 0x2000) = i; *(volatile unsigned long *)(bufmem + 0x2f00) = i; i++; } }
int main(void) { #if 1 int uring_fd = uring_init(NULL, NULL); struct iovec reg_iov = { .iov_base = bufmem, .iov_len = 0x2000 }; SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_BUFFERS, ®_iov, 1)); #endif
pthread_t thread; if (pthread_create(&thread, NULL, thread_fn, NULL)) errx(1, "pthread_create");
sleep(1); int child = SYSCHK(fork()); if (child == 0) { printf("bufmem values:\n"); printf(" 0x0000: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x0000)); printf(" 0x0f00: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x0f00)); printf(" 0x1000: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x1000)); printf(" 0x1f00: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x1f00)); printf(" 0x2000: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x2000)); printf(" 0x2f00: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x2f00)); return 0; } int wstatus; SYSCHK(wait(&wstatus)); return 0; } ```
Without this series, the child will usually print results that are apart by more than 1, which is not a state that ever occurred in the parent; in my opinion, that counts as a bug. If you change the "#if 1" to "#if 0", the bug won't manifest.
Signed-off-by: Jann Horn jannh@google.com --- Jann Horn (2): mm/memory: ensure fork child sees coherent memory snapshot mm/memory: Document how we make a coherent memory snapshot
kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++ mm/memory.c | 18 ++++++++++++++++++ 2 files changed, 52 insertions(+) --- base-commit: 8477ab143069c6b05d6da4a8184ded8b969240f5 change-id: 20250530-fork-tearing-71da211a50cf
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
1. We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent. 2. After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
On top of that, this issue is only observable when pages are either DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024 references and the parent process having used DMA-pinning at least once before.
Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn jannh@google.com --- mm/memory.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c index 49199410805c..b406dfda976b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma /* * We have a prealloc page, all good! Take it * over and copy the page & arm it. + * + * One nasty aspect is that we could be in a multithreaded process or + * such, where another thread is in the middle of writing to memory + * while this thread is forking. As long as we're just marking PTEs as + * read-only to make copy-on-write happen *later*, that's easy; we just + * need to do a single TLB flush before dropping the mmap/VMA locks, and + * that's enough to guarantee that the child gets a coherent snapshot of + * memory. + * But here, where we're doing an immediate copy, we must ensure that + * threads in the parent process can no longer write into the page being + * copied until we're done forking. + * This means that we still need to mark the source PTE as read-only, + * with an immediate TLB flush. + * (To make the source PTE writable again after fork() is done, we can + * rely on the page fault handler to do that lazily, thanks to + * PageAnonExclusive().) */ + ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte); + flush_tlb_page(src_vma, addr);
if (copy_mc_user_highpage(&new_folio->page, page, addr, src_vma)) return -EHWPOISON;
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Um, OK, but isn't that expected behaviour? POSIX says:
: A process shall be created with a single thread. If a multi-threaded : process calls fork(), the new process shall contain a replica of the : calling thread and its entire address space, possibly including the : states of mutexes and other resources. Consequently, the application : shall ensure that the child process only executes async-signal-safe : operations until such time as one of the exec functions is successful.
It's always been my understanding that you really, really shouldn't call fork() from a multithreaded process.
On 03.06.25 20:29, Matthew Wilcox wrote:
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Um, OK, but isn't that expected behaviour? POSIX says:
: A process shall be created with a single thread. If a multi-threaded : process calls fork(), the new process shall contain a replica of the : calling thread and its entire address space, possibly including the : states of mutexes and other resources. Consequently, the application : shall ensure that the child process only executes async-signal-safe : operations until such time as one of the exec functions is successful.
It's always been my understanding that you really, really shouldn't call fork() from a multithreaded process.
I have the same recollection, but rather because of concurrent O_DIRECT and locking (pthread_atfork ...).
Using the allocator above example: what makes sure that no other thread is halfway through modifying allocator state? You really have to sync somehow before calling fork() -- e.g., grabbing allocator locks in pthread_atfork().
For Linux we document in the man page
"After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2)."
On Tue, Jun 3, 2025 at 8:37 PM David Hildenbrand david@redhat.com wrote:
On 03.06.25 20:29, Matthew Wilcox wrote:
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Um, OK, but isn't that expected behaviour? POSIX says:
: A process shall be created with a single thread. If a multi-threaded : process calls fork(), the new process shall contain a replica of the : calling thread and its entire address space, possibly including the : states of mutexes and other resources. Consequently, the application : shall ensure that the child process only executes async-signal-safe : operations until such time as one of the exec functions is successful.
It's always been my understanding that you really, really shouldn't call fork() from a multithreaded process.
I have the same recollection, but rather because of concurrent O_DIRECT and locking (pthread_atfork ...).
Using the allocator above example: what makes sure that no other thread is halfway through modifying allocator state? You really have to sync somehow before calling fork() -- e.g., grabbing allocator locks in pthread_atfork().
Yeah, like what glibc does for its malloc implementation to prevent allocator calls from racing with fork(), so that malloc() keeps working after fork(), even though POSIX says that the libc doesn't have to guarantee that.
For Linux we document in the man page
"After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2)."
On 03.06.25 21:09, Jann Horn wrote:
On Tue, Jun 3, 2025 at 8:37 PM David Hildenbrand david@redhat.com wrote:
On 03.06.25 20:29, Matthew Wilcox wrote:
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Um, OK, but isn't that expected behaviour? POSIX says:
: A process shall be created with a single thread. If a multi-threaded : process calls fork(), the new process shall contain a replica of the : calling thread and its entire address space, possibly including the : states of mutexes and other resources. Consequently, the application : shall ensure that the child process only executes async-signal-safe : operations until such time as one of the exec functions is successful.
It's always been my understanding that you really, really shouldn't call fork() from a multithreaded process.
I have the same recollection, but rather because of concurrent O_DIRECT and locking (pthread_atfork ...).
Using the allocator above example: what makes sure that no other thread is halfway through modifying allocator state? You really have to sync somehow before calling fork() -- e.g., grabbing allocator locks in pthread_atfork().
Yeah, like what glibc does for its malloc implementation to prevent allocator calls from racing with fork(), so that malloc() keeps working after fork(), even though POSIX says that the libc doesn't have to guarantee that.
I mean, the patch here is simple, and there is already a performance penalty when allocating+copying the page, so it's not really the common hot path.
Merely a question if this was ever officially supported and warrents a "Fixes:".
On Tue, Jun 3, 2025 at 8:29 PM Matthew Wilcox willy@infradead.org wrote:
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Um, OK, but isn't that expected behaviour? POSIX says:
I don't think it is expected behavior that locklessly-updated data structures in application code could break.
: A process shall be created with a single thread. If a multi-threaded : process calls fork(), the new process shall contain a replica of the : calling thread and its entire address space, possibly including the : states of mutexes and other resources. Consequently, the application : shall ensure that the child process only executes async-signal-safe : operations until such time as one of the exec functions is successful.
I think that is only talking about ways in which you can interact with libc, and does not mean that application code couldn't access its own lockless data structures or such.
Though admittedly that is a fairly theoretical point, since in practice the most likely place where you'd encounter this kind of issue would be in an allocator implementation or such.
It's always been my understanding that you really, really shouldn't call fork() from a multithreaded process.
On 03.06.25 21:03, Jann Horn wrote:
On Tue, Jun 3, 2025 at 8:29 PM Matthew Wilcox willy@infradead.org wrote:
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Um, OK, but isn't that expected behaviour? POSIX says:
I don't think it is expected behavior that locklessly-updated data structures in application code could break.
: A process shall be created with a single thread. If a multi-threaded : process calls fork(), the new process shall contain a replica of the : calling thread and its entire address space, possibly including the : states of mutexes and other resources. Consequently, the application : shall ensure that the child process only executes async-signal-safe : operations until such time as one of the exec functions is successful.
I think that is only talking about ways in which you can interact with libc, and does not mean that application code couldn't access its own lockless data structures or such.
Though admittedly that is a fairly theoretical point, since in practice the most likely place where you'd encounter this kind of issue would be in an allocator implementation or such.
I thought a bit further about that.
The thing is, that another thread in the parent might be doing something lockless using atomics etc. And in the parent, that thread will make progress as soon as fork() is done. In the child, that thread will never make progress again, as it is gone.
So the assumption must be that another thread possibly stalling forever and not making any progress will not affect the algorithm in the child.
I am not sure if that is actually the case for many lockless algorithms in allocators. I'm curious, do we have any examples of such algorithms, in particular, regarding allocators?
On 03.06.25 20:21, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may>
continue to be visible to the child while future writes to this page are
no longer visible to the child.
This means the child could theoretically see incoherent states where
allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Hmm, interesting.
On top of that, this issue is only observable when pages are either DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024 references and the parent process having used DMA-pinning at least once before.
Right.
Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn jannh@google.com
mm/memory.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c index 49199410805c..b406dfda976b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma /* * We have a prealloc page, all good! Take it * over and copy the page & arm it.
*
* One nasty aspect is that we could be in a multithreaded process or
* such, where another thread is in the middle of writing to memory
* while this thread is forking. As long as we're just marking PTEs as
* read-only to make copy-on-write happen *later*, that's easy; we just
* need to do a single TLB flush before dropping the mmap/VMA locks, and
* that's enough to guarantee that the child gets a coherent snapshot of
* memory.
* But here, where we're doing an immediate copy, we must ensure that
* threads in the parent process can no longer write into the page being
* copied until we're done forking.
* This means that we still need to mark the source PTE as read-only,
* with an immediate TLB flush.
* (To make the source PTE writable again after fork() is done, we can
* rely on the page fault handler to do that lazily, thanks to
*/* PageAnonExclusive().)
- ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
- flush_tlb_page(src_vma, addr);
Would we need something similar for hugetlb, or is that already handled?
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
This is an interesting problem, but we'll get to it later.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
Yes, and this is not fixable. It's also a problem for the regular write-protect pte path where inevitably only a part of the address space will be write-protected. This would only be fixable if e.g we suspended every thread on a multi-threaded fork.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Ok, so the issue here is that atomics + memcpy (or our kernel variants) will possibly observe tearing. This is indeed a problem, and POSIX doesn't _really_ tell us anything about this. _However_:
POSIX says:
Any locks held by any thread in the calling process that have been set to be process-shared shall not be held by the child process. For locks held by any thread in the calling process that have not been set to be process-shared, any attempt by the child process to perform any operation on the lock results in undefined behavior (regardless of whether the calling process is single-threaded or multi-threaded).
The interesting bit here is "For locks held by any thread [...] any attempt by the child [...] results in UB". I don't think it's entirely far-fetched to say the spirit of the law is that atomics may also be UB (just like a lock[1] that was held by a separate thread, then unlocked mid-concurrent-fork is in a UB state).
In any way, I think the bottom-line is that fork memory snapshot coherency is a fallacy. It's really impossible to reach without adding insane constraints (like the aforementioned thread suspending + resume). It's not even possible when going through normal write-protect paths that have been conceptually stable since the BSDs in the 1980s (due to the write-protect-a-page-at-a-time-problem).
Thus, personally I don't think this is worth fixing.
[1] This (at least in theory) covers every lock, so it also encompasses pthread spinlocks
On Tue, Jun 3, 2025 at 10:32 PM Pedro Falcato pfalcato@suse.de wrote:
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
This is an interesting problem, but we'll get to it later.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
Yes, and this is not fixable. It's also a problem for the regular write-protect pte path where inevitably only a part of the address space will be write-protected.
I don't understand what you mean by "inevitably only a part of the address space will be write-protected". Are you talking about how shared pages are kept shared between parent in child? Or are you talking about how there is a point in time at which part of the address space is write-protected while another part is not yet write-protected? In that case: Yes, that can happen, but that's not a problem.
This would only be fixable if e.g we suspended every thread on a multi-threaded fork.
No, I think it is fine to keep threads running in parallel on a multi-threaded fork as long as all the writes they do are guaranteed to also be observable in the child. Such writes are no different from writes performed before fork().
It would only get problematic if something in the parent first wrote to page A, which has already been copied to the child (so the child no longer sees the write) and then wrote to page B, which is CoWed (so the child would see the write). I prevent this scenario by effectively suspending the thread that tries to write to page A until the fork is over (by making it block on the mmap lock in the fault handling path).
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
Ok, so the issue here is that atomics + memcpy (or our kernel variants) will possibly observe tearing. This is indeed a problem, and POSIX doesn't _really_ tell us anything about this. _However_:
POSIX says:
Any locks held by any thread in the calling process that have been set to be process-shared shall not be held by the child process. For locks held by any thread in the calling process that have not been set to be process-shared, any attempt by the child process to perform any operation on the lock results in undefined behavior (regardless of whether the calling process is single-threaded or multi-threaded).
The interesting bit here is "For locks held by any thread [...] any attempt by the child [...] results in UB". I don't think it's entirely far-fetched to say the spirit of the law is that atomics may also be UB (just like a lock[1] that was held by a separate thread, then unlocked mid-concurrent-fork is in a UB state).
I think interpreting atomic operations as locks is far-fetched. Also, POSIX is a sort of minimal bar, and if we only implemented things explicitly required by POSIX, we might not have a particularly useful operating system.
Besides, I think things specified by the C standard override whatever POSIX says, and C23 specifies that there are atomic operations, and I haven't seen anything in C23 that restricts availability of those to before fork().
In any way, I think the bottom-line is that fork memory snapshot coherency is a fallacy. It's really impossible to reach without adding insane constraints (like the aforementioned thread suspending + resume). It's not even possible when going through normal write-protect paths that have been conceptually stable since the BSDs in the 1980s (due to the write-protect-a-page-at-a-time-problem).
No, Linux already had memory snapshot coherency before commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes"). Write-protecting a page at a time does not cause coherency issues, because letting a concurrent thread write into such memory during fork() is no different from letting it do so before fork() from a memory coherency perspective, as long as fork() write-locks memory management for the process.
On Wed, Jun 04, 2025 at 05:41:47PM +0200, Jann Horn wrote:
On Tue, Jun 3, 2025 at 10:32 PM Pedro Falcato pfalcato@suse.de wrote:
On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
This is an interesting problem, but we'll get to it later.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
Yes, and this is not fixable. It's also a problem for the regular write-protect pte path where inevitably only a part of the address space will be write-protected.
I don't understand what you mean by "inevitably only a part of the address space will be write-protected". Are you talking about how shared pages are kept shared between parent in child? Or are you talking about how there is a point in time at which part of the address space is write-protected while another part is not yet write-protected? In that case: Yes, that can happen, but that's not a problem.
This would only be fixable if e.g we suspended every thread on a multi-threaded fork.
No, I think it is fine to keep threads running in parallel on a multi-threaded fork as long as all the writes they do are guaranteed to also be observable in the child. Such writes are no different from writes performed before fork().
It would only get problematic if something in the parent first wrote to page A, which has already been copied to the child (so the child no longer sees the write) and then wrote to page B, which is CoWed (so the child would see the write). I prevent this scenario by effectively suspending the thread that tries to write to page A until the fork is over (by making it block on the mmap lock in the fault handling path).
Ah yes, I see my mistake - we write lock all VMAs as we dup them, so the problem I described can't happen. Thanks for the explanation :)
On 6/3/25 20:21, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
On top of that, this issue is only observable when pages are either DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024 references and the parent process having used DMA-pinning at least once before.
Seems the changelog seems to be missing the part describing what it's doing to fix the issue? Some details are not immediately obvious (the writing threads become blocked in page fault) as the conversation has shown.
Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn jannh@google.com
Given how the fix seems to be localized to the already rare slowpath and doesn't require us to pessimize every trivial fork(), it seems reasonable to me even if don't have a concrete example of a sane code in the wild that's broken by the current behavior, so:
Acked-by: Vlastimil Babka vbabka@suse.cz
mm/memory.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c index 49199410805c..b406dfda976b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma /* * We have a prealloc page, all good! Take it * over and copy the page & arm it.
*
* One nasty aspect is that we could be in a multithreaded process or
* such, where another thread is in the middle of writing to memory
* while this thread is forking. As long as we're just marking PTEs as
* read-only to make copy-on-write happen *later*, that's easy; we just
* need to do a single TLB flush before dropping the mmap/VMA locks, and
* that's enough to guarantee that the child gets a coherent snapshot of
* memory.
* But here, where we're doing an immediate copy, we must ensure that
* threads in the parent process can no longer write into the page being
* copied until we're done forking.
* This means that we still need to mark the source PTE as read-only,
* with an immediate TLB flush.
* (To make the source PTE writable again after fork() is done, we can
* rely on the page fault handler to do that lazily, thanks to
*/* PageAnonExclusive().)
- ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
- flush_tlb_page(src_vma, addr);
if (copy_mc_user_highpage(&new_folio->page, page, addr, src_vma)) return -EHWPOISON;
On Thu, Jun 05, 2025 at 09:33:24AM +0200, Vlastimil Babka wrote:
On 6/3/25 20:21, Jann Horn wrote:
When fork() encounters possibly-pinned pages, those pages are immediately copied instead of just marking PTEs to make CoW happen later. If the parent is multithreaded, this can cause the child to see memory contents that are inconsistent in multiple ways:
- We are copying the contents of a page with a memcpy() while userspace may be writing to it. This can cause the resulting data in the child to be inconsistent.
- After we've copied this page, future writes to other pages may continue to be visible to the child while future writes to this page are no longer visible to the child.
This means the child could theoretically see incoherent states where allocator freelists point to objects that are actually in use or stuff like that. A mitigating factor is that, unless userspace already has a deadlock bug, userspace can pretty much only observe such issues when fancy lockless data structures are used (because if another thread was in the middle of mutating data during fork() and the post-fork child tried to take the mutex protecting that data, it might wait forever).
On top of that, this issue is only observable when pages are either DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024 references and the parent process having used DMA-pinning at least once before.
Seems the changelog seems to be missing the part describing what it's doing to fix the issue? Some details are not immediately obvious (the writing threads become blocked in page fault) as the conversation has shown.
Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn jannh@google.com
Given how the fix seems to be localized to the already rare slowpath and doesn't require us to pessimize every trivial fork(), it seems reasonable to me even if don't have a concrete example of a sane code in the wild that's broken by the current behavior, so:
Acked-by: Vlastimil Babka vbabka@suse.cz
Acked-by: Pedro Falcato pfalcato@suse.de
linux-stable-mirror@lists.linaro.org