Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
We have observed a huge latency increase in `fork()` after ingesting the fix for CVE-2025-38085, which brings in commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, a process that mmaps 1.2TB of shared memory and forks dozens or hundreds of times sees execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - for example:

- changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow),
- batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather,
- doing something special for the unmap path, or
- changing the semantics of hugetlb page tables such that they can never turn into normal page tables again.

However, I'm not planning to work on optimizing this.