On Tue, Dec 22, 2020 at 05:23:39PM -0700, Yu Zhao wrote:
> and 2) people are spearheading multiple efforts to reduce the mmap_lock contention, which hopefully would make ufd users suffer less soon.
In my view UFFD is an already deployed working solution that eliminates the mmap_lock_write contention to allocate and free memory.
We need to add a UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page or THP and map it, it'll be indistinguishable from UFFDIO_ZEROPAGE, but it will solve the last performance bottleneck by avoiding a spurious wrprotect fault after the allocation).
After that, malloc based on uffd should become competitive single threaded, and it won't ever require the mmap_lock_write, so allocations and freeing of memory can continue indefinitely from all threads in parallel. There will never be another mmap or munmap stalling all threads.
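To make the idea concrete, here is a rough sketch of what such an allocator backend could look like with what exists today. The uffd_arena_* names and the struct are made up for illustration, and since UFFDIO_POPULATE does not exist it uses UFFDIO_ZEROPAGE, so the first write to each page still takes the extra fault. It assumes a kernel with UFFD_FEATURE_SIGBUS and, for the unprivileged case, UFFD_USER_MODE_ONLY.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1   /* added in Linux 5.11 */
#endif

/* Hypothetical arena: one mapping created once and never munmap()ed. */
struct uffd_arena {
    int uffd;
    unsigned char *base;
    size_t page_size;
    size_t npages;
};

/* Returns 0 on success, -1 if userfaultfd or SIGBUS mode is unavailable. */
int uffd_arena_init(struct uffd_arena *a, size_t npages)
{
    struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_SIGBUS };
    struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_MISSING };

    a->page_size = (size_t)sysconf(_SC_PAGESIZE);
    a->npages = npages;

    a->uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
    if (a->uffd < 0)    /* older kernels reject the flag, retry without it */
        a->uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
    if (a->uffd < 0)
        return -1;

    /* SIGBUS mode: a missing page raises SIGBUS instead of queueing an event. */
    if (ioctl(a->uffd, UFFDIO_API, &api) != 0)
        goto fail_fd;

    /* The only mmap() of the arena's lifetime. */
    a->base = mmap(NULL, npages * a->page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (a->base == MAP_FAILED)
        goto fail_fd;

    reg.range.start = (uintptr_t)a->base;
    reg.range.len = npages * a->page_size;
    if (ioctl(a->uffd, UFFDIO_REGISTER, &reg) != 0)
        goto fail_map;
    return 0;

fail_map:
    munmap(a->base, npages * a->page_size);
fail_fd:
    close(a->uffd);
    return -1;
}

/* "Allocate": map zero pages into the range, no mmap() involved. */
void *uffd_arena_alloc(struct uffd_arena *a, size_t first_page, size_t count)
{
    struct uffdio_zeropage zp = {
        .range = { .start = (uintptr_t)a->base + first_page * a->page_size,
                   .len = count * a->page_size },
    };

    if (ioctl(a->uffd, UFFDIO_ZEROPAGE, &zp) != 0)
        return NULL;
    return a->base + first_page * a->page_size;
}

/* "Free": drop the pages, no munmap(); later access gets SIGBUS. */
int uffd_arena_free(struct uffd_arena *a, size_t first_page, size_t count)
{
    return madvise(a->base + first_page * a->page_size,
                   count * a->page_size, MADV_DONTNEED);
}
```

A real malloc would of course put its own free lists and size classes on top; the point is only that after uffd_arena_init() the allocation and free paths are an ioctl and a madvise, neither of which needs the mmap_lock for writing.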
This is not why uffd was created, it's just a secondary performance benefit of uffd, but it's still a relevant benefit in my view.
Every time I hear of people with major mmap_lock_write issues I recommend uffd, but you know, until we add the UFFDIO_POPULATE, it will still have higher fixed allocation overhead because of the wrprotect fault after UFFDIO_ZEROPAGE. UFFDIO_COPY also would not be as optimal as a clear_page and currently it's not even THP capable.
In addition you'll get a SIGBUS after a use after free. It's not like when you have a malloc lib doing MADV_DONTNEED at PAGE_SIZE granularity to rate limit the costly munmap, and then the app does a use after free and it reads zero or writes to a newly faulted in page.
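For comparison, the silent behavior without uffd is easy to reproduce. This is just a small illustration, with read_after_dontneed() being a made-up helper name: the page is written, released with MADV_DONTNEED as a malloc lib would do, and then read again.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the byte read back from a private anonymous page after it was
 * "freed" with MADV_DONTNEED, or -1 if any step fails.
 */
int read_after_dontneed(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int value;

    if (page_size <= 0)
        return -1;

    p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xaa, (size_t)page_size);         /* application data */

    /* What the malloc lib does on free() instead of munmap(). */
    if (madvise(p, (size_t)page_size, MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page_size);
        return -1;
    }

    /* Use after free: no signal, the kernel faults in a zero page. */
    value = *(volatile unsigned char *)p;

    munmap(p, (size_t)page_size);
    return value;
}
```

On Linux this returns 0, not 0xaa and not a signal, so the bug goes unnoticed. With the range registered in uffd in SIGBUS mode the same read would be caught.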
The above will not require any special privilege and all allocated virtual memory remains fully swappable, because SIGBUS mode will never have to block any kernel initiated faults.
uffd-wp also is totally usable unprivileged by default to replace various software dirty bits with the info provided in O(1) instead of O(N), as long as the writes are done in userland, again unprivileged by default, without tweaking any sysctl and with zero risk of increasing reproducibility of any exploit against unrelated random kernel bugs.
So if we're forced to take the mmap_lock_write it'd be cool if at least we can avoid it for 1 single pte or hugepmd wrprotection, as happens in KSM's write_protect_page().
Thanks, Andrea