On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
Thanks for the details.
I hope we can find a way put the page_mapcount back where there's a page_count right now.
If you're so worried about having to maintain a all defined well documented (or to be documented even better if you ACK it) marker/catcher for userfaultfd_writeprotect, I can't see how you could consider to maintain the page fault safe against any random code leaving too permissive TLB entries out of sync of the more restrictive pte permissions as it was happening with clear_refs_write, which worked by luck until page_mapcount was changed to page_count.
page_count is far from optimal, but it is a feature it finally allowed us to notice that various code (clear_refs_write included apparently even after the fix) leaves stale too permissive TLB entries when it shouldn't.
The question is only which way you prefer to fix clear_refs_write and I don't think we can deviate from those 3 methods that already exist today. So clear_refs_write will have to pick one of those and currently it's not falling in the same category with mprotect even after the fix.
I think if clear_refs_write starts to take the mmap_write_lock and really start to operate like mprotect, only then we can consider to make userfaultfd_writeprotect also operate like mprotect.
Even then I'd hope we can at least be allowed to make it operate like KSM write_protect_page for len <= HPAGE_PMD_SIZE, something that clear_refs_write cannot do since it works in O(N) and tends to scan everything at once, so there would be no point to optimize not to defer the flush, for a process with a tiny amount of virtual memory mapped.
vm86 also should be fixed to fall in the same category with mprotect, since performance there is irrelevant.
Thanks, Andrea