Hello Linus,
On Wed, Dec 23, 2020 at 03:39:53PM -0800, Linus Torvalds wrote:
> On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > Thanks for the details.
> > I hope we can find a way to put the page_mapcount back where there's a page_count right now.
> I really don't think that's ever going to happen - at least if you're talking about do_wp_page().
Yes, I was referring to do_wp_page().
> I refuse to make the *simple* core VM operations have to jump through hoops, and the old COW mapcount logic was that. I *much* prefer the newer COW code, because the rules are more straightforward.
Agreed. I don't have any practical alternative proposal in fact; it was only a hypothetical wish/hope, echoing Hugh's view on this matter.
> > page_count is far from optimal, but it is a feature: it finally allowed us to notice that various code (clear_refs_write included, apparently even after the fix) leaves stale, too-permissive TLB entries when it shouldn't.
> I absolutely agree that page_count isn't exactly optimal, but "mapcount" is just so much worse.
Agreed. The "isn't exactly optimal" part is the only reason for that wish/hope.
> page_count() is at least _logical_, and has a very clear meaning: "this page has other users". mapcount() means something else, and is simply not sufficient or relevant wrt COW.
> That doesn't mean that page_mapcount() is wrong - it's just that it's wrong for COW. page_mapcount() is great for rmap, so that we can see when we need to shoot down a memory mapping of a page that gets released (truncate being the classic example).
> I think that the mapcount games we used to have were horrible. I absolutely much prefer where we are now wrt COW.
> The modern rules for COW handling are:
>  - if we got a COW fault there might be another user, we copy (and this is where the page count makes so much logical sense).
>  - if somebody needs to pin the page in the VM, we either make sure that it is pre-COWed and we
>      (a) either never turn it back into a COW page (ie the fork()-time stuff we have for pinned pages)
>      (b) or there is some explicit marker on the page or in the page table (ie the userfaultfd_pte_wp thing).
> those are _so_ much more straightforward than the very complex rules we used to have that were actively buggy, in addition to requiring the page lock. So they were buggy and slow.
Agreed, this is a very solid solution that solves the problem the mapcount had with GUP pins.
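To make the difference concrete, here's a deliberately simplified, user-space-style sketch of the two reuse policies for a wp fault on an anonymous page; it is not the real do_wp_page() code, the names below are made up and only model the shape of the decision:

/* Toy model, not kernel code: "refcount" stands in for page_count(),
 * "mapcount" for page_mapcount(). */
#include <stdbool.h>

struct page_model {
	int refcount;	/* every reference: ptes, GUP pins, I/O, caches */
	int mapcount;	/* pte mappings only: blind to GUP pins */
};

/* New rule: reuse the page in place only if we hold the sole reference;
 * any extra reference (e.g. a GUP pin) forces a copy instead. */
static bool can_reuse_page_count(const struct page_model *page)
{
	return page->refcount == 1;
}

/* Old rule (roughly): reuse if no other pte maps it.  A GUP pin doesn't
 * bump the mapcount, so a still-pinned page could be reused and later
 * writes would become visible to whoever holds the pin. */
static bool can_reuse_mapcount(const struct page_model *page)
{
	return page->mapcount == 1;
}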
The only alternative (but very vapourware) thought I had on this matter is that all the trouble starts when we let a GUP-pinned page be unmapped from the pgtables.
As I already told Jens, io_uring could use mmu notifiers already (that would make any io_uring GUP pin stop counting with regard to the page_mapcount vs page_count difference), and vmsplice also needs some handling, or maybe it should become privileged.
The above two improvements are orthogonal to the COW issue, since long-term GUP pins do more than just mlock and they break most advanced VM features. It's not ok to just account GUP pins against the mlock rlimit.
The above two improvements, however, would go in the direction of making mapcount more suitable for COW, but it still wouldn't be enough.
Transient GUP pins would also need to be taken care of, but for those we could simply wait until they are released, since they are guaranteed to be released (otherwise something else, not just munmap()/MADV_DONTNEED, would end up stuck in D state).
Anyway, until io_uring and vmsplice are solved, it's pointless to even wonder about the transient GUP pins.
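For reference, the io_uring side I have in mind is just the standard mmu notifier hookup, something along these lines (only a sketch; the my_* names are made up and this is not io_uring code):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_pin_ctx {			/* made-up container for the pins */
	struct mmu_notifier mn;
};

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	/* Drop or re-establish the GUP pins covering
	 * [range->start, range->end) before the ptes go away, so the pin
	 * no longer needs to be visible to the COW logic. */
	return 0;
}

static const struct mmu_notifier_ops my_pin_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

static int my_pin_ctx_register(struct my_pin_ctx *ctx, struct mm_struct *mm)
{
	ctx->mn.ops = &my_pin_ops;
	/* paired with mmu_notifier_unregister() on teardown */
	return mmu_notifier_register(&ctx->mn, mm);
}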
> And yes, I had forgotten about that userfaultfd_pte_wp() because I was being myopic and only looking at wp_page_copy(). So using that as a way to make sure that a page doesn't go through COW is a good way to avoid the COW race, but I think that thing requires a bit in the page table which might be a problem on some architectures?
Correct, it requires a bit in the pgtable.
The bit is required in order to disambiguate which regions have been marked for wrprotect faults and which haven't.
A practical example would be fork() called by an app that has a very large vma with VM_UFFD_WP set.
There would be a storm of WP userfaults if we didn't have the software bit to detect exactly which ptes were wrprotected by uffd-wp and which ones were wrprotected by fork.
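At wp fault time the disambiguation is conceptually along these lines (simplified; the real check lives at the top of do_wp_page(), and do_normal_cow() here is just a placeholder name for the regular COW path):

#include <linux/mm.h>
#include <linux/userfaultfd_k.h>

static vm_fault_t do_normal_cow(struct vm_fault *vmf);	/* placeholder */

static vm_fault_t my_wp_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
		/* The software bit is set: this pte was wrprotected by
		 * UFFDIO_WRITEPROTECT, hand the fault to the uffd monitor. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return handle_userfault(vmf, VM_UFFD_WP);
	}

	/* No software bit: the wrprotection came from fork() (or similar),
	 * so this is an ordinary COW fault. */
	return do_normal_cow(vmf);
}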
That bit then sends the COW fault to a safe place if wrprotect is running (the problem is that we didn't see that the un-wrprotect would clear the bit while the TLB flush of the wrprotect could still be pending).
The page_mapcount, I guess, hid that race to begin with, just like it did in clear_refs_write.
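To spell out the kind of race I mean (very simplified; the real interleaving in the clear_refs_write report is more involved):

  CPU0 (wrprotect side)                      CPU1 (other userspace thread)
  ---------------------                      -----------------------------
  clears the pte write bit,
  TLB flush still deferred
                                             writes through the stale,
                                             still-writable TLB entry
  acts as if the page is already
  write-protected (e.g. a wp fault
  copies it, or the write is assumed
  to be impossible)
                                             the late write hits the old
                                             page and goes unnoticed/lost
  flush_tlb_range() finally runs, too late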
uffd-wp is similar to soft dirty: it won't work well without a pgtable software bit. The software bit, to avoid storms of false positives during memory pressure, also survives swapin/swapout, again like soft dirty.
The bit requirement is handled through a per-arch option, selected in arch/x86/Kconfig similarly to soft dirty:
    select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
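With that option selected, the arch has to provide the pte accessors for the software bit; on x86_64 they boil down to something like the below (simplified from arch/x86/include/asm/pgtable.h, which also provides the *_swp_* variants so the bit survives swapout like soft dirty):

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
	return pte_set_flags(pte, _PAGE_UFFD_WP);	/* spare software bit */
}

static inline int pte_uffd_wp(pte_t pte)
{
	return pte_flags(pte) & _PAGE_UFFD_WP;
}

static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
	return pte_clear_flags(pte, _PAGE_UFFD_WP);
}
#endif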
From a high level, the extra bit in the pte/hugepmd could be seen as holding the information that would otherwise be generated by split_vma() followed by clearing VM_WRITE in the middle vma. Setting that bit is cheaper at runtime than allocating two more vmas, with their associated locking and rbtree size increase, while providing the same accuracy.
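In pseudocode the equivalence is roughly the below; both helpers are made up, it's only to show what information the bit encodes and where the cost difference comes from:

/* Purely illustrative pseudocode: two ways to record which part of the
 * mapping was explicitly wrprotected by UFFDIO_WRITEPROTECT. */

/* Hypothetical vma-based alternative: two extra vmas per range, with the
 * rbtree growth and locking that come with them. */
static void mark_range_with_vmas(struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	split_vma_at(mm, start);				/* made-up helper */
	split_vma_at(mm, end);					/* made-up helper */
	vma_covering(mm, start)->vm_flags &= ~VM_WRITE;		/* made-up helper */
}

/* What uffd-wp does instead: one software bit per pte/hugepmd. */
static void mark_range_with_pte_bit(struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE)
		pte_wrprotect_and_mkuffd_wp(mm, addr);		/* made-up helper */
}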
Thanks, Andrea