On Tue, Dec 22, 2020 at 09:56:11PM -0500, Andrea Arcangeli wrote:
On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
We are talking about non-COW anon pages here -- they can't be mapped more than once. So why not just identify them by checking page_mapcount == 1 and then unconditionally reuse them? (This is probably where I've missed things.)
The problem in depending on page_mapcount to decide if it's COW or non-COW (respectively wp_page_copy or wp_page_reuse) is that is GUP may elevate the count of a COW anon page that become a non-COW anon page.
This is Jann's idea not mine.
The problem is we have an unprivileged long term GUP like vmsplice that facilitates elevating the page count indefinitely, until the parent finally writes a secret to it. Theoretically a short term pin would do it too so it's not just vmpslice, but the short term pin would be incredibly more challenging to become a concern since it'd kill a phone battery and flash before it can read any data.
So what happens with your page_mapcount == 1 check is that it doesn't mean non-COW (we thought it did until it didn't for the long term gup pin in vmsplice).
Jann's testcases does fork() and set page_mapcount 2 and page_count to 2, vmsplice, take unprivileged infinitely long GUP pin to set page_count to 3, queue the page in the pipe with page_count elevated, munmap to drop page_count to 2 and page_mapcount to 1.
page_mapcount is 1, so you'd think the page is non-COW and owned by the parent, but the child can still read it so it's very much still wp_page_copy material if the parent tries to modify it. Otherwise the child can read the content.
This was supposed to be solvable by just doing the COW in gup(write=0) case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly sure why that didn't fly and it had to be reverted by Peter in a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was happening I was side tracked by urgent issues and I didn't manage to look back of how we ended up with the big hammer page_count == 1 check instead to decide if to call wp_page_reuse or wp_page_shared.
So anyway, the only thing that is clear to me is that keeping the child from reading the page_mapcount == 1 pages of the parent, is the only reason why wp_page_reuse(vmf) will only be called on page_count(page) == 1 and not on page_mapcount(page) == 1.
It's also the reason why your page_mapcount assumption will risk to reintroduce the issue, and I only wish we could put back page_mapcount == 1 back there.
Still even if we put back page_mapcount there, it is not ok to leave the page fault with stale TLB entries and to rely on the fact wp_page_shared won't run. It'd also avoid the problem but I think if you leave stale TLB entries in change_protection just like NUMA balancing does, it also requires a catcher just like NUMA balancing has, or it'd truly work by luck.
So until we can put a page_mapcount == 1 check back there, the page_count will be by definition unreliable because of the speculative lookups randomly elevating all non zero page_counts at any time in the background on all pages, so you will never be able to tell if a page is true COW or if it's just a spurious COW because of a speculative lookup. It is impossible to differentiate a speculative lookup from a vmsplice ref in a child.
Thanks for the details.
In your patch, do we need to take wrprotect_rwsem in handle_userfault() as well? Otherwise, it seems userspace would have to synchronize between its wrprotect ioctl and fault handler? i.e., the fault hander needs to be aware that the content of write- protected pages can actually change before the iotcl returns.