On Mon, Dec 21, 2020 at 8:16 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, Dec 21, 2020 at 7:19 PM Andy Lutomirski luto@kernel.org wrote:
Ugh, this is unpleasantly complicated.
What I *should* have said is that *because* userfaultfd is changing the VM layout, it should always act as if it had to take the mmap_sem for writing, and that the RW->RO change is an example of when downgrading that write-lock to a read lock is simply not valid - because it basically breaks the rules about what a lookup (ie a read) can do.
A lookup can never turn a writable page non-writable. A lookup - through COW - _can_ turn a read-only page writable. So that's why "RO->RW" can be ok under the read lock, but RW->RO is not.
Does that clarify the situation when I phrase it that way instead?
It's certainly more clear, but I'm still not thrilled with a bunch of functions in different files maintained by different people that cooperate using an undocumented lockless protocol. It makes me nervous from a maintainability standpoint even if the code is, in fact, entirely correct.
But I didn't make my question clear either: when I asked about mark_screen_rdonly(), I wasn't asking about locking or races at all. mark_screen_rdonly() will happily mark *any* PTE readonly. I can easily believe that writable mappings of non-shared mappings all follow the same CoW rules in the kernel and that, assuming all the locking is correct, marking them readonly is conceptually okay. But shared mappings are a whole different story. Is it actually the case that all, or even most, drivers and other sources of shared mappings will function correctly if one of their PTEs becomes readonly behind their back? Or maybe there are less bizarre ways for this to happen without vm86 shenanigans and this is completely safe after all.
P.S. This whole mess reminds me that I need to take a closer look at the upcoming SHSTK code. Shadow stack pages violate all common sense about how the various PTE bits work.