On Mon, Dec 21, 2020 at 2:55 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> So as an alternative solution, I can do copying under the PTL after flushing, which seems to solve the problem.
I think that's a valid model, but note that we do the "re-check ptl" in a *completely* different part than we do the actual PTE install.
Note that the "Re-validate under PTL" code in cow_user_page() is *not* the "now we are installing the copy". No, that's actually for the "uhhuh, the copy using the virtual address outside the ptl failed, now we need to do something special".
The real "we hold the ptl" actually happens in wp_page_copy(), after cow_user_page() has already returned.
So you'd have to change how all of that works.
And honestly, I'm not sure it's worth it - if this was the *only* case, then yes. But that whole "we load the original pte first, then we do whatever we _think_ we will need to do, and then we install the final pte after checking" is actually the case for every other page fault handling case too.
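To make that pattern concrete, here is a rough user-space sketch of "load the pte, do the work unlocked, re-check under the lock, then install". It is only a model, not kernel code: struct fake_mm, ptl_lock(), ptl_unlock(), handle_fault() and the two prepare_*() callbacks are hypothetical names invented for illustration, the "lock" is just a flag, and the plain == comparison stands in for what the kernel does with pte_same().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model types: one pte and a flag standing in for the ptl. */
typedef uint64_t pte_t;

struct fake_mm {
	bool ptl_held;	/* models the page table lock being held */
	pte_t pte;	/* the single page table entry we model */
};

static void ptl_lock(struct fake_mm *mm)
{
	assert(!mm->ptl_held);
	mm->ptl_held = true;
}

static void ptl_unlock(struct fake_mm *mm)
{
	assert(mm->ptl_held);
	mm->ptl_held = false;
}

/* The work we _think_ we need to do; it runs without the lock held. */
typedef pte_t (*prepare_fn)(struct fake_mm *mm, pte_t orig);

/* Returns true if the new pte was installed, false if the fault must be retried. */
static bool handle_fault(struct fake_mm *mm, prepare_fn prepare)
{
	pte_t orig = mm->pte;			/* load the original pte, no lock */
	pte_t new_pte = prepare(mm, orig);	/* e.g. allocate and copy a page */
	bool installed = false;

	ptl_lock(mm);
	if (mm->pte == orig) {			/* re-check, in the spirit of pte_same() */
		mm->pte = new_pte;		/* install the final pte */
		installed = true;
	}
	ptl_unlock(mm);

	return installed;
}

/* Nothing changes the pte while we prepare. */
static pte_t prepare_quiet(struct fake_mm *mm, pte_t orig)
{
	(void)mm;
	return orig | 0x2;
}

/* Simulates somebody else changing the pte while we were preparing. */
static pte_t prepare_raced(struct fake_mm *mm, pte_t orig)
{
	mm->pte = orig ^ 0x100;
	return orig | 0x2;
}
```

With prepare_quiet() the re-check passes and the new entry goes in; with prepare_raced() the re-check fails and the prepared work is thrown away. Note what the re-check does and does not cover: it only notices that the pte value changed, and says nothing about whether what was done outside the lock was still valid.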
So are we sure the COW case is so special?
I really think this is clearly just a userfaultfd bug that we hadn't realized until now, and had possibly been hidden by timings or other random stuff before.
Linus