On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit <nadav.amit@gmail.com> wrote:
[ I have in mind another solution, such as keeping in each page-table a “table-generation” which is the mm-generation at the time of the change, and only flushing if “table-generation” == “mm-generation”, but it requires some thought on how to avoid adding new memory barriers. ]
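A minimal userspace model of that idea (C11 atomics; all names here are hypothetical, and the memory-ordering question is exactly the part this sketch glosses over):

#include <stdatomic.h>
#include <stdbool.h>

struct mm_model {
        atomic_ulong generation;        /* bumped after each completed flush */
};

struct pt_model {
        atomic_ulong table_generation;  /* mm generation at the last change */
};

/* Record a PTE change: stamp the table with the current mm generation. */
static void note_pte_change(struct mm_model *mm, struct pt_model *pt)
{
        atomic_store(&pt->table_generation, atomic_load(&mm->generation));
}

/* After a TLB flush completes, advance the generation so that tables
 * stamped before the flush no longer match. */
static void note_flush_done(struct mm_model *mm)
{
        atomic_fetch_add(&mm->generation, 1);
}

/* Flush only if the table changed since the last completed flush. */
static bool table_needs_flush(struct mm_model *mm, struct pt_model *pt)
{
        return atomic_load(&pt->table_generation) ==
               atomic_load(&mm->generation);
}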
IOW: I think the change that you suggest is insufficient, and a proper solution is too intrusive for “stable”.
As for performance, I can add another patch later to remove the TLB flush that change_protection_range() performs unnecessarily when it only does permission promotion (a rough sketch of the check follows). I know that your concern is about the “protect” case, but I cannot think of a good immediate solution that avoids taking mmap_lock for write.
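Roughly, with a hypothetical helper (plain C; this ignores the dirty/accessed-bit subtleties a real patch would have to handle):

#include <stdbool.h>
#include <stdint.h>

/*
 * A stale TLB entry with fewer permissions than the PTE only causes a
 * spurious fault, so pure promotion (e.g. RO -> RW) can skip the flush.
 * A flush is needed only when some permission bit is dropped.
 */
static bool change_needs_flush(uint32_t old_prot, uint32_t new_prot)
{
        return (old_prot & ~new_prot) != 0;     /* some bit was removed */
}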
Thoughts?
On second thought (i.e., I don’t know what I was thinking), doing so -- checking mm_tlb_flush_pending() on every potentially dangerous PTE read and flushing if needed -- can lead to a huge number of TLB flushes and shootdowns, as the counter might stay elevated for a considerable amount of time.
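To spell out the pattern I am rejecting (again a userspace model with hypothetical names):

#include <stdatomic.h>
#include <stdbool.h>

struct mm_flush_model {
        atomic_int tlb_flush_pending;   /* models mm->tlb_flush_pending */
};

static void local_tlb_flush(void)
{
        /* stand-in for a real local TLB flush */
}

/*
 * Flushing on every potentially dangerous PTE read while the counter is
 * elevated: a long mprotect()/munmap() keeps the counter nonzero for its
 * whole duration, so every concurrent fault pays for another flush.
 */
static void pte_read_dangerous(struct mm_flush_model *mm)
{
        if (atomic_load(&mm->tlb_flush_pending) > 0)
                local_tlb_flush();
}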
I've lost track of whether we still think this particular problem is really a problem, but could we perhaps make the tlb_flush_pending field per-ptl instead of per-mm? Depending on how it gets used, it could plausibly be done without atomics or expensive barriers by using the PTL to protect the field.
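Something like this, as a userspace model (a pthread mutex stands in for the page-table lock; the names are made up):

#include <pthread.h>
#include <stdbool.h>

/*
 * The pending count lives next to the lock that already serializes
 * changes to that page table, so plain loads and stores under the lock
 * suffice -- no atomics or extra barriers. (ptl must be initialized
 * with PTHREAD_MUTEX_INITIALIZER or pthread_mutex_init().)
 */
struct ptl_model {
        pthread_mutex_t ptl;            /* models the page-table lock */
        unsigned int flush_pending;     /* protected by ptl */
};

static void inc_flush_pending(struct ptl_model *p)
{
        pthread_mutex_lock(&p->ptl);
        p->flush_pending++;
        pthread_mutex_unlock(&p->ptl);
}

static void dec_flush_pending(struct ptl_model *p)
{
        pthread_mutex_lock(&p->ptl);
        p->flush_pending--;
        pthread_mutex_unlock(&p->ptl);
}

/* Caller holds p->ptl, as the fault path already would. */
static bool flush_pending_locked(struct ptl_model *p)
{
        return p->flush_pending > 0;
}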
FWIW, x86 has a mm generation counter, and I don't think it would be totally crazy to find a way to expose an mm generation to core code. I don't think we'd want to expose the specific data structures that x86 uses to track it -- they're very tailored to the oddities of x86 TLB management.

x86 also doesn't currently have any global concept of which mm generation is guaranteed to have been propagated to all CPUs -- we track the generation in the pagetables and, per CPU, the generation that we know that CPU has seen.

x86 could offer a function "ensure that all CPUs catch up to mm generation G and don't return until this happens" and its relative "have all CPUs caught up to mm generation G", but these would need to look at data from multiple CPUs and would probably be too expensive on very large systems to use in normal page faults unless we were to cache the results somewhere. Making a nice cache for this is surely doable, but maybe more complexity than we'd want.
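As a rough model of the cheaper of the two (C11 atomics; the structure and names are invented for illustration, not what x86 actually keeps):

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8                 /* hypothetical */

struct mm_gen_model {
        atomic_ulong tlb_gen;                   /* latest required generation */
        atomic_ulong seen_gen[NR_CPUS_MODEL];   /* what each CPU has seen */
};

/*
 * "Have all CPUs caught up to mm generation g?" -- an O(nr_cpus) scan,
 * which is why it would likely be too expensive in normal page faults on
 * very large systems unless the result were cached somewhere.
 */
static bool all_cpus_caught_up(struct mm_gen_model *mm, unsigned long g)
{
        for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
                if (atomic_load(&mm->seen_gen[cpu]) < g)
                        return false;
        return true;
}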
--Andy