On Jan 17, 2021, at 1:16 AM, Yu Zhao <yuzhao@google.com> wrote:
On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
On Jan 16, 2021, at 8:41 PM, Yu Zhao <yuzhao@google.com> wrote:
On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:

I will send an RFC soon for per-table deferred TLB flush tracking. The basic idea is to save a generation in the page-struct that tracks when a deferred PTE change took place, and to track whenever a TLB flush completed. In addition, other users - such as mprotect - would use the tlb_gather interface.

Unfortunately, due to limited space in the page-struct this would only be possible for 64-bit (and my implementation is only for x86-64).
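
[ For readers skimming the thread: a minimal sketch of the scheme being
described, under the stated 64-bit constraint. All names here (pt_sketch,
deferred_tlb_gen, defer_pte_flush(), pte_flush_pending()) are illustrative
only, not from any posted patch. ]

/* 64-bit only: the page-struct has a spare word to hold the generation. */
struct pt_sketch {
	u64 deferred_tlb_gen;	/* mm-wide tlb_gen when a PTE change was deferred */
};

/* Record that a PTE change in this page table has not been flushed yet. */
static void defer_pte_flush(struct pt_sketch *pt, u64 cur_gen)
{
	pt->deferred_tlb_gen = cur_gen;
}

/*
 * A flush is still owed for this table if the last completed flush
 * generation has not caught up with the deferred one.
 */
static bool pte_flush_pending(struct pt_sketch *pt, u64 completed_gen)
{
	return pt->deferred_tlb_gen > completed_gen;
}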
I don't want to discourage you, but I don't think this would end up well. PPC doesn't necessarily follow the one-page-struct-per-table rule, and I've run into problems with this before while trying to do something similar.
Discourage, discourage. Better now than later.
It will be relatively easy to extend the scheme to be per-VMA instead of per-table for architectures that prefer it this way. It does require TLB-generation tracking though, which Andy only implemented for x86, so I will focus on x86-64 right now.
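
[ For reference, the x86 TLB-generation tracking referred to here,
condensed from arch/x86/include/asm/tlbflush.h: ]

static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
{
	/*
	 * Bump the mm-wide generation. atomic64_inc_return() is a full
	 * memory barrier; each CPU keeps a per-CPU copy of the generation
	 * and compares it against this value to tell whether its TLB is
	 * stale for this mm.
	 */
	return atomic64_inc_return(&mm->context.tlb_gen);
}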
Can you remind me of what we're missing on arm64 in this area, please? I'm happy to help get this up and running once you have something I can build on.
I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH. Would it be something worth pursuing? Arm has been using mm_cpumask, so it might not be too difficult I guess?
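
[ The arch hooks behind ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH are small;
a sketch of what an arm64 port would need to provide, mirroring the x86
shape. The struct matches x86's arch/x86/include/asm/tlbbatch.h; the
flush body and flush_tlb_batched_cpus() are only a guess. ]

struct arch_tlbflush_unmap_batch {
	/* CPUs that may hold stale entries for the unmapped pages. */
	struct cpumask cpumask;
};

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm)
{
	/* Accumulate the CPUs this mm has run on; flushed in one go later. */
	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}

void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* One flush covering everything batched since the last one. */
	flush_tlb_batched_cpus(&batch->cpumask);	/* illustrative helper */
	cpumask_clear(&batch->cpumask);
}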
[ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
IIUC, there are at least two bugs in the x86 implementation.
First, there is a missing memory barrier in tlbbatch_add_mm() between inc_mm_tlb_gen() and the read of mm_cpumask().
In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its comment says -- atomic update ops that return values are also full memory barriers.
Yes, you are correct.
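
[ The function in question, condensed from the x86 headers; the full
barrier lives inside inc_mm_tlb_gen(): ]

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm)
{
	/*
	 * inc_mm_tlb_gen() is atomic64_inc_return() underneath, and an
	 * atomic RMW op that returns a value is a full memory barrier,
	 * so the generation bump is ordered before the cpumask read
	 * below -- no extra smp_mb() needed.
	 */
	inc_mm_tlb_gen(mm);
	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}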
Second, try_to_unmap_flush() clears flush_required after flushing. Another thread can call set_tlb_ubc_flush_pending() after the flush and before flush_required is cleared, and the indication that a TLB flush is pending can be lost.
This isn't a problem either because flush_required is per thread.
Sorry, I meant mm->tlb_flush_batched. It is not per-thread. flush_tlb_batched_pending() clears it after the flush, so an indication that set_tlb_ubc_flush_pending() sets in between can be lost.
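
[ To make the lost-update window concrete, condensed from mm/rmap.c of
that era; (A) and (B) mark the racing accesses: ]

static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
	tlb_ubc->flush_required = true;

	/* (A) note that this mm has a batched, not-yet-flushed unmap */
	mm->tlb_flush_batched = true;

	if (writable)
		tlb_ubc->writable = true;
}

void flush_tlb_batched_pending(struct mm_struct *mm)
{
	if (mm->tlb_flush_batched) {
		flush_tlb_mm(mm);

		/*
		 * (B) If another thread runs (A) right here -- after the
		 * flush above, before the store below -- its pending
		 * indication is wiped out without ever being flushed.
		 */
		mm->tlb_flush_batched = false;
	}
}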
I am working on addressing these issues among others, but, as you already saw, I am a bit slow.
On a different but related topic: another thing I noticed is that Arm does not batch TLB flushes across VMAs. Since Arm does not have its own tlb_end_vma(), it uses the default tlb_end_vma(), which flushes each VMA separately. Peter Zijlstra's comment says that there are advantages in flushing each VMA separately, but I am not sure whether it is better or intentional (especially since x86 does not do so).
I am trying to remove the arch-specific tlb_end_vma() and have a config option to control this behavior.
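
[ The default being discussed, condensed from include/asm-generic/tlb.h;
the comment is the one by Peter Zijlstra referred to above: ]

#ifndef tlb_end_vma
static inline void tlb_end_vma(struct mmu_gather *tlb,
			       struct vm_area_struct *vma)
{
	if (tlb->fullmm)
		return;

	/*
	 * Do a TLB flush and reset the range at VMA boundaries; this
	 * avoids the ranges growing with the unused space between
	 * consecutive VMAs, but also the mmu_gather::vma_* flags from
	 * tlb_start_vma() rely on this.
	 */
	tlb_flush_mmu_tlbonly(tlb);
}
#endif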
One thing worth noting is that not all arm/arm64 hardware versions support range flushes (see system_supports_tlb_range()). But IIUC what you are trying to do, this isn't a problem.
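
[ For reference, the capability check splits the arm64 flush path roughly
as below. Only system_supports_tlb_range() is real here; tlbi_range_sketch()
and tlbi_page_sketch() stand in for the actual TLBI instruction sequences: ]

static void flush_range_sketch(unsigned long start, unsigned long end)
{
	if (system_supports_tlb_range()) {
		/* ARMv8.4-TLBI: one ranged invalidation covers it all. */
		tlbi_range_sketch(start, end);
	} else {
		/* Older cores: invalidate one page at a time. */
		unsigned long addr;

		for (addr = start; addr < end; addr += PAGE_SIZE)
			tlbi_page_sketch(addr);
	}
}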
I just wanted to get rid of the arch-specific tlb_start_vma() in order to clean up the code (I am doing the VMA-deferred tracking first, as you asked). While I was doing that, I noticed that Arm does per-VMA TLB flushes. I do not know whether this is intentional, but it seems rather easy to get this behavior unintentionally.