[ Restoring the recipients after mistakenly pressing reply instead of reply-all ]
On May 9, 2019, at 12:11 PM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 09, 2019 at 06:50:00PM +0000, Nadav Amit wrote:
On May 9, 2019, at 11:24 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc.
The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
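Roughly, the idea is something like the following sketch (the mm->tlb_flush_freed_tables counter and both helpers are hypothetical, not existing kernel code):

static inline void tlb_note_freed_tables(struct mmu_gather *tlb)
{
	if (!tlb->freed_tables) {
		tlb->freed_tables = 1;
		/* first table freed by this gather: publish it mm-wide */
		atomic_inc(&tlb->mm->tlb_flush_freed_tables);
	}
}

static inline bool mm_freed_tables(struct mm_struct *mm)
{
	/* tlb_flush_mmu() would consult this instead of tlb->freed_tables */
	return atomic_read(&mm->tlb_flush_freed_tables) != 0;
}

The counter would be decremented again once the corresponding flush has been issued.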
That sounds fraught with races and expensive; I would much prefer to not go there for this arguably rare case.
Consider such fun cases as where CPU-0 sees and clears a PTE while CPU-1 races and doesn't see that PTE. Therefore CPU-0 sets and counts cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, it will see the cleared_ptes count increased and flush that granularity; OTOH if CPU-1 flushes after CPU-0 completes, it will not, and potentially miss an invalidate it should have had.
CPU-0 would send a TLB shootdown request to CPU-1 when it is done, so I don’t see the problem. The TLB shootdown mechanism is independent of the mmu_gather for that matter.
Duh.. I still don't like those unconditional mm wide atomic counters.
This whole concurrent mmu_gather stuff is horrible.
/me ponders more....
So I think the fundamental race here is this:
CPU-0                                   CPU-1

tlb_gather_mmu(.start=1,                tlb_gather_mmu(.start=2,
               .end=3);                                .end=4);

ptep_get_and_clear_full(2)
tlb_remove_tlb_entry(2);
__tlb_remove_page();
                                        if (pte_present(2)) // nope

                                        tlb_finish_mmu();
                                          // continue without TLBI(2)
                                          // whoopsie

tlb_finish_mmu();
  tlb_flush() -> TLBI(2)
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers have completed.
This should not be too hard to make happen.
This synchronization sounds much more expensive than what I proposed. But I agree that cache-lines that move from one CPU to another might become an issue. But I think that the scheme I suggested would minimize this overhead.
Well, it would have a lot more unconditional atomic ops. My scheme only waits when there is actual concurrency.
Well, something has to give. I didn’t think that if the same core does the atomic op it would be too expensive.
I _think_ something like the below ought to work, but it's not even been near a compiler. The only problem is the unconditional wakeup; we can play games to avoid that if we want to continue with this.
Ideally we'd only do this when there's been actual overlap, but I've not found a sensible way to detect that.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4ef4bbe78a1d..b70e35792d29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
+		wake_up_var(&mm->tlb_flush_pending);
+	} else {
+		wait_event_var(&mm->tlb_flush_pending,
+			       !atomic_read_acquire(&mm->tlb_flush_pending));
+	}
 }
It still seems very expensive to me, at least for certain workloads (e.g., Apache with multithreaded MPM).
It may be possible to avoid false-positive nesting indications (when the flushes do not overlap) by creating a new struct mmu_gather_pending, with something like:
struct mmu_gather_pending {
	u64 start;
	u64 end;
	struct mmu_gather_pending *next;
};
tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending (pointing to the linked list) and find whether there is any overlap. This would still require synchronization (acquiring a lock when allocating and deallocating or something fancier).
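For illustration, the overlap test over such a list could be a simple walk (a sketch only; mm_gather_overlaps() is a made-up helper, and real code would still need the locking mentioned above):

static bool mm_gather_overlaps(struct mmu_gather_pending *list,
			       u64 start, u64 end)
{
	struct mmu_gather_pending *p;

	for (p = list; p; p = p->next) {
		/* half-open ranges [start, end) intersect */
		if (p->start < end && start < p->end)
			return true;
	}
	return false;
}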
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers have completed.
This should not be too hard to make happen.
This synchronization sounds much more expensive than what I proposed. But I agree that cache-lines that move from one CPU to another might become an issue. But I think that the scheme I suggested would minimize this overhead.
Well, it would have a lot more unconditional atomic ops. My scheme only waits when there is actual concurrency.
Well, something has to give. I didn’t think that if the same core does the atomic op it would be too expensive.
They're still at least 20 cycles a pop, uncontended.
I _think_ something like the below ought to work, but it's not even been near a compiler. The only problem is the unconditional wakeup; we can play games to avoid that if we want to continue with this.
Ideally we'd only do this when there's been actual overlap, but I've not found a sensible way to detect that.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4ef4bbe78a1d..b70e35792d29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
+		wake_up_var(&mm->tlb_flush_pending);
+	} else {
+		wait_event_var(&mm->tlb_flush_pending,
+			       !atomic_read_acquire(&mm->tlb_flush_pending));
+	}
 }
It still seems very expensive to me, at least for certain workloads (e.g., Apache with multithreaded MPM).
Is that Apache-MPM workload triggering this lots? Having a known benchmark for this stuff is good for when someone has time to play with things.
It may be possible to avoid false-positive nesting indications (when the flushes do not overlap) by creating a new struct mmu_gather_pending, with something like:
struct mmu_gather_pending {
	u64 start;
	u64 end;
	struct mmu_gather_pending *next;
};
tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending (pointing to the linked list) and find whether there is any overlap. This would still require synchronization (acquiring a lock when allocating and deallocating or something fancier).
We have an interval_tree for this, and yes, that's how far I got :/
The other thing I was thinking of is trying to detect overlap through the page-tables themselves, but we have a distinct lack of storage there.
The thing is, if this threaded monster runs on all CPUs (busy front-end server) and does a ton of invalidation due to all the short-lived request crud, then all the extra invalidations will add up too. Having to do process-wide (machine-wide in this case) invalidations is expensive; having to do more of them surely isn't cheap either.
So there might be something to win here.
On May 13, 2019, at 1:36 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers have completed.
This should not be too hard to make happen.
This synchronization sounds much more expensive than what I proposed. But I agree that cache-lines that move from one CPU to another might become an issue. But I think that the scheme I suggested would minimize this overhead.
Well, it would have a lot more unconditional atomic ops. My scheme only waits when there is actual concurrency.
Well, something has to give. I didn’t think that if the same core does the atomic op it would be too expensive.
They're still at least 20 cycles a pop, uncontended.
I _think_ something like the below ought to work, but it's not even been near a compiler. The only problem is the unconditional wakeup; we can play games to avoid that if we want to continue with this.
Ideally we'd only do this when there's been actual overlap, but I've not found a sensible way to detect that.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4ef4bbe78a1d..b70e35792d29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
+		wake_up_var(&mm->tlb_flush_pending);
+	} else {
+		wait_event_var(&mm->tlb_flush_pending,
+			       !atomic_read_acquire(&mm->tlb_flush_pending));
+	}
 }
It still seems very expensive to me, at least for certain workloads (e.g., Apache with multithreaded MPM).
Is that Apache-MPM workload triggering this lots? Having a known benchmark for this stuff is good for when someone has time to play with things.
Setting Apache2 with mpm_worker causes every request to go through mmap-writev-munmap flow on every thread. I didn’t run this workload after the patches that downgrade the mmap_sem to read before the page-table zapping were introduced. I presume these patches would allow the page-table zapping to be done concurrently, and therefore would hit this flow.
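For reference, the per-request pattern described above is roughly the following (an illustrative sketch, not actual Apache code):

#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Each worker thread does this per request: map a buffer, send it with
 * writev(), then unmap it; the munmap() is what zaps PTEs and flushes
 * the TLB. */
static void handle_request(int fd, size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return;

	memset(buf, 'x', len);	/* build the response */

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	writev(fd, &iov, 1);	/* send it */

	munmap(buf, len);	/* PTE zap + TLB shootdown */
}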
It may be possible to avoid false-positive nesting indications (when the flushes do not overlap) by creating a new struct mmu_gather_pending, with something like:
struct mmu_gather_pending {
	u64 start;
	u64 end;
	struct mmu_gather_pending *next;
};
tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending (pointing to the linked list) and find whether there is any overlap. This would still require synchronization (acquiring a lock when allocating and deallocating or something fancier).
We have an interval_tree for this, and yes, that's how far I got :/
The other thing I was thinking of is trying to detect overlap through the page-tables themselves, but we have a distinct lack of storage there.
I tried to think about saving some generation info somewhere in the page struct, but I could not come up with a reasonable solution that would not require traversing all the page tables again once the TLB flush is done.
The thing is, if this threaded monster runs on all CPUs (busy front-end server) and does a ton of invalidation due to all the short-lived request crud, then all the extra invalidations will add up too. Having to do process-wide (machine-wide in this case) invalidations is expensive; having to do more of them surely isn't cheap either.
So there might be something to win here.
Yes. I remember that these full TLB flushes leave their mark.
BTW: sometimes you don’t see the effect of these full TLB flushes as much in VMs. I encountered a strange phenomenon at the time - INVLPG for an arbitrary page caused my Haswell machine to flush the entire TLB when the INVLPG was issued inside a VM. It took me quite some time to analyze this problem. Eventually Intel told me that’s part of what is called “page fracturing” - if the host uses 4k pages in the EPT, they (usually) need to flush the entire TLB for any INVLPG. That happens since they don’t know the size of the flushed page.
I really need to finish my blog-post about it some day.
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
BTW: sometimes you don’t see the effect of these full TLB flushes as much in VMs. I encountered a strange phenomenon at the time - INVLPG for an arbitrary page caused my Haswell machine to flush the entire TLB when the INVLPG was issued inside a VM. It took me quite some time to analyze this problem. Eventually Intel told me that’s part of what is called “page fracturing” - if the host uses 4k pages in the EPT, they (usually) need to flush the entire TLB for any INVLPG. That happens since they don’t know the size of the flushed page.
Cute... if only they'd given us an interface to tell them... :-)
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
On May 13, 2019, at 1:36 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers have completed.
This should not be too hard to make happen.
This synchronization sounds much more expensive than what I proposed. But I agree that cache-lines that move from one CPU to another might become an issue. But I think that the scheme I suggested would minimize this overhead.
Well, it would have a lot more unconditional atomic ops. My scheme only waits when there is actual concurrency.
Well, something has to give. I didn’t think that if the same core does the atomic op it would be too expensive.
They're still at least 20 cycles a pop, uncontended.
I _think_ something like the below ought to work, but it's not even been near a compiler. The only problem is the unconditional wakeup; we can play games to avoid that if we want to continue with this.
Ideally we'd only do this when there's been actual overlap, but I've not found a sensible way to detect that.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4ef4bbe78a1d..b70e35792d29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
+		wake_up_var(&mm->tlb_flush_pending);
+	} else {
+		wait_event_var(&mm->tlb_flush_pending,
+			       !atomic_read_acquire(&mm->tlb_flush_pending));
+	}
 }
It still seems very expensive to me, at least for certain workloads (e.g., Apache with multithreaded MPM).
Is that Apache-MPM workload triggering this lots? Having a known benchmark for this stuff is good for when someone has time to play with things.
Setting Apache2 with mpm_worker causes every request to go through mmap-writev-munmap flow on every thread. I didn’t run this workload after the patches that downgrade the mmap_sem to read before the page-table zapping were introduced. I presume these patches would allow the page-table zapping to be done concurrently, and therefore would hit this flow.
Hmm, I don't think so: munmap() still has to take the semaphore for write initially, so it will be serialised against other munmap() threads even after they've downgraded afaict.
The initial bug report was about concurrent madvise() vs munmap().
Will
On May 13, 2019, at 9:37 AM, Will Deacon will.deacon@arm.com wrote:
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
On May 13, 2019, at 1:36 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers have completed.
This should not be too hard to make happen.
This synchronization sounds much more expensive than what I proposed. But I agree that cache-lines that move from one CPU to another might become an issue. But I think that the scheme I suggested would minimize this overhead.
Well, it would have a lot more unconditional atomic ops. My scheme only waits when there is actual concurrency.
Well, something has to give. I didn’t think that if the same core does the atomic op it would be too expensive.
They're still at least 20 cycles a pop, uncontended.
I _think_ something like the below ought to work, but it's not even been near a compiler. The only problem is the unconditional wakeup; we can play games to avoid that if we want to continue with this.
Ideally we'd only do this when there's been actual overlap, but I've not found a sensible way to detect that.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4ef4bbe78a1d..b70e35792d29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
+		wake_up_var(&mm->tlb_flush_pending);
+	} else {
+		wait_event_var(&mm->tlb_flush_pending,
+			       !atomic_read_acquire(&mm->tlb_flush_pending));
+	}
 }
It still seems very expensive to me, at least for certain workloads (e.g., Apache with multithreaded MPM).
Is that Apache-MPM workload triggering this lots? Having a known benchmark for this stuff is good for when someone has time to play with things.
Setting Apache2 with mpm_worker causes every request to go through mmap-writev-munmap flow on every thread. I didn’t run this workload after the patches that downgrade the mmap_sem to read before the page-table zapping were introduced. I presume these patches would allow the page-table zapping to be done concurrently, and therefore would hit this flow.
Hmm, I don't think so: munmap() still has to take the semaphore for write initially, so it will be serialised against other munmap() threads even after they've downgraded afaict.
The initial bug report was about concurrent madvise() vs munmap().
I guess you are right (and I’m wrong).
Short search suggests that ebizzy might be affected (a thread by Mel Gorman): https://lkml.org/lkml/2015/2/2/493
On Mon, May 13, 2019 at 05:06:03PM +0000, Nadav Amit wrote:
On May 13, 2019, at 9:37 AM, Will Deacon will.deacon@arm.com wrote:
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
On May 13, 2019, at 1:36 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers have completed.
This should not be too hard to make happen.
This synchronization sounds much more expensive than what I proposed. But I agree that cache-lines that move from one CPU to another might become an issue. But I think that the scheme I suggested would minimize this overhead.
Well, it would have a lot more unconditional atomic ops. My scheme only waits when there is actual concurrency.
Well, something has to give. I didn’t think that if the same core does the atomic op it would be too expensive.
They're still at least 20 cycles a pop, uncontended.
I _think_ something like the below ought to work, but it's not even been near a compiler. The only problem is the unconditional wakeup; we can play games to avoid that if we want to continue with this.
Ideally we'd only do this when there's been actual overlap, but I've not found a sensible way to detect that.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4ef4bbe78a1d..b70e35792d29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
+		wake_up_var(&mm->tlb_flush_pending);
+	} else {
+		wait_event_var(&mm->tlb_flush_pending,
+			       !atomic_read_acquire(&mm->tlb_flush_pending));
+	}
 }
It still seems very expensive to me, at least for certain workloads (e.g., Apache with multithreaded MPM).
Is that Apache-MPM workload triggering this lots? Having a known benchmark for this stuff is good for when someone has time to play with things.
Setting Apache2 with mpm_worker causes every request to go through mmap-writev-munmap flow on every thread. I didn’t run this workload after the patches that downgrade the mmap_sem to read before the page-table zapping were introduced. I presume these patches would allow the page-table zapping to be done concurrently, and therefore would hit this flow.
Hmm, I don't think so: munmap() still has to take the semaphore for write initially, so it will be serialised against other munmap() threads even after they've downgraded afaict.
The initial bug report was about concurrent madvise() vs munmap().
I guess you are right (and I’m wrong).
Short search suggests that ebizzy might be affected (a thread by Mel Gorman): https://lkml.org/lkml/2015/2/2/493
Glibc has since been fixed to be less munmap/mmap intensive, and the system CPU usage of ebizzy is generally negligible unless it is specifically configured to use mmap/munmap instead of malloc/free, which is unrealistic for good application behaviour.
On Mon, May 13, 2019 at 10:36:06AM +0200, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
It may be possible to avoid false-positive nesting indications (when the flushes do not overlap) by creating a new struct mmu_gather_pending, with something like:
struct mmu_gather_pending {
	u64 start;
	u64 end;
	struct mmu_gather_pending *next;
};
tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending (pointing to the linked list) and find whether there is any overlap. This would still require synchronization (acquiring a lock when allocating and deallocating or something fancier).
We have an interval_tree for this, and yes, that's how far I got :/
The other thing I was thinking of is trying to detect overlap through the page-tables themselves, but we have a distinct lack of storage there.
We might just use some state in the pmd, there's still 2 _pt_pad_[12] in struct page to 'use'. So we could come up with some tlb generation scheme that would detect conflict.
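Very roughly, such a scheme might look like the sketch below, where every field name (pt_tlb_gen, tlb_flush_gen, start_gen) is made up and nothing like it exists today:

static inline void pt_stamp(struct mm_struct *mm, struct page *pt)
{
	/* record the mm-wide flush generation at which this table changed */
	pt->pt_tlb_gen = atomic64_read(&mm->tlb_flush_gen);
}

static inline bool pt_conflict(struct mmu_gather *tlb, struct page *pt)
{
	/* someone modified this page table after we started gathering */
	return pt->pt_tlb_gen > tlb->start_gen;
}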
On May 13, 2019, at 2:12 AM, Peter Zijlstra peterz@infradead.org wrote:
On Mon, May 13, 2019 at 10:36:06AM +0200, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
It may be possible to avoid false-positive nesting indications (when the flushes do not overlap) by creating a new struct mmu_gather_pending, with something like:
struct mmu_gather_pending {
	u64 start;
	u64 end;
	struct mmu_gather_pending *next;
};
tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending (pointing to the linked list) and find whether there is any overlap. This would still require synchronization (acquiring a lock when allocating and deallocating or something fancier).
We have an interval_tree for this, and yes, that's how far I got :/
The other thing I was thinking of is trying to detect overlap through the page-tables themselves, but we have a distinct lack of storage there.
We might just use some state in the pmd, there's still 2 _pt_pad_[12] in struct page to 'use'. So we could come up with some tlb generation scheme that would detect conflict.
It is rather easy to come up with a scheme (and I did similar things) if you flush the table while you hold the page-tables lock. But if you batch across page-tables it becomes harder.
Thinking about it while typing, perhaps it is simpler than I think - if you need to flush a range that runs across more than a single table, you are very likely to flush a range of more than 33 entries, so anyhow you are likely to do a full TLB flush.
So perhaps just avoiding the batching if only entries from a single table are flushed would be enough.
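The "single table" test itself is cheap; something like the sketch below (name made up), which just checks that the range never crosses a PMD boundary:

static inline bool range_in_one_pt(unsigned long start, unsigned long end)
{
	/* [start, end) is covered by one last-level page table */
	return (start & PMD_MASK) == ((end - 1) & PMD_MASK);
}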
On Mon, May 13, 2019 at 09:21:01AM +0000, Nadav Amit wrote:
On May 13, 2019, at 2:12 AM, Peter Zijlstra peterz@infradead.org wrote:
The other thing I was thinking of is trying to detect overlap through the page-tables themselves, but we have a distinct lack of storage there.
We might just use some state in the pmd, there's still 2 _pt_pad_[12] in struct page to 'use'. So we could come up with some tlb generation scheme that would detect conflict.
It is rather easy to come up with a scheme (and I did similar things) if you flush the table while you hold the page-tables lock. But if you batch across page-tables it becomes harder.
Yeah; finding that out now. I keep finding holes :/
Thinking about it while typing, perhaps it is simpler than I think - if you need to flush a range that runs across more than a single table, you are very likely to flush a range of more than 33 entries, so anyhow you are likely to do a full TLB flush.
We can't rely on the 33, that's x86-specific. Other architectures can have another (or no) limit on that.
So perhaps just avoiding the batching if only entries from a single table are flushed would be enough.
That's near to what Will suggested initially, just flush the entire thing when there's a conflict.
On May 13, 2019, at 4:27 AM, Peter Zijlstra peterz@infradead.org wrote:
On Mon, May 13, 2019 at 09:21:01AM +0000, Nadav Amit wrote:
On May 13, 2019, at 2:12 AM, Peter Zijlstra peterz@infradead.org wrote:
The other thing I was thinking of is trying to detect overlap through the page-tables themselves, but we have a distinct lack of storage there.
We might just use some state in the pmd, there's still 2 _pt_pad_[12] in struct page to 'use'. So we could come up with some tlb generation scheme that would detect conflict.
It is rather easy to come up with a scheme (and I did similar things) if you flush the table while you hold the page-tables lock. But if you batch across page-tables it becomes harder.
Yeah; finding that out now. I keep finding holes :/
You can use Uhlig’s dissertation for inspiration (Section 4.4).
[1] https://www.researchgate.net/publication/36450482_Scalability_of_microkernel...
Thinking about it while typing, perhaps it is simpler than I think - if you need to flush range that runs across more than a single table, you are very likely to flush a range of more than 33 entries, so anyhow you are likely to do a full TLB flush.
We can't rely on the 33, that's x86-specific. Other architectures can have another (or no) limit on that.
I wonder whether there are architectures that do not invalidate the TLB entry by entry and therefore experience these kinds of overheads.
So perhaps just avoiding the batching if only entries from a single table are flushed would be enough.
That's near to what Will suggested initially, just flush the entire thing when there's a conflict.
One question is how you define a conflict. IIUC, Will suggests that the same mm marks a conflict. In addition, I suggest that if you only remove a single entry (or a few), you would just not batch and instead do the flushing while holding the page-table lock.