After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 has an SFB (Store Fill Buffer), and the weak ordering may cause READ_ONCE() to get an old value in a tight loop. So in smp_cond_load_acquire() we need a __smp_rmb() before the READ_ONCE() loop.

This patch introduces a Loongson-specific smp_cond_load_acquire(). It should be backported as far back as linux-4.5, the release in which smp_cond_acquire() was introduced.

There may be other cases where memory barriers are needed; we will fix them one by one.
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhc@lemote.com>
---
 arch/mips/include/asm/barrier.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index a5eb1bb..e8c4c63 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -222,6 +222,24 @@
 #define __smp_mb__before_atomic()	__smp_mb__before_llsc()
 #define __smp_mb__after_atomic()	smp_llsc_mb()
 
+#ifdef CONFIG_CPU_LOONGSON3
+/* Loongson-3 need a __smp_rmb() before READ_ONCE() loop */
+#define smp_cond_load_acquire(ptr, cond_expr)				\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	typeof(*ptr) VAL;						\
+	__smp_rmb();							\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+	}								\
+	__smp_rmb();							\
+	VAL;								\
+})
+#endif /* CONFIG_CPU_LOONGSON3 */
+
 #include <asm-generic/barrier.h>
#endif /* __ASM_BARRIER_H */
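For context, the MCS spin loop named in the commit message reaches smp_cond_load_acquire() through arch_mcs_spin_lock_contended(); a sketch of that macro, roughly as it looks in kernel/locking/mcs_spinlock.h after the commit above (from memory, so details may differ):

	#define arch_mcs_spin_lock_contended(l)				\
	do {								\
		smp_cond_load_acquire(l, VAL);				\
	} while (0)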
Hi Huacai,
On Mon, Jul 09, 2018 at 10:26:38AM +0800, Huacai Chen wrote:
After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 has an SFB (Store Fill Buffer), and the weak ordering may cause READ_ONCE() to get an old value in a tight loop. So in smp_cond_load_acquire() we need a __smp_rmb() before the READ_ONCE() loop.

This patch introduces a Loongson-specific smp_cond_load_acquire(). It should be backported as far back as linux-4.5, the release in which smp_cond_acquire() was introduced.

There may be other cases where memory barriers are needed; we will fix them one by one.
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhc@lemote.com>

 arch/mips/include/asm/barrier.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index a5eb1bb..e8c4c63 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -222,6 +222,24 @@
 #define __smp_mb__before_atomic()	__smp_mb__before_llsc()
 #define __smp_mb__after_atomic()	smp_llsc_mb()
 
+#ifdef CONFIG_CPU_LOONGSON3
+/* Loongson-3 need a __smp_rmb() before READ_ONCE() loop */
+#define smp_cond_load_acquire(ptr, cond_expr)				\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	typeof(*ptr) VAL;						\
+	__smp_rmb();							\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+	}								\
+	__smp_rmb();							\
+	VAL;								\
+})
+#endif /* CONFIG_CPU_LOONGSON3 */
The discussion on v1 of this patch [1] seemed to converge on the view that Loongson suffers from the same problem as ARM platforms which enable the CONFIG_ARM_ERRATA_754327 workaround, and that we might require a similar workaround.
Is there a reason you've not done that, and instead tweaked your patch that's specific to the smp_cond_load_acquire() case? I'm not comfortable with fixing just this one case when there could be many similar problematic pieces of code you just haven't hit yet.
Please also keep the LKMM maintainers on copy for this - their feedback will be valuable & I'll be much more comfortable applying a workaround for Loongson's behavior here if it's one that they're OK with.
Thanks, Paul
[1] https://www.linux-mips.org/archives/linux-mips/2018-06/msg00139.html
Hi, Paul and Peter,
I think we have found the real root cause. READ_ONCE() doesn't need any barriers; the problematic code is in queued_spin_lock_slowpath() in kernel/locking/qspinlock.c:
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		pv_wait_node(node, prev);
		arch_mcs_spin_lock_contended(&node->locked);

		/*
		 * While waiting for the MCS lock, the next pointer may have
		 * been set by another lock waiter. We optimistically load
		 * the next pointer & prefetch the cacheline for writing
		 * to reduce latency in the upcoming MCS unlock operation.
		 */
		next = READ_ONCE(node->next);
		if (next)
			prefetchw(next);
	}
After WRITE_ONCE(prev->next, node), arch_mcs_spin_lock_contended() enters a READ_ONCE() loop, so the effect of the WRITE_ONCE() is invisible to other cores because of the write buffer. As a result, arch_mcs_spin_lock_contended() will wait forever, because the waiters on prev->next will wait forever. I think the right way to fix this is to flush the SFB after this WRITE_ONCE(), but I don't have a good solution:

1. MIPS has wbflush(), which can be used to flush the SFB, but other archs don't have it;
2. Every arch has mb(), and adding mb() after the WRITE_ONCE() can actually solve Loongson's problem, but semantically mb() is different from wbflush();
3. Maybe we can define a Loongson-specific WRITE_ONCE() (sketched below), but not every WRITE_ONCE() needs wbflush(); we only need wbflush() between a WRITE_ONCE() and a READ_ONCE() loop.
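A rough sketch of option 3, purely for illustration: loongson_write_once() is a hypothetical name, not an existing kernel macro, and the 'sync' stands in for wbflush() on Loongson-3.

	#define loongson_write_once(x, val)				\
	do {								\
		WRITE_ONCE(x, val);					\
		__asm__ __volatile__("sync" : : : "memory");	/* flush the SFB */ \
	} while (0)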
Any ideas?
Huacai
On Tue, Jul 10, 2018 at 12:26:34PM +0800, Huacai Chen wrote:
Hi, Paul and Peter,
I think we have found the real root cause. READ_ONCE() doesn't need any barriers; the problematic code is in queued_spin_lock_slowpath() in kernel/locking/qspinlock.c:
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		pv_wait_node(node, prev);
		arch_mcs_spin_lock_contended(&node->locked);

		/*
		 * While waiting for the MCS lock, the next pointer may have
		 * been set by another lock waiter. We optimistically load
		 * the next pointer & prefetch the cacheline for writing
		 * to reduce latency in the upcoming MCS unlock operation.
		 */
		next = READ_ONCE(node->next);
		if (next)
			prefetchw(next);
	}
After WRITE_ONCE(prev->next, node), arch_mcs_spin_lock_contended() enters a READ_ONCE() loop, so the effect of the WRITE_ONCE() is invisible to other cores because of the write buffer.
And _that_ is a hardware bug. Also please explain how that is different from the ARM bug mentioned elsewhere.
As a result, arch_mcs_spin_lock_contended() will wait forever, because the waiters on prev->next will wait forever. I think the right way to fix this is to flush the SFB after this WRITE_ONCE(), but I don't have a good solution:
1. MIPS has wbflush(), which can be used to flush the SFB, but other archs don't have it;
Sane archs don't need this.
2. Every arch has mb(), and adding mb() after the WRITE_ONCE() can actually solve Loongson's problem, but semantically mb() is different from wbflush();
Still wrong, because any non-broken arch doesn't need that flush to begin with.
3. Maybe we can define a Loongson-specific WRITE_ONCE(), but not every WRITE_ONCE() needs wbflush(); we only need wbflush() between a WRITE_ONCE() and a READ_ONCE() loop.
No no no no ...
So now explain why the cpu_relax() hack that arm did doesn't work for you?
On Tue, Jul 10, 2018 at 11:36:37AM +0200, Peter Zijlstra wrote:
So now explain why the cpu_relax() hack that arm did doesn't work for you?
So below is the patch I think you want; if not explain in detail how this is wrong.
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index af34afbc32d9..e59773de6528 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -386,7 +386,17 @@ unsigned long get_wchan(struct task_struct *p);
 #define KSTK_ESP(tsk) (task_pt_regs(tsk)->regs[29])
 #define KSTK_STATUS(tsk) (task_pt_regs(tsk)->cp0_status)
 
+#ifdef CONFIG_CPU_LOONGSON3
+/*
+ * Loongson-3 has a CPU bug where the store buffer gets starved when stuck in a
+ * read loop. Since spin loops of any kind should have a cpu_relax() in them,
+ * force a store-buffer flush from cpu_relax() such that any pending writes
+ * will become available as expected.
+ */
+#define cpu_relax()	smp_mb()
+#else
 #define cpu_relax()	barrier()
+#endif
 
 /*
  * Return_address is a replacement for __builtin_return_address(count)
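For illustration, the effect on the qspinlock path discussed earlier, sketched rather than quoted from the kernel source: with the change above, each spin iteration on Loongson-3 executes a full barrier (SYNC) and drains the SFB, so the earlier store becomes visible.

	WRITE_ONCE(prev->next, node);		/* this store may sit in the SFB */
	for (;;) {				/* MCS-style wait on node->locked */
		if (READ_ONCE(node->locked))
			break;
		cpu_relax();			/* now smp_mb() on Loongson-3 */
	}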
Hi, Peter,
I'm afraid that you are missing something...
Firstly, our previous conclusion (that READ_ONCE() needs a barrier to avoid 'reads prioritised over writes') is totally wrong. So defining cpu_relax() as smp_mb(), like ARM11MPCore does, is incorrect, even if it can 'solve' Loongson's problem. Secondly, I think the real problem is like this:

1. CPU0 sets the lock to 0, then does something;
2. While CPU0 is doing something, CPU1 sets the flag to 1 with WRITE_ONCE(), and then waits for the lock to become 1 in a READ_ONCE() loop;
3. After CPU0 completes its work, it waits for the flag to become 1, and then sets the lock to 1;
4. When the lock becomes 1, CPU1 leaves the READ_ONCE() loop.

Without the SFB everything is OK, but with the SFB, in step 2 the READ_ONCE() loop comes right after the WRITE_ONCE(), which keeps the flag cached in the SFB (and so invisible to other CPUs) forever, and then both CPU0 and CPU1 wait forever.
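A minimal sketch of that handshake, using kernel-style READ_ONCE()/WRITE_ONCE() as in <linux/compiler.h>; illustrative only, not actual kernel code:

	static int flag, lock;

	void cpu0(void)				/* steps 1 and 3 */
	{
		WRITE_ONCE(lock, 0);
		/* ... do something ... */
		while (!READ_ONCE(flag))	/* never sees flag == 1 */
			;
		WRITE_ONCE(lock, 1);
	}

	void cpu1(void)				/* steps 2 and 4 */
	{
		WRITE_ONCE(flag, 1);		/* store parks in the SFB */
		while (!READ_ONCE(lock))	/* no access to flag's line, no sync */
			;
	}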
I don't think this is a hardware bug. By design, the SFB will be flushed to the L1 cache in three cases:
1. The data in the SFB is full (forms a complete cache line);
2. There is a subsequent read access to the same cache line;
3. A 'sync' instruction is executed.
In this case, there is no other memory access (read or write) between the WRITE_ONCE() and the READ_ONCE() loop, so cases 1 and 2 will not happen, and the only way to make the flag visible is wbflush() (wbflush() is 'sync' in Loongson's case).
I think this problem not only happens on Loongson, but will also happen on other CPUs which have a write buffer (unless the write buffer has a fourth case in which it is flushed).
Huacai
Please!! Learn to use email.
A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing in e-mail?
Also, wrap non-quoted lines to 78 characters.
On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote:
I'm afraid that you are missing something...

Firstly, our previous conclusion (that READ_ONCE() needs a barrier to avoid 'reads prioritised over writes') is totally wrong. So defining cpu_relax() as smp_mb(), like ARM11MPCore does, is incorrect, even if it can 'solve' Loongson's problem. Secondly, I think the real problem is like this:

1. CPU0 sets the lock to 0, then does something;
2. While CPU0 is doing something, CPU1 sets the flag to 1 with WRITE_ONCE(), and then waits for the lock to become 1 in a READ_ONCE() loop;
3. After CPU0 completes its work, it waits for the flag to become 1, and then sets the lock to 1;
4. When the lock becomes 1, CPU1 leaves the READ_ONCE() loop.

Without the SFB everything is OK, but with the SFB, in step 2 the READ_ONCE() loop comes right after the WRITE_ONCE(), which keeps the flag cached in the SFB (and so invisible to other CPUs) forever, and then both CPU0 and CPU1 wait forever.
Sure.. we all got that far. And no, this isn't the _real_ problem. This is a manifestation of the problem.
The problem is that your SFB is broken (per the Linux requirements). We require that stores will become visible. That is, they must not indefinitely (for whatever reason) stay in the store buffer.
I don't think this is a hardware bug. By design, the SFB will be flushed to the L1 cache in three cases:
1. The data in the SFB is full (forms a complete cache line);
2. There is a subsequent read access to the same cache line;
3. A 'sync' instruction is executed.
And I think this _is_ a hardware bug. You just designed the bug instead of it being by accident.
In this case, there is no other memory access (read or write) between the WRITE_ONCE() and the READ_ONCE() loop, so cases 1 and 2 will not happen, and the only way to make the flag visible is wbflush() (wbflush() is 'sync' in Loongson's case).

I think this problem not only happens on Loongson, but will also happen on other CPUs which have a write buffer (unless the write buffer has a fourth case in which it is flushed).
It doesn't happen on _any_ other architecture except that dodgy ARM11MPCore part. Linux hard relies on stores to become available _eventually_.
Still, even with the rules above, the best work-around is still the very same cpu_relax() hack.
On Tue, Jul 10, 2018 at 02:17:27PM +0200, Peter Zijlstra wrote:
Please!! Learn to use email.
A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing in e-mail?
Also, wrap non-quoted lines to 78 characters.
On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote:
I'm afraid that you are missing something...

Firstly, our previous conclusion (that READ_ONCE() needs a barrier to avoid 'reads prioritised over writes') is totally wrong. So defining cpu_relax() as smp_mb(), like ARM11MPCore does, is incorrect, even if it can 'solve' Loongson's problem. Secondly, I think the real problem is like this:

1. CPU0 sets the lock to 0, then does something;
2. While CPU0 is doing something, CPU1 sets the flag to 1 with WRITE_ONCE(), and then waits for the lock to become 1 in a READ_ONCE() loop;
3. After CPU0 completes its work, it waits for the flag to become 1, and then sets the lock to 1;
4. When the lock becomes 1, CPU1 leaves the READ_ONCE() loop.
Are there specific loops in the kernel whose conditions are controlled by READ_ONCE() that don't contain cpu_relax(), smp_mb(), etc.? One way to find them given your description of your hardware is to make cpu_relax() be smp_mb() as Peter suggests, and then run tests to find the problems.
Or have you already done this?
Thanx, Paul
Hello,
On Tue, Jul 10, 2018 at 02:17:27PM +0200, Peter Zijlstra wrote:
1. CPU0 sets the lock to 0, then does something;
2. While CPU0 is doing something, CPU1 sets the flag to 1 with WRITE_ONCE(), and then waits for the lock to become 1 in a READ_ONCE() loop;
3. After CPU0 completes its work, it waits for the flag to become 1, and then sets the lock to 1;
4. When the lock becomes 1, CPU1 leaves the READ_ONCE() loop.

Without the SFB everything is OK, but with the SFB, in step 2 the READ_ONCE() loop comes right after the WRITE_ONCE(), which keeps the flag cached in the SFB (and so invisible to other CPUs) forever, and then both CPU0 and CPU1 wait forever.
Sure.. we all got that far. And no, this isn't the _real_ problem. This is a manifestation of the problem.
The problem is that your SFB is broken (per the Linux requirements). We require that stores will become visible. That is, they must not indefinitely (for whatever reason) stay in the store buffer.
For the record, my understanding is that this doesn't really comply with the MIPS architecture either. It doesn't cover store buffers explicitly but does cover the more general notion of caches.
From section 2.8.8.2 "Cached Memory Access" of the Introduction to the MIPS64 Architecture document, revision 5.04 [1]:
In a cached access, physical memory and all caches in the system containing a copy of the physical location are used to resolve the access. A copy of a location is coherent if the copy was placed in the cache by a cached coherent access; a copy of a location is noncoherent if the copy was placed in the cache by a cached noncoherent access. (Coherency is dictated by the system architecture, not the processor implementation.)
Caches containing a coherent copy of the location are examined and/or modified to keep the contents of the location coherent. It is not possible to predict whether caches holding a noncoherent copy of the location will be examined and/or modified during a cached coherent access.
All of the accesses Linux is performing are cached coherent.
I'm not sure which is the intent (I can ask if someone's interested), but you could either:
1) Consider the store buffer a cache, in which case loads need to check all store buffers from all CPUs because of the "all caches" part of the first quoted sentence.
or
2) Decide store buffers aren't covered by the MIPS architecture documentation at all in which case the only sane thing to do would be to make it transparent to software (and here Loongson's isn't).
I don't think this is a hardware bug. By design, the SFB will be flushed to the L1 cache in three cases:
1. The data in the SFB is full (forms a complete cache line);
2. There is a subsequent read access to the same cache line;
3. A 'sync' instruction is executed.
And I think this _is_ a hardware bug. You just designed the bug instead of it being by accident.
Presuming that the data in the SFB isn't visible to other cores, and presuming that case 2 is only covering loads from the same core that contains the SFB, I agree that this doesn't seem like valid behavior.
Thanks, Paul
[1] https://www.mips.com/?do-download=introduction-to-the-mips64-architecture-v5...
From: Paul Burton
Sent: 10 July 2018 18:11
...
I'm not sure which is the intent (I can ask if someone's interested), but you could either:
1) Consider the store buffer a cache, in which case loads need to check all store buffers from all CPUs because of the "all caches" part of the first quoted sentence.
or
2) Decide store buffers aren't covered by the MIPS architecture documentation at all in which case the only sane thing to do would be to make it transparent to software (and here Loongson's isn't)
...
Store buffers are common and are never transparent to multi-threaded code. They are largely why you need locks.
At least on (early) sparc systems they were between the execution unit and the data cache.
I suspect that 'write starvation' is also common - after all, the purpose of the store buffer is to do reads in preference to writes in order to reduce the cpu stalls waiting for the memory bus (probably the cpu to cache interface).
I think your example is just:
	*(volatile int *)xxx = 1;
	while (!*(volatile int *)yyy)
		continue;
running on two cpus with xxx and yyy swapped?
You need a stronger bus cycle in there somewhere.
David
On Wed, Jul 11, 2018 at 10:04:52AM +0000, David Laight wrote:
I suspect that 'write starvation' is also common - after all, the purpose of the store buffer is to do reads in preference to writes in order to reduce the cpu stalls waiting for the memory bus (probably the cpu to cache interface).

I think your example is just:
	*(volatile int *)xxx = 1;
	while (!*(volatile int *)yyy)
		continue;
running on two cpus with xxx and yyy swapped?
Yep. And Linux has been relying on that working for (afaict) basically forever.
You need a stronger bus cycle in there somewhere.
Since all spin-wait loops _should_ have cpu_relax(), that is the natural place to put it.
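Applied to David's example above (a sketch): with cpu_relax() made a full barrier on Loongson-3, every iteration drains the store buffer, so the store to xxx becomes visible.

	*(volatile int *)xxx = 1;
	while (!*(volatile int *)yyy)
		cpu_relax();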
On 2018-7-10 Tue at 20:17:27, Peter Zijlstra wrote:

Hi Peter,

Since Huacai is unable to send email via a mail client, I'm going to reply for him.
Sure.. we all got that far. And no, this isn't the _real_ problem. This is a manifestation of the problem.
The problem is that your SFB is broken (per the Linux requirements). We require that stores will become visible. That is, they must not indefinitely (for whatever reason) stay in the store buffer.
I don't think this is a hardware bug. By design, the SFB will be flushed to the L1 cache in three cases:
1. The data in the SFB is full (forms a complete cache line);
2. There is a subsequent read access to the same cache line;
3. A 'sync' instruction is executed.
And I think this _is_ a hardware bug. You just designed the bug instead of it being by accident.
Yes, we understand that this hardware feature is not supported by Linux, so it should be treated as a hardware bug as far as Linux is concerned.
It doesn't happen on _any_ other architecture except that dodgy ARM11MPCore part. Linux hard relies on stores to become available _eventually_.
Still, even with the rules above, the best work-around is still the very same cpu_relax() hack.
As you say, the SFB makes Loongson not fully SMP-coherent. Modifying cpu_relax() can solve the current problem, but it is not so straightforward. On the other hand, providing a Loongson-specific WRITE_ONCE() looks more reasonable, because it eliminates the "non-coherency", so we can solve the bug at the root.

Thanks.
--
Jiaxun Yang
On Wed, Jul 11, 2018 at 06:05:51PM +0800, Jiaxun Yang wrote:
As you say, the SFB makes Loongson not fully SMP-coherent. Modifying cpu_relax() can solve the current problem, but it is not so straightforward. On the other hand, providing a Loongson-specific WRITE_ONCE() looks more reasonable, because it eliminates the "non-coherency", so we can solve the bug at the root.
Curious, but why is it not straight-forward to hack cpu_relax()? If you try to hack WRITE_ONCE, you also need to hack atomic_set, atomic64_set and all the places that should be using WRITE_ONCE but aren't ;~)
Will
On Wed, Jul 11, 2018 at 11:21:06AM +0100, Will Deacon wrote:
On Wed, Jul 11, 2018 at 06:05:51PM +0800, Jiaxun Yang wrote:
On 2018-7-10 Tue at 20:17:27, Peter Zijlstra wrote:
Still, even with the rules above, the best work-around is still the very same cpu_relax() hack.
As you say, the SFB makes Loongson not fully SMP-coherent. Modifying cpu_relax() can solve the current problem, but it is not so straightforward. On the other hand, providing a Loongson-specific WRITE_ONCE() looks more reasonable, because it eliminates the "non-coherency", so we can solve the bug at the root.
Curious, but why is it not straight-forward to hack cpu_relax()? If you try to hack WRITE_ONCE, you also need to hack atomic_set, atomic64_set and all the places that should be using WRITE_ONCE but aren't ;~)
Right.
The problem isn't stores per se; normal progress should contain enough stores to flush out 'old' bits in the natural order of things. But the problem is spin-wait loops that inhibit normal progress (and thereby store-buffer flushing).
And all spin-wait loops should be having cpu_relax() in them. So cpu_relax() is the natural place to fix this.
Adding SYNC to WRITE_ONCE()/atomic* will hurt performance lots and will ultimately not guarantee anything more; and as Will said, it will keep you chasing dragons where people forgot to use WRITE_ONCE() where they maybe should've.
From: Peter Zijlstra
Sent: 11 July 2018 12:10
..
Adding SYNC to WRITE_ONCE()/atomic* will hurt performance lots and will ultimately not guarantee anything more; and as Will said, it will keep you chasing dragons where people forgot to use WRITE_ONCE() where they maybe should've.
Also WRITE_ONCE() is there to solve an entirely different problem. If you have a function that does:
	<lots of code without any function calls>
	foo->bar = 1;
the compiler is allowed to write other (unrelated) values to foo->bar in the generated code for <lots of code>.
A long time ago I used a compiler that managed to convert:
	if (foo->bar == 2)
		foo->bar = 3;
into:
	if (foo->bar == 2) {
		foo->bar = 0;
		foo->bar = 3;
	}
When an interrupt read the value 0, a lot of linked lists got screwed up. WRITE_ONCE() is there to ensure that doesn't happen. (In my case 'foo' was a 2-bit wide bitfield, and I suspect you can't use WRITE_ONCE() on bitfields.)
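For comparison, a sketch of the same store done with WRITE_ONCE(), assuming foo->bar is a normal (non-bitfield) member; the volatile access forces the compiler to emit exactly one store for it, with no invented intermediate values:

	if (READ_ONCE(foo->bar) == 2)
		WRITE_ONCE(foo->bar, 3);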
David
Hi Huacai,
On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote:
I don't think this is a hardware bug. By design, the SFB will be flushed to the L1 cache in three cases:
1. The data in the SFB is full (forms a complete cache line);
2. There is a subsequent read access to the same cache line;
3. A 'sync' instruction is executed.
I'd expect successful LL/SC, cache maintenance (and potentially TLB) operations to flush your SFB as well, not that I think that provides a better workaround than throwing a 'sync' into cpu_relax(). I assume the SFB is all physically addressed?
Generally, CPU architectures guarantee that store buffers drain "in finite time", which is a pretty crappy guarantee, but one which tends to be sufficient in practice and is therefore relied upon by software.
Will