After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 has SFB (Store Fill Buffer) and READ_ONCE() may get an old value in a tight loop. So in smp_cond_load_acquire() we need a __smp_mb() after every READ_ONCE().
This patch introduces a Loongson-specific smp_cond_load_acquire(). It should be backported to kernels as early as linux-4.5, the release in which smp_cond_acquire() was introduced.
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhc@lemote.com>
---
 arch/mips/include/asm/barrier.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index a5eb1bb..4ea384d 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -222,6 +222,23 @@
 #define __smp_mb__before_atomic()	__smp_mb__before_llsc()
 #define __smp_mb__after_atomic()	smp_llsc_mb()
 
+#ifdef CONFIG_CPU_LOONGSON3
+/* Loongson-3 need a __smp_mb() after READ_ONCE() here */
+#define smp_cond_load_acquire(ptr, cond_expr)				\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	typeof(*ptr) VAL;						\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		__smp_mb();						\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+	}								\
+	VAL;								\
+})
+#endif /* CONFIG_CPU_LOONGSON3 */
+
 #include <asm-generic/barrier.h>
 
 #endif /* __ASM_BARRIER_H */
Hi Huacai,
On Fri, Jun 15, 2018 at 02:07:38PM +0800, Huacai Chen wrote:
After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 has SFB (Store Fill Buffer) and READ_ONCE() may get an old value in a tight loop. So in smp_cond_load_acquire() we need a __smp_mb() after every READ_ONCE().
Thanks - modifying smp_cond_load_acquire() is a step better than modifying arch_mcs_spin_lock_contended() to avoid it, but I'm still not sure we've reached the root of the problem. If tight loops using READ_ONCE() are at fault then what's special about smp_cond_load_acquire()? Could other such loops not hit the same problem?
Is the scenario you encounter the same as that outlined in the "DATA DEPENDENCY BARRIERS (HISTORICAL)" section of Documentation/memory-barriers.txt by any chance? If so then perhaps it would be better to implement smp_read_barrier_depends(), or just raw read_barrier_depends() depending upon how your SFB functions.
Is there any public documentation describing the behaviour of the store fill buffer in Loongson-3?
Part of the problem is that I'm still not sure what's actually happening in your system - it would be helpful to have further information in the commit message about what actually happens. For example if you could walk us through an example of the problem step by step in the style of the diagrams you'll see in Documentation/memory-barriers.txt then I think that would help us to see what the best solution here is.
I've copied the LKMM maintainers in case they have further input.
Thanks, Paul
This patch introduces a Loongson-specific smp_cond_load_acquire(). It should be backported to kernels as early as linux-4.5, the release in which smp_cond_acquire() was introduced.
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhc@lemote.com>
---
 arch/mips/include/asm/barrier.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index a5eb1bb..4ea384d 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -222,6 +222,23 @@
 #define __smp_mb__before_atomic()	__smp_mb__before_llsc()
 #define __smp_mb__after_atomic()	smp_llsc_mb()
 
+#ifdef CONFIG_CPU_LOONGSON3
+/* Loongson-3 need a __smp_mb() after READ_ONCE() here */
+#define smp_cond_load_acquire(ptr, cond_expr)				\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	typeof(*ptr) VAL;						\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		__smp_mb();						\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+	}								\
+	VAL;								\
+})
+#endif /* CONFIG_CPU_LOONGSON3 */
+
 #include <asm-generic/barrier.h>
 
 #endif /* __ASM_BARRIER_H */
2.7.0
Hi, Paul,
First of all, could you please check why linux-mips rejects e-mails from lemote.com? Of course I can send e-mails via gmail, but my gmail hasn't been able to receive e-mails from linux-mips since March 2018.
I have already read Documentation/memory-barriers.txt, but I don't think we should define smp_read_barrier_depends() for Loongson-3, because Loongson-3's behaviour isn't like Alpha's and, syntactically, this is not a data-dependency issue.

There is no public documentation about Loongson-3's SFB. In my opinion, the SFB looks like an L0 cache, but it is sometimes outside the cache-coherency mechanism (the L1 cache's cross-core coherency is maintained by hardware, but that is not always true for the SFB). smp_mb() is needed in smp_cond_load_acquire(), but not after every READ_ONCE().
Huacai
------------------ Original ------------------
From: "Paul Burton" <paul.burton@mips.com>
Date: Tue, Jun 19, 2018 02:51 AM
To: "Huacai Chen" <chenhc@lemote.com>
Subject: Re: [PATCH] MIPS: implement smp_cond_load_acquire() for Loongson-3
On Tue, Jun 19, 2018 at 02:40:14PM +0800, 陈华才 wrote:
Hi, Paul,
First of all, could you please check why linux-mips reject e-mails from lemote.com? Of course I can send e-mails by gmail, but my gmail can't receive e-mails from linux-mips since March, 2018.
Could you please learn to use email? No top posting and wrap lines at 78 chars.
I have already read Documentation/memory-barriers.txt, but I don't think we should define a smp_read_barrier_depends() for Loongson-3. Because Loongson-3's behavior isn't like Alpha, and in syntax, this is not a data-dependent issue.
Agreed, this is not a data-dependency issue.
There is no public documentation about Loongson-3's SFB. In my opinion, the SFB looks like an L0 cache, but it is sometimes outside the cache-coherency mechanism (the L1 cache's cross-core coherency is maintained by hardware, but that is not always true for the SFB). smp_mb() is needed in smp_cond_load_acquire(), but not after every READ_ONCE().
Linux does _NOT_ support non-coherent SMP. If your system is not fully coherent, you're out of luck.
But please, explain in excruciating detail what exactly you need that smp_mb for. If, like I posited in my previous email, it is to ensure remote store buffer flushes, then your machine is terminally broken.
Hi, Peter and Paul,
Loongson-3's Store Fill Buffer is nearly the same as your "store buffer", and it increases the memory-ordering weakness. So smp_cond_load_acquire() only needs a __smp_mb() before the loop, not after every READ_ONCE(). In other words, the following code is OK:
#define smp_cond_load_acquire(ptr, cond_expr)				\
({									\
	typeof(ptr) __PTR = (ptr);					\
	typeof(*ptr) VAL;						\
	__smp_mb();							\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
	}								\
	__smp_mb();							\
	VAL;								\
})
The __smp_mb() before the loop is used to avoid "reads prioritised over writes", which is caused by the SFB's weak ordering and is similar to the ARM11MPCore issue (mentioned by Will Deacon).
Huacai
------------------ Original ------------------
From: "Peter Zijlstra" <peterz@infradead.org>
Date: Tue, Jun 19, 2018 03:22 PM
To: "陈华才" <chenhc@lemote.com>
Subject: Re: [PATCH] MIPS: implement smp_cond_load_acquire() for Loongson-3
On Wed, Jun 20, 2018 at 11:31:55AM +0800, 陈华才 wrote:
Loongson-3's Store Fill Buffer is nearly the same as your "Store Buffer", and it increases the memory ordering weakness. So, smp_cond_load_acquire() only need a __smp_mb() before the loop, not after every READ_ONCE(). In other word, the following code is just OK:
#define smp_cond_load_acquire(ptr, cond_expr)				\
({									\
	typeof(ptr) __PTR = (ptr);					\
	typeof(*ptr) VAL;						\
	__smp_mb();							\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
	}								\
	__smp_mb();							\
	VAL;								\
})
the __smp_mb() before loop is used to avoid "reads prioritised over writes", which is caused by SFB's weak ordering and similar to ARM11MPCore (mentioned by Will Deacon).
Sure, but smp_cond_load_acquire() isn't the only place you'll see this sort of pattern in the kernel. In other places, the only existing arch hook is cpu_relax(), so unless you want to audit all loops and add a special MIPS-specific smp_mb() to those that are affected, I think your only option is to stick it in cpu_relax().
I assume you don't have a control register that can disable this prioritisation in the SFB?
Will
On Wed, Jun 20, 2018 at 09:17:16AM +0100, Will Deacon wrote:
On Wed, Jun 20, 2018 at 11:31:55AM +0800, 陈华才 wrote:
Loongson-3's Store Fill Buffer is nearly the same as your "Store Buffer", and it increases the memory ordering weakness. So, smp_cond_load_acquire() only need a __smp_mb() before the loop, not after every READ_ONCE(). In other word, the following code is just OK:
#define smp_cond_load_acquire(ptr, cond_expr)				\
({									\
	typeof(ptr) __PTR = (ptr);					\
	typeof(*ptr) VAL;						\
	__smp_mb();							\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
	}								\
	__smp_mb();							\
	VAL;								\
})
the __smp_mb() before loop is used to avoid "reads prioritised over writes", which is caused by SFB's weak ordering and similar to ARM11MPCore (mentioned by Will Deacon).
Sure, but smp_cond_load_acquire() isn't the only place you'll see this sort of pattern in the kernel. In other places, the only existing arch hook is cpu_relax(), so unless you want to audit all loops and add a special MIPs-specific smp_mb() to those that are affected, I think your only option is to stick it in cpu_relax().
I assume you don't have a control register that can disable this prioritisation in the SFB?
Right, I think we also want to clarify that this 'feature' is not supported by the Linux kernel in general, or by the LKMM in particular.
It really is a CPU bug. And the cpu_relax() change is a best effort work-around.
On Mon, Jun 18, 2018 at 11:51:41AM -0700, Paul Burton wrote:
Hi Huacai,
On Fri, Jun 15, 2018 at 02:07:38PM +0800, Huacai Chen wrote:
After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 has SFB (Store Fill Buffer) and READ_ONCE() may get an old value in a tight loop. So in smp_cond_load_acquire() we need a __smp_mb() after every READ_ONCE().
Thanks - modifying smp_cond_load_acquire() is a step better than modifying arch_mcs_spin_lock_contended() to avoid it, but I'm still not sure we've reached the root of the problem.
Agreed, this looks entirely dodgy.
If tight loops using READ_ONCE() are at fault then what's special about smp_cond_load_acquire()? Could other such loops not hit the same problem?
Right again, Linux has a number of places where it relies on loops like this.
	for (;;) {
		if (READ_ONCE(*ptr))
			break;

		cpu_relax();
	}
That is assumed to terminate -- provided the store to make *ptr != 0 happens of course.
And this has nothing to do with store buffers per se, sure store-buffers might delay the store from being visible for a (little) while, but we very much assume store buffers will not indefinitely hold on to data.
The proposed __smp_mb() doesn't make any kind of sense here. I presume its effect is to flush remote store buffers, but that is not something the LKMM allows for.
Is the scenario you encounter the same as that outlined in the "DATA DEPENDENCY BARRIERS (HISTORICAL)" section of Documentation/memory-barriers.txt by any chance? If so then perhaps it would be better to implement smp_read_barrier_depends(), or just raw read_barrier_depends() depending upon how your SFB functions.
That doesn't make any sense, there is no actual data dependency here. We load a single variable. Data dependencies are where you have (at least) 2 loads, where the second depends on the first, like:
	struct obj *obj = rcu_dereference(objp);
	int val = obj->val;
Here the load of val depends on the load of obj.
Is there any public documentation describing the behaviour of the store fill buffer in Loongson-3?
I know of Store Buffers, but what exactly is a Store Fill Buffer? Better would be to get a coherent memory model document on Loongson, because this all smells horribly.
Part of the problem is that I'm still not sure what's actually happening in your system - it would be helpful to have further information in the commit message about what actually happens. For example if you could walk us through an example of the problem step by step in the style of the diagrams you'll see in Documentation/memory-barriers.txt then I think that would help us to see what the best solution here is.
I've copied the LKMM maintainers in case they have further input.
Thanks, patches like proposed certainly require closer scrutiny and I agree with you that this needs far better explanation.
Hi all,
On Tue, Jun 19, 2018 at 09:17:10AM +0200, Peter Zijlstra wrote:
On Mon, Jun 18, 2018 at 11:51:41AM -0700, Paul Burton wrote:
On Fri, Jun 15, 2018 at 02:07:38PM +0800, Huacai Chen wrote:
After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 has SFB (Store Fill Buffer) and READ_ONCE() may get an old value in a tight loop. So in smp_cond_load_acquire() we need a __smp_mb() after every READ_ONCE().
Thanks - modifying smp_cond_load_acquire() is a step better than modifying arch_mcs_spin_lock_contended() to avoid it, but I'm still not sure we've reached the root of the problem.
Agreed, this looks entirely dodgy.
If tight loops using READ_ONCE() are at fault then what's special about smp_cond_load_acquire()? Could other such loops not hit the same problem?
Right again, Linux has a number of places where it relies on loops like this.
	for (;;) {
		if (READ_ONCE(*ptr))
			break;

		cpu_relax();
	}
That is assumed to terminate -- provided the store to make *ptr != 0 happens of course.
And this has nothing to do with store buffers per se, sure store-buffers might delay the store from being visible for a (little) while, but we very much assume store buffers will not indefinitely hold on to data.
We had an issue 8 years ago with the 11MPCore CPU where reads were prioritised over writes, so code doing something like:
	WRITE_ONCE(*foo, 1);
	while (!READ_ONCE(*bar));
might never make the store to foo visible to other CPUs. This caused a livelock in KGDB, where two CPUs were doing this on opposite variables (i.e. the "SB" litmus test, but with the reads looping until they read 1).
See 534be1d5a2da ("ARM: 6194/1: change definition of cpu_relax() for ARM11MPCore") for the ugly fix, assuming that the "Store Fill Buffer" suffers from the same disease.
Will