Hi,
I'm working with a dual Cortex-A9 system and I've been trying to understand exactly why spinlock functions need to use DMB. It seems that as long as the merging store buffer is flushed, the lock value should end up in the L1 of the unlocking core, and the SCU should either invalidate or update the value in the L1 of the other core. Isn't that enough to maintain coherency and safe locking? And doesn't STREX skip the merging store buffer anyway, meaning we don't even need the flush?
DMB appears to be something of a blunt hammer, especially since it defaults to the system domain, which likely means pushing the write all the way out to main memory, and that can be expensive.
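(For what it's worth, ARMv7 lets DMB take a scope option, so a barrier limited to the inner shareable domain ought to be cheaper than the default full-system one. A sketch of what I mean, assuming GCC inline asm and an ARMv7 assembler; the helper names are mine, not the kernel's:)

	/* Hypothetical helpers, not kernel API: */
	static inline void dmb_full_system(void)
	{
		__asm__ __volatile__("dmb sy" ::: "memory");  /* default: full system scope */
	}

	static inline void dmb_inner_shareable(void)
	{
		__asm__ __volatile__("dmb ish" ::: "memory"); /* inner shareable domain only */
	}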
Are the DMBs in the locks there as a workaround for drivers that don't use smp_mb properly?
I'm currently seeing, based on the performance counters, about 5% of my system cycles disappearing in stalls caused by DMB.
-Pete
On 31 October 2012 02:25, Peter Fordham <peter.fordham@gmail.com> wrote:
> I'm working with a dual Cortex-A9 system and I've been trying to understand exactly why spinlock functions need to use DMB. It seems that as long as the merging store buffer is flushed, the lock value should end up in the L1 of the unlocking core, and the SCU should either invalidate or update the value in the L1 of the other core. Isn't that enough to maintain coherency and safe locking? And doesn't STREX skip the merging store buffer anyway, meaning we don't even need the flush?
> DMB appears to be something of a blunt hammer, especially since it defaults to the system domain, which likely means pushing the write all the way out to main memory, and that can be expensive.
> Are the DMBs in the locks there as a workaround for drivers that don't use smp_mb properly?
> I'm currently seeing, based on the performance counters, about 5% of my system cycles disappearing in stalls caused by DMB.
Hi Peter,
Following is taken from Documentation/memory-barriers.txt:
(5) LOCK operations.
This acts as a one-way permeable barrier. It guarantees that all memory operations after the LOCK operation will appear to happen after the LOCK operation with respect to the other components of the system.
Memory operations that occur before a LOCK operation may appear to happen after it completes.
(6) UNLOCK operations.
This also acts as a one-way permeable barrier. It guarantees that all memory operations before the UNLOCK operation will appear to happen before the UNLOCK operation with respect to the other components of the system.
Memory operations that occur after an UNLOCK operation may appear to happen before it completes.
Because ARMv6 and above are weakly ordered, we need to guarantee that code after the lock executes only after the lock is taken, hence the barrier there.
Likewise, in unlock we must guarantee that code before the unlock has executed before the lock is released, hence the smp_mb() at the beginning.
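To make the placement concrete, here is a minimal sketch of an ARMv6/v7-style spinlock modelled on the kernel's implementation (simplified, not the exact kernel source; I've written the smp_mb() out as an explicit ARMv7 "dmb" instruction, whereas on ARMv6 it is an MCR to CP15 instead):

	typedef struct { volatile unsigned int lock; } my_spinlock_t;

	static inline void my_spin_lock(my_spinlock_t *l)
	{
		unsigned int tmp;

		__asm__ __volatile__(
	"1:	ldrex	%0, [%1]\n"	/* exclusively read the lock value      */
	"	teq	%0, #0\n"	/* already held?                        */
	"	strexeq	%0, %2, [%1]\n"	/* if free, try to store 1 exclusively  */
	"	teqeq	%0, #0\n"	/* did the store succeed?               */
	"	bne	1b"		/* no: retry                            */
		: "=&r" (tmp)
		: "r" (&l->lock), "r" (1)
		: "cc");

		/* smp_mb(): accesses inside the critical section cannot be
		 * observed before the lock is seen as taken. */
		__asm__ __volatile__("dmb" ::: "memory");
	}

	static inline void my_spin_unlock(my_spinlock_t *l)
	{
		/* smp_mb(): accesses inside the critical section cannot be
		 * observed after the unlocking store. */
		__asm__ __volatile__("dmb" ::: "memory");
		l->lock = 0;		/* plain store releases the lock */
	}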
Adding Russell and Arnd in cc to correct my statements :)
-- viresh
On 31 October 2012 07:30, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Following is taken from Documentation/memory-barriers.txt:
> (5) LOCK operations.
> This acts as a one-way permeable barrier. It guarantees that all memory operations after the LOCK operation will appear to happen after the LOCK operation with respect to the other components of the system. Memory operations that occur before a LOCK operation may appear to happen after it completes.
> (6) UNLOCK operations.
> This also acts as a one-way permeable barrier. It guarantees that all memory operations before the UNLOCK operation will appear to happen before the UNLOCK operation with respect to the other components of the system. Memory operations that occur after an UNLOCK operation may appear to happen before it completes.
> Because ARMv6 and above are weakly ordered, we need to guarantee that code after the lock executes only after the lock is taken, hence the barrier there.
> Likewise, in unlock we must guarantee that code before the unlock has executed before the lock is released, hence the smp_mb() at the beginning.
> Adding Russell and Arnd in cc to correct my statements :)
This is the patch that added this behavior:
commit 6d9b37a3a80195d317887ff81aad6a58a66954b5
Author: Russell King <rmk@dyn-67.arm.linux.org.uk>
Date:   Tue Jul 26 19:44:26 2005 +0100
[PATCH] ARM SMP: Add ARMv6 memory barriers
Convert explicit gcc asm-based memory barriers into smp_mb() calls. These change between barrier() and the ARMv6 data memory barrier instruction depending on whether ARMv6 SMP is enabled.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
---
 include/asm-arm/bitops.h   |  4 ++--
 include/asm-arm/locks.h    | 36 ++++++++++++++++++++++++------------
 include/asm-arm/spinlock.h | 53 ++++++++++++++++++++++++++++++++++++++---------------
 include/asm-arm/system.h   |  5 +++++
 4 files changed, 69 insertions(+), 29 deletions(-)
diff --git a/include/asm-arm/spinlock.h b/include/asm-arm/spinlock.h
index 9705d5e..1f906d0 100644
--- a/include/asm-arm/spinlock.h
+++ b/include/asm-arm/spinlock.h
@@ -8,9 +8,10 @@
 /*
  * ARMv6 Spin-locking.
  *
- * We (exclusively) read the old value, and decrement it.  If it
- * hits zero, we may have won the lock, so we try (exclusively)
- * storing it.
+ * We exclusively read the old value.  If it is zero, we may have
+ * won the lock, so we try exclusively storing it.  A memory barrier
+ * is required after we get a lock, and before we release it, because
+ * V6 CPUs are assumed to have weakly ordered memory.
  *
  * Unlocked value: 0
  * Locked value: 1
@@ -41,7 +42,9 @@ static inline void _raw_spin_lock(spinlock_t *lock)
 "	bne	1b"
 	: "=&r" (tmp)
 	: "r" (&lock->lock), "r" (1)
-	: "cc", "memory");
+	: "cc");
+
+	smp_mb();
 }
OK, so if I understand you correctly, you're saying that the memory barriers aren't fundamental to the correct operation of the spinlock itself, but are required to make all the memory accesses started inside the critical section actually complete inside the critical section, right?
I guess what I'm saying is: isn't it up to the calling code to decide exactly how this should be done, rather than the locking code? For example, if I'm just using a spinlock in place of a mutex in a time-critical section, where I want to avoid a call into the scheduler, and I'm not really poking registers in a memory-mapped device, just updating some shared data structure in normal cacheable memory, it seems inappropriate to use DMBs. I see examples of this in the ext4 code.
In real driver code, where we do access registers, shouldn't the driver be responsible for calling smp_mb where appropriate?
-Pete
On 31 October 2012 23:30, Peter Fordham <peter.fordham@gmail.com> wrote:
> OK, so if I understand you correctly, you're saying that the memory barriers aren't fundamental to the correct operation of the spinlock itself, but are required to make all the memory accesses started inside the critical section actually complete inside the critical section, right?
Partially.
> I guess what I'm saying is: isn't it up to the calling code to decide exactly how this should be done, rather than the locking code? For example, if I'm just using a spinlock in place of a mutex in a time-critical section, where I want to avoid a call into the scheduler, and I'm not really poking registers in a memory-mapped device, just updating some shared data structure in normal cacheable memory, it seems inappropriate to use DMBs. I see examples of this in the ext4 code.
> In real driver code, where we do access registers, shouldn't the driver be responsible for calling smp_mb where appropriate?
It's not about registers at all. Suppose you have the following code sequence:
global variable x = 0;

	fn-1()
	{
		while (1) {
			spin_lock();
			if (x == 1) {
				spin_unlock();
				return;
			}
			spin_unlock();
		}
	}

And in an interrupt handler:

	handler()
	{
		spin_lock();
		x = 1;

		if (some-other-event)
			x = 0;
		spin_unlock();
	}
Now, suppose some-other-event is returning true for us. Then the expected behaviour of this code is that the first function keeps running in an infinite loop.
But for ARMv6 and above, normal memory is weakly ordered: the compiler and the processor are free to move instructions that read or write memory before or after other instructions. They can also split one memory transaction into two, merge two into one, and so on.
Now, who guarantees that x is always read after the lock is taken in the first function? If the read of x moves before the instruction that claims the spinlock variable, x may be read at the moment the handler has set it to 1, and so before the lock is released by the handler.
So it is the responsibility of lock/unlock to make sure that memory accesses performed between them stay between them and don't leak out of the critical section. Without barriers this is not guaranteed on ARMv6 and above.
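If it helps, the same contract can be written down with C11 atomics, which make the one-way barriers explicit. This is just a sketch of the idea, not the kernel's implementation (names like lock_var are made up; on ARM the acquire/release operations expand to LDREX/STREX plus DMB):

	#include <stdatomic.h>

	atomic_int lock_var;	/* 0 = unlocked, 1 = locked */
	int x;			/* shared data, protected by lock_var */

	void lock(void)
	{
		int expected = 0;
		/* acquire: reads/writes after this cannot be reordered above it */
		while (!atomic_compare_exchange_weak_explicit(&lock_var, &expected, 1,
							      memory_order_acquire,
							      memory_order_relaxed))
			expected = 0;	/* a failed CAS overwrites expected; reset it */
	}

	void unlock(void)
	{
		/* release: reads/writes before this cannot be reordered below it */
		atomic_store_explicit(&lock_var, 0, memory_order_release);
	}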
STREX and LDREX only give you atomic load and store on SMP systems; they don't order the surrounding memory accesses.
Hope you understand what I am pointing at?
-- viresh
Yes, thanks. I think I've got the weakly-ordered thing straight in my mind now.
The DMB is needed in the SMP case because the other processor may see the memory accesses happening in a different order without it. This isn't an issue on a uniprocessor, since its view of ordering is self-consistent, so it's safe to use smp_mb rather than mb.
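That matches how I read the kernel's barrier definitions: smp_mb() only costs a real barrier on SMP builds. Roughly (a simplified sketch modelled on the asm-arm headers, not the exact source):

	#ifdef CONFIG_SMP
	#define smp_mb()	mb()		/* real DMB: ordering seen by other CPUs */
	#else
	#define smp_mb()	barrier()	/* compiler barrier only on UP builds    */
	#endif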
On 1 November 2012 00:20, Peter Fordham <peter.fordham@gmail.com> wrote:
> Yes, thanks. I think I've got the weakly-ordered thing straight in my mind now.
This should make it clearer:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babhidf...
Read the memory attributes section there.
-- viresh
On Wed, Oct 31, 2012 at 11:00:17AM -0700, Peter Fordham wrote:
> I guess what I'm saying is: isn't it up to the calling code to decide exactly how this should be done, rather than the locking code?
Please read the documentation on locking. See the section labelled "LOCKING FUNCTIONS" in Documentation/memory-barriers.txt.
This details the requirements of the locking functions that the kernel requires for correct operation.