OK, so if I understand you correctly, you're saying that the memory barriers aren't fundamental to the correct operation of the spinlock itself, but are required to make all the memory accesses started inside the critical section actually complete inside the critical section, right?
I guess what I'm asking is: shouldn't it be up to the calling code, rather than the locking code, to decide exactly how this is done? For example, if I'm just using a spinlock in place of a mutex in a time-critical section where I want to avoid a call into the scheduler, and I'm not poking registers in a memory-mapped device, just updating some shared data structure in normal cacheable memory, it seems inappropriate to use DMBs. I see examples of this in the ext4 code.
In real driver code where we do access registers, shouldn't the driver be responsible for calling smp_mb where appropriate?
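To make the question concrete, here is a minimal user-space sketch of the kind of lock I mean. It uses C11 atomics rather than the kernel's actual ARM implementation, and every name in it (sketch_spinlock_t, sketch_spin_lock, bump_counter and so on) is made up for illustration. The acquire and release orderings mark the two places where the lock itself carries ordering; on ARM I'd expect those to turn into barrier instructions or acquire/release loads and stores, depending on the architecture version and compiler.

```c
#include <stdatomic.h>

/* Hypothetical names throughout; this is NOT the kernel's spinlock. */
typedef struct {
    atomic_flag locked;
} sketch_spinlock_t;

#define SKETCH_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static inline void sketch_spin_lock(sketch_spinlock_t *l)
{
    /* Acquire: accesses after this call may not be reordered before it. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ; /* spin until the current holder clears the flag */
}

static inline void sketch_spin_unlock(sketch_spinlock_t *l)
{
    /* Release: accesses before this call may not be reordered after it. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static sketch_spinlock_t counter_lock = SKETCH_SPINLOCK_INIT;
static long shared_counter; /* ordinary cacheable memory, no device registers */

static void bump_counter(void)
{
    sketch_spin_lock(&counter_lock);
    shared_counter++;              /* the critical section */
    sketch_spin_unlock(&counter_lock);
}
```

So the question is whether the ordering in sketch_spin_lock() and sketch_spin_unlock() belongs there, or whether a caller like bump_counter() should be the one to ask for it.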
-Pete