 
On Tue, Aug 16, 2022 at 8:02 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Aug 16, 2022 at 10:49 AM Jon Nettleton jon@solid-run.com wrote:
It is moot if Linus has already taken the patch, but with a stock kernel config I am still seeing a slight performance dip, though only ~1-2%, in the specific tests I was running.
It would be interesting to hear if you can pinpoint in the profiles where the time is spent.
It might be some random place that really doesn't care about ordering at all, and then we could easily rewrite _that_ particular case to do the unordered test explicitly, ie something like
-	if (test_and_set_bit()) ...
+	if (test_bit() || test_and_set_bit()) ...

or even introduce an explicitly unordered "test_and_set_bit_relaxed()" thing.
Linus
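For readers following the thread, the "test first, then test-and-set" pattern suggested above can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: the `sketch_*` helper names are hypothetical, they are not the kernel's bitops, and the memory orders chosen here (a relaxed load for the test, a sequentially consistent read-modify-write for the set) are one plausible mapping rather than what any particular architecture's kernel implementation does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins, NOT the kernel's bitops.
 * nr must be smaller than the number of bits in unsigned long. */

/* Always performs an atomic read-modify-write with full ordering.
 * Returns the previous value of the bit. */
static inline bool sketch_test_and_set_bit(unsigned int nr, atomic_ulong *word)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or_explicit(word, mask, memory_order_seq_cst) & mask) != 0;
}

/* Relaxed load only: no read-modify-write and no ordering guarantee. */
static inline bool sketch_test_bit(unsigned int nr, atomic_ulong *word)
{
	unsigned long mask = 1UL << nr;

	return (atomic_load_explicit(word, memory_order_relaxed) & mask) != 0;
}

/* If the bit is already set, return after the relaxed load and skip the
 * read-modify-write entirely; otherwise fall back to the ordered
 * test-and-set. Returns the previous value of the bit. */
static inline bool sketch_test_and_set_bit_fastpath(unsigned int nr, atomic_ulong *word)
{
	return sketch_test_bit(nr, word) || sketch_test_and_set_bit(nr, word);
}

/* Small single-threaded usage check. */
static inline void sketch_selfcheck(void)
{
	atomic_ulong flags = 0;

	assert(!sketch_test_and_set_bit_fastpath(3, &flags)); /* was clear, now set */
	assert(sketch_test_and_set_bit_fastpath(3, &flags));  /* already set: load only */
	assert(atomic_load(&flags) == (1UL << 3));
}
```

Note that when the bit is already set, the fast path implies no ordering at all, so it only suits callers that, as described above, do not care about ordering in that case.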
This is very interesting. The additional performance overhead doesn't seem to be coming from within the kernel but from userspace. Comparing patched and unpatched kernels, I am seeing more cycles being taken up by glibc atomics like __aarch64_cas4_acq and __aarch64_ldadd4_acq_rel.
I need to test further to see if there is less effect on a system with fewer cores. This is a 16-core Cortex-A72; it is possible this is less of an issue on 4-core A72s and A53s.
-Jon