On Tue, 7 Dec 2010 08:34:37 +0100 Robert Fekete robert.fekete@linaro.org said:
i ran into this many years ago (back in 2004 or so). interestingly amd athlons got zero slowdown - but i was totally bus limited, so no speedup either. i was pushing what the bus could do. 2 cores with 2 threads hammering away at memory heavy ops simply took 2x as long. on intel cpus (i didn't test later ones until i had a core 2) i noticed major degradation doing the same thing - as you see now.
it was actually an alpha blending loop with some inline mmx asm that worked on a fairly large rgba 32bit pixel dataset with each thread taking an alternate line OR an alternate region (top/bottom half) of the src and dest blocks. no locks, nothing else except hammering through data in very long loops.
my conclusion is that i'm somehow tripping some cache prefetch logic to go "silly" with 2 cores hammering away at different src/dest memory regions, and that there isn't a lot i can do that is "portable". the non-portable stuff would require that i figure out just how to fool the cache logic back into doing what i want per cpu variant and generation, detect every one of those properly, etc. etc.
operations that were compute heavy (for me that was full linearly interpolated upscaling on the cpu) got an almost 2x speedup on 2 cores. but for simple blending or copying (blitting) i was seeing half the speed of a single thread working on the same total src+dst data regions. unlike the paper from intel - my code did zero allocation during the process - it simply created some random data in memory then looped over it until i had done enough loops to have a good measurement. the same memory each time - big enough to be well beyond l2 cache sizes, so cache eviction shouldn't have been an issue either - i suspect it was all in the prefetch logic. never did prove it though. :)
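just to give an idea of the shape of the test, here is a from-memory sketch, not the original code - the buffer size, loop count and the plain-c blend are made up for illustration (the real thing used inline mmx asm), but the structure is the same: two threads, no locks, each hammering its own half of a big rgba buffer.

/* sketch: two threads each blend their own half of a large 32bit RGBA
 * src into dst, no locks, looping long enough to time from outside.
 * compile with e.g. gcc -O2 -pthread blend_test.c */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define W 2048
#define H 2048          /* 2048x2048x4 = 16MB, well beyond any L2 */
#define LOOPS 200

static uint32_t *src, *dst;

/* plain C stand-in for the mmx blend: roughly dst = (src + dst) / 2 per
 * channel, done crudely on the packed pixel for brevity */
static void blend_rows(int y0, int y1)
{
    for (int y = y0; y < y1; y++)
        for (int x = 0; x < W; x++) {
            uint32_t s = src[y * W + x], d = dst[y * W + x];
            dst[y * W + x] = ((s >> 1) & 0x7f7f7f7f) + ((d >> 1) & 0x7f7f7f7f);
        }
}

static void *worker(void *arg)
{
    int half = (int)(intptr_t)arg; /* 0 = top half, 1 = bottom half */
    for (int i = 0; i < LOOPS; i++)
        blend_rows(half * (H / 2), (half + 1) * (H / 2));
    return NULL;
}

int main(void)
{
    src = malloc(W * H * 4);
    dst = malloc(W * H * 4);
    for (int i = 0; i < W * H; i++) { src[i] = rand(); dst[i] = rand(); }

    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    free(src); free(dst);
    return 0;
}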
just thought i'd share :) maybe one day i'll come back to look into it again, but for now it has meant that i am putting off further multi-core cpu rendering work and considering breaking it up into pipeline stages, with expensive logic+compute stages feeding into dumber "blend and copy that junk" stages, instead of trying to be a gpu and just splitting the output buffer into regions and assigning a region per core (which is the current solution - by far the simplest for the code, but a win on some sides and a loss on others, so in the end just excess complexity for no real gain).
Hi everyone who knows an awful lot about memory management and multicore operations,
My name is Robert Fekete and I work in the ST-Ericsson Landing Team.
I have a question regarding multicore/SMP-aware memory operations in libc, and I hope you have an answer or can at least direct me to someone who might know better. I have already mailed Michael Hope, who recommended that I mail this list (of course).
The issue I am experiencing is that when I run a memset operation (on different, non-overlapping memory areas) on both cores simultaneously, they affect each other and prolong the work of each process/CPU by a factor of ~3-4 (measured with Lauterbach).
i.e. assume a memset op that takes around 1ms (some really big buffer in a for loop many times)
Non-simultaneous memsets (total time 2ms, each process execution time 1ms, x == running):
core1: xxxxoooo
core2: ooooxxxx

Simultaneous memsets (total time 3ms, each process 3ms):
core1: xxxxxxxxxxxx
core2: xxxxxxxxxxxx
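To be concrete, the measurement is of roughly this shape (a minimal sketch only - the buffer size, loop count, core numbers and the pthread/affinity plumbing here are my illustrative assumptions, not the exact test code, and the real timing was done externally with the Lauterbach):

/* sketch: memset two non-overlapping buffers, first one after the other,
 * then simultaneously with each thread pinned to its own core.
 * compile with e.g. gcc -O2 -pthread memset_test.c
 * (older glibc may also need -lrt for clock_gettime) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (8 * 1024 * 1024)  /* well beyond a 512k L2 */
#define LOOPS 100

struct job { char *buf; int cpu; };

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *do_memset(void *arg)
{
    struct job *j = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(j->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    for (int i = 0; i < LOOPS; i++)
        memset(j->buf, i & 0xff, BUF_SIZE);
    return NULL;
}

int main(void)
{
    struct job jobs[2] = {
        { malloc(BUF_SIZE), 0 },
        { malloc(BUF_SIZE), 1 },
    };
    pthread_t t[2];
    double t0;

    /* non-simultaneous: one core after the other */
    t0 = now();
    for (int i = 0; i < 2; i++) {
        pthread_create(&t[i], NULL, do_memset, &jobs[i]);
        pthread_join(t[i], NULL);
    }
    printf("serial:   %.3f s\n", now() - t0);

    /* simultaneous: both cores hammering memory at once */
    t0 = now();
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, do_memset, &jobs[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("parallel: %.3f s\n", now() - t0);

    free(jobs[0].buf);
    free(jobs[1].buf);
    return 0;
}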
Well, there are some factors that can explain parts of it. There is a bandwidth limitation: a single memset can already saturate the bus, so two memsets cannot possibly go faster in aggregate - with the bandwidth split between the cores, each memset should take roughly twice as long, so a proper value should be around 2ms (not 3-4ms). The L2$ is 512k, which may lead to a severe L2$ mess-up since the scheduler is extremely fair and fine-grained between the cores. The L1$ will also get hurt.
But the real issue is that two totally different memops (big ones, though) running in parallel mess things up for each other. That is not good if the process on core 1 (for example) is a high-prio compositor and the process on core 2 is a low-prio crap application: the low-prio crap app will severely mess things up for the compositor. OS priority does not propagate to bus-access priority.
Is there a way to make this behaviour a bit more deterministic... for the high-prio op at least? Something like a scalable memory-operation variant of libc?
I read an article about a similar issue on Intel CPUs; the solution in that case was a scalable memory allocator library: http://software.intel.com/en-us/blogs/2009/08/21/is-your-memory-management-m...
More academic reading on the phenomenon: http://books.google.com/books?id=NF-C2ZQZXekC&pg=PA353&lpg=PA353&...
BR /Robert Fekete