On Tue, 7 Dec 2010 08:34:37 +0100 Robert Fekete robert.fekete@linaro.org said:
i ran into this many years ago (back in 2004 or so). interestingly amd athlons got zero slowdown - but i was totally bus limited, so no speedup either. i was pushing what the bus could do. 2 cores with 2 threads hammering away at memory heavy ops simply took 2x as long. on intel cpus (i didn't test later ones until i had a core 2) i noticed major degradation doing the same thing - as you see now.
it was actually an alpha blending loop with some inline mmx asm that worked on a fairly large rgba 32bit pixel dataset with each thread taking an alternate line OR an alternate region (top/bottom half) of the src and dest blocks. no locks, nothing else except hammering through data in very long loops.
my conclusion is that i'm somehow tripping some cache prefetch logic to go "silly" with 2 cores hammering away at different src/dest memory regions, and that there isn't a lot i can do that is "portable". the non-portable stuff would require that i figure out just how to fool the cache logic back into doing what i want per cpu variant and generation, detect every one of those properly, etc. etc.
operations that were compute heavy (for me that was full linearly interpolated upscaling on the cpu) got an almost 2x speedup on 2 cores. but for simple blending or copying (blitting) i was seeing half the speed of a single thread working on the same total src+dst data regions. unlike the paper from intel - my code did zero allocation during the process - it simply created some random data in memory then looped over it until i had done enough loops to have a good measurement. the same memory each time - big enough to be well beyond l2 cache sizes, so cache eviction shouldn't have been an issue either - i suspect it was all in the prefetch logic. never did prove it though. :)
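just to give an idea of the shape of the test, here is a from-memory sketch, not the original code - the buffer size, loop count and the plain-c blend are made up for illustration (the real thing used inline mmx asm), but the structure is the same: two threads, no locks, each hammering its own half of a big rgba buffer.

/* sketch: two threads each blend their own half of a large 32bit RGBA
 * src into dst, no locks, looping long enough to time from outside.
 * compile with e.g. gcc -O2 -pthread blend_test.c */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define W 2048
#define H 2048          /* 2048x2048x4 = 16MB, well beyond any L2 */
#define LOOPS 200

static uint32_t *src, *dst;

/* plain C stand-in for the mmx blend: roughly dst = (src + dst) / 2 per
 * channel, done crudely on the packed pixel for brevity */
static void blend_rows(int y0, int y1)
{
    for (int y = y0; y < y1; y++)
        for (int x = 0; x < W; x++) {
            uint32_t s = src[y * W + x], d = dst[y * W + x];
            dst[y * W + x] = ((s >> 1) & 0x7f7f7f7f) + ((d >> 1) & 0x7f7f7f7f);
        }
}

static void *worker(void *arg)
{
    int half = (int)(intptr_t)arg; /* 0 = top half, 1 = bottom half */
    for (int i = 0; i < LOOPS; i++)
        blend_rows(half * (H / 2), (half + 1) * (H / 2));
    return NULL;
}

int main(void)
{
    src = malloc(W * H * 4);
    dst = malloc(W * H * 4);
    for (int i = 0; i < W * H; i++) { src[i] = rand(); dst[i] = rand(); }

    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    free(src); free(dst);
    return 0;
}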
just thought i'd share :) maybe one day i'll come back to look into it again, but for now it has meant that i am putting off further multi-core cpu rendering work and considering breaking it up into pipeline stages, with expensive logic+compute stages feeding into dumber "blend and copy that junk" stages, instead of trying to be a gpu and just splitting the output buffer into regions and assigning a region per core (which is the current solution - by far the simplest for the code, but a win on some sides and a loss on others, so in the end just excess complexity for no real gain).
Hi everyone who knows an awful lot about memory management and multicore operations,
My name is Robert Fekete and I work in the ST-Ericsson Landing Team.
I have a question regarding multicore/SMP-aware memory operations in libc, and I hope you have an answer or can at least direct me to someone who might know better. I have already mailed Michael Hope, who recommended that I mail this list (of course).
The issue I am experiencing is that when I run a memset operation (on different, non-overlapping memory areas) on both cores simultaneously, they affect each other and prolong the work of each process/CPU by a factor of ~3-4 (measured with Lauterbach).
i.e. assume a memset op that takes around 1ms (some really big buffer in a for loop many times)
Non-simultaneous memsets (total time 2ms, each process execution time 1ms, x == running):
core1: xxxxoooo
core2: ooooxxxx

Simultaneous memsets (total time 3ms, each process 3ms):
core1: xxxxxxxxxxxx
core2: xxxxxxxxxxxx
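To be concrete, the measurement is of roughly this shape (a minimal sketch only - the buffer size, loop count, core numbers and the pthread/affinity plumbing here are my illustrative assumptions, not the exact test code, and the real timing was done externally with the Lauterbach):

/* sketch: memset two non-overlapping buffers, first one after the other,
 * then simultaneously with each thread pinned to its own core.
 * compile with e.g. gcc -O2 -pthread memset_test.c
 * (older glibc may also need -lrt for clock_gettime) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (8 * 1024 * 1024)  /* well beyond a 512k L2 */
#define LOOPS 100

struct job { char *buf; int cpu; };

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *do_memset(void *arg)
{
    struct job *j = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(j->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    for (int i = 0; i < LOOPS; i++)
        memset(j->buf, i & 0xff, BUF_SIZE);
    return NULL;
}

int main(void)
{
    struct job jobs[2] = {
        { malloc(BUF_SIZE), 0 },
        { malloc(BUF_SIZE), 1 },
    };
    pthread_t t[2];
    double t0;

    /* non-simultaneous: one core after the other */
    t0 = now();
    for (int i = 0; i < 2; i++) {
        pthread_create(&t[i], NULL, do_memset, &jobs[i]);
        pthread_join(t[i], NULL);
    }
    printf("serial:   %.3f s\n", now() - t0);

    /* simultaneous: both cores hammering memory at once */
    t0 = now();
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, do_memset, &jobs[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("parallel: %.3f s\n", now() - t0);

    free(jobs[0].buf);
    free(jobs[1].buf);
    return 0;
}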
Well, there are some factors that can explain parts of it. There is a bandwidth limitation: a single memset can already saturate the bus, so two memsets cannot possibly go faster in aggregate - with the bandwidth split between the cores, each memset should take roughly twice as long, so a proper value should be around 2ms (not 3-4ms). The L2$ is 512k, which may lead to a severe L2$ mess-up since the scheduler is extremely fair and fine-grained between the cores. The L1$ will also get hurt.
But the real issue is that two totally different memops (big ones, though) running in parallel mess things up for each other. That is not good if the process on core 1 (for example) is a high-prio compositor and the process on core 2 is a low-prio crap application: the low-prio crap app will severely mess things up for the compositor. OS priority does not propagate to bus-access priority.
Is there a way to make this behaviour a bit more deterministic... for the high-prio op at least? Something like a scalable memory-operation variant of libc?
I read an article about a similar issue on Intel CPUs; the solution in that case was a scalable memory allocator library: http://software.intel.com/en-us/blogs/2009/08/21/is-your-memory-management-m...
More academic reading on the phenomenon: http://books.google.com/books?id=NF-C2ZQZXekC&pg=PA353&lpg=PA353&...
BR /Robert Fekete