Hi everyone who knows an awful lot about memory management and multicore operation,
My name is Robert Fekete and I work in the ST-Ericsson Landing Team.
I have a question regarding multicore/SMP-aware memory operations in libc, and I hope you have an answer or can at least direct me to someone who might know better. I have already mailed Michael Hope, who recommended I mail this list (of course).
The issue I am experiencing is that when I run a memset operation (on different, non-overlapping memory areas) on both cores simultaneously, they affect each other and prolong each process's/CPU's work by a factor of ~3-4 (measured with Lauterbach).
i.e. assume a memset op that takes around 1 ms (some really big buffer in a for loop, many times):
Non-simultaneous memsets (total time 2 ms, each process executes for 1 ms; x == running):
  core1: xxxxoooo
  core2: ooooxxxx

Simultaneous memsets (total time 3 ms, each process 3 ms):
  core1: xxxxxxxxxxxx
  core2: xxxxxxxxxxxx
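For reference, roughly what my test does - a minimal sketch in C, where the buffer size and loop count are illustrative rather than my exact values, and pinning each thread to a core (pthread_setaffinity_np) is left out for brevity:

/* Minimal sketch of the simultaneous-memset test described above.
 * Buffer size and loop count are illustrative, not the exact test values.
 * Build: gcc -O2 -std=gnu99 -pthread memset_race.c -o memset_race */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUF_SIZE (8 * 1024 * 1024)   /* well beyond the 512k L2 */
#define LOOPS    100

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void *worker(void *arg)
{
    char *buf = arg;
    double t0 = now_sec();
    for (int i = 0; i < LOOPS; i++)
        memset(buf, i & 0xff, BUF_SIZE);   /* non-overlapping area per thread */
    printf("buffer %p: %.3f s\n", (void *)buf, now_sec() - t0);
    return NULL;
}

int main(void)
{
    char *a = malloc(BUF_SIZE), *b = malloc(BUF_SIZE);
    pthread_t t1, t2;

    /* serial baseline: one memset loop at a time */
    worker(a);
    worker(b);

    /* simultaneous case: one loop per core; compare per-thread times */
    pthread_create(&t1, NULL, worker, a);
    pthread_create(&t2, NULL, worker, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}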
Well, there are some factors that can explain part of it. There is a bandwidth limitation that already peaks with a single memset, meaning two memsets cannot possibly go faster, so a proper value would be around 2 ms (not 3-4 ms). The L2$ is 512k, which may lead to severe L2$ thrashing, since the scheduler is extremely fair and fine-grained between the cores. The L1$ will also get hurt.
But the real issue is that two totally different memops (big ones, though) running in parallel mess things up for each other. Not good if one of the processes, on core 1 for example, is a high-priority compositor and the second process on core 2 is a low-priority crap application: the low-prio app will severely disturb the compositor. OS priority does not propagate to bus-access priority.
Is there a way to make this behaviour a bit more deterministic, at least for the high-prio op - something like a scalable memory-operation libc variant?
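To make the question concrete, the kind of workaround I have in mind for the low-priority side would be something like a chunked memset that backs off between chunks. This is purely a hypothetical sketch: the chunk size is a guess, and sched_yield() only creates gaps in the store stream; it gives the bus no real notion of priority.

/* Hypothetical "polite" memset for a low-priority task: fill in chunks
 * and yield between them, so a high-priority task on the other core
 * gets windows with the bus to itself. CHUNK is a guess, not tuned. */
#include <sched.h>
#include <stddef.h>
#include <string.h>

#define CHUNK (64 * 1024)

void memset_lowprio(void *dst, int c, size_t n)
{
    char *p = dst;
    while (n > CHUNK) {
        memset(p, c, CHUNK);
        p += CHUNK;
        n -= CHUNK;
        sched_yield();   /* back off; OS prio does not reach the bus */
    }
    memset(p, c, n);
}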
I read an article about a similar issue on Intel CPUs; the solution in that case is a scalable memory allocator library: http://software.intel.com/en-us/blogs/2009/08/21/is-your-memory-management-m...
More academic reading on the phenomenon: http://books.google.com/books?id=NF-C2ZQZXekC&pg=PA353&lpg=PA353&...
BR /Robert Fekete
On Tue, 7 Dec 2010 08:34:37 +0100 Robert Fekete robert.fekete@linaro.org said:
i ran into this many years ago (back in 2004 or so). interestingly, amd athlons got zero slowdown - but i was totally bus-limited, so no speedup either; i was pushing what the bus could do. 2 cores with 2 threads hammering away at memory-heavy ops simply took 2x as long. on intel cpus (i didn't test later ones until i had a core 2) i noticed major degradation doing the same thing - as you see now.
it was actually an alpha-blending loop with some inline mmx asm that worked on a fairly large rgba 32-bit pixel dataset, with each thread taking an alternate line OR an alternate region (top/bottom half) of the src and dest blocks. no locks, nothing else except hammering through data in very long loops.
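for illustration, the per-thread split was shaped roughly like this - plain c standing in for the inline mmx asm, with the blend math simplified and all names mine:

/* sketch of the line-interleaved split: thread `tid` of NTHREADS blends
 * every NTHREADS-th row. plain c stands in for the original mmx asm and
 * the blend is a crude 50/50 average, not the real alpha math. */
#include <stdint.h>

#define NTHREADS 2

struct job {
    uint32_t *src, *dst;
    int w, h;
    int tid;   /* 0 .. NTHREADS-1 */
};

static void blend_rows(struct job *j)
{
    for (int y = j->tid; y < j->h; y += NTHREADS) {
        uint32_t *s = j->src + y * j->w;
        uint32_t *d = j->dst + y * j->w;
        for (int x = 0; x < j->w; x++)
            d[x] = ((s[x] >> 1) & 0x7f7f7f7f) +
                   ((d[x] >> 1) & 0x7f7f7f7f);   /* per-channel average */
    }
}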
my conclusion is that i'm somehow tripping some cache prefetch logic into going "silly" with 2 cores hammering away at different src/dest memory regions, and that there isn't a lot i can do that is "portable". non-portable stuff would require that i figure out just how to fool the cache logic back into doing what i want per cpu variant and generation, detect every one of those properly, etc. etc.
operations that were compute-heavy (for me that was full linearly-interpolated upscaling on the cpu) got an almost 2x speedup on 2 cores. but for simple blending or copying (blitting) i was seeing half the speed of a single thread working on the same total src+dst data regions. unlike the paper from intel, my code did zero allocation during the process - it simply created some random data in memory, then looped over it until i had done enough loops to have a good measurement. the same memory each time - big enough to be well beyond l2 cache sizes, so cache eviction shouldn't have been an issue either. i suspect it was all in the prefetch logic; never did prove it though. :)
just thought i'd share :) maybe one day i'll come back and look into it again, but for now it has meant that i am putting off further multi-core cpu rendering work and considering breaking it up into pipeline stages, with expensive logic+compute stages feeding into dumber "blend and copy that junk" stages, instead of trying to be a gpu and just splitting the output buffer into regions and assigning a region per core (the current solution - simpler for the code by far, a win on some sides and a loss on others, so in the end just excess complexity for no real gain).
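the pipeline version would hang off a small bounded queue between the stages - a hypothetical sketch, not code i have actually written:

/* hypothetical pipeline plumbing: a compute stage pushes finished tiles
 * through a small bounded queue to a dumb "blend and copy" stage. */
#include <pthread.h>

#define QLEN 4

struct tile_queue {
    void *slots[QLEN];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static struct tile_queue q = {
    .lock      = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full  = PTHREAD_COND_INITIALIZER,
};

static void queue_push(struct tile_queue *tq, void *tile)
{
    pthread_mutex_lock(&tq->lock);
    while (tq->count == QLEN)                    /* consumer is behind */
        pthread_cond_wait(&tq->not_full, &tq->lock);
    tq->slots[tq->tail] = tile;
    tq->tail = (tq->tail + 1) % QLEN;
    tq->count++;
    pthread_cond_signal(&tq->not_empty);
    pthread_mutex_unlock(&tq->lock);
}

static void *queue_pop(struct tile_queue *tq)
{
    pthread_mutex_lock(&tq->lock);
    while (tq->count == 0)                       /* producer is behind */
        pthread_cond_wait(&tq->not_empty, &tq->lock);
    void *tile = tq->slots[tq->head];
    tq->head = (tq->head + 1) % QLEN;
    tq->count--;
    pthread_cond_signal(&tq->not_full);
    pthread_mutex_unlock(&tq->lock);
    return tile;
}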
On 7 December 2010 07:34, Robert Fekete robert.fekete@linaro.org wrote:
> The issue I am experiencing is that when I run a memset operation (on different, non-overlapping memory areas) on both cores simultaneously, they affect each other and prolong each process's/CPU's work by a factor of ~3-4 (measured with Lauterbach).
>
> i.e. assume a memset op that takes around 1 ms (some really big buffer in a for loop, many times):
>
> Non-simultaneous memsets (total time 2 ms, each process executes for 1 ms; x == running):
>   core1: xxxxoooo
>   core2: ooooxxxx
>
> Simultaneous memsets (total time 3 ms, each process 3 ms):
>   core1: xxxxxxxxxxxx
>   core2: xxxxxxxxxxxx
>
> Well, there are some factors that can explain part of it. There is a bandwidth limitation that already peaks with a single memset, meaning two memsets cannot possibly go faster, so a proper value would be around 2 ms (not 3-4 ms). The L2$ is 512k, which may lead to severe L2$ thrashing, since the scheduler is extremely fair and fine-grained between the cores. The L1$ will also get hurt.
Since I already had a memset bandwidth test and access to a 4-core A9 vexpress, I thought I'd recreate it; the results are kind of fun. These are loops writing 256k at a time with memset (which may be dangerously close to the L2 size) in a tight loop, printing out a MB/s figure every few seconds:
1x memset loop: ~1900 MB/s
2x memset loop: 2x1340 MB/s -> 2680 MB/s total
3x memset loop: 2x450 MB/s + 1x270 (!!!) -> 1170 MB/s total
4x memset loop: 3x160 MB/s + 1x280 (!!!) -> 760 MB/s total
This is using libc memset. So I guess what we're seeing there is that one CPU isn't managing to saturate the bus, whereas two between them are saturating it and getting a fair share each. What's happening at 3 and 4 CPUs seems similar to your case, where things drop badly; I really don't know why the behaviour is so asymmetric in the 3- and 4-core cases.
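The loop is roughly this shape, in case anyone wants to reproduce it (the reporting interval is arbitrary; start one copy per core to load N cores):

/* Bandwidth meter: memset a 256k block in a tight loop and print a
 * MB/s figure every few seconds. Start N copies to load N cores. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BLOCK (256 * 1024)

int main(void)
{
    char *buf = malloc(BLOCK);
    struct timeval t0, t1;
    unsigned long long bytes = 0;

    gettimeofday(&t0, NULL);
    for (;;) {
        memset(buf, 0, BLOCK);
        bytes += BLOCK;
        gettimeofday(&t1, NULL);
        double dt = (t1.tv_sec - t0.tv_sec) +
                    (t1.tv_usec - t0.tv_usec) / 1e6;
        if (dt >= 5.0) {                 /* report every ~5 seconds */
            printf("%.0f MB/s\n", bytes / dt / 1e6);
            bytes = 0;
            t0 = t1;
        }
    }
}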
(When using a simple armv7 routine I wrote, the results were similar, except in the 2-core case where I didn't seem to get the benefit - an interesting indicator that how the code is written might leave it more or less susceptible to some of these interactions.)
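(That routine was nothing clever - roughly the following shape, shown as C here rather than the actual armv7 asm:)

/* A naive word-at-a-time fill, roughly the shape of the "simple" routine
 * mentioned above; shown in C, not the actual armv7 asm. */
#include <stddef.h>
#include <stdint.h>

void memset_simple(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    uint32_t v = (uint8_t)c * 0x01010101u;   /* replicate byte into a word */

    while (n && ((uintptr_t)p & 3)) {        /* align to a 4-byte boundary */
        *p++ = (unsigned char)c;
        n--;
    }
    uint32_t *w = (uint32_t *)p;
    while (n >= 4) {                         /* bulk word stores */
        *w++ = v;
        n -= 4;
    }
    p = (unsigned char *)w;
    while (n--)                              /* byte tail */
        *p++ = (unsigned char)c;
}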
> But the real issue is that two totally different memops (big ones, though) running in parallel mess things up for each other. Not good if one of the processes, on core 1 for example, is a high-priority compositor and the second process on core 2 is a low-priority crap application: the low-prio app will severely disturb the compositor. OS priority does not propagate to bus-access priority.
I don't think it's unusual to have this problem; if you are sharing a limited resource, be it disk bandwidth, RAM bandwidth or any bus in between, you're going to hit the same issue. I also don't think it's too unusual to find the non-linearity where, when lots of things fight over the same resource, the total drops - although I am surprised it's that bad.
> Is there a way to make this behaviour a bit more deterministic, at least for the high-prio op - something like a scalable memory-operation libc variant?
I guess one thing to understand is where the bottlenecks are, and whether there are counters on each bus to see what is happening. I can see that a system could be designed to prioritise memory ops from one core over the others; I don't know whether any ARMs do that.
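On the counter side: on Linux the perf events syscall (2.6.31+) can at least count cache misses per task, even where raw bus counters aren't exposed. A minimal sketch, with error handling mostly omitted:

/* Count cache misses around a memset via Linux perf events.
 * SoC-specific bus counters, if any, are not covered here. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    static char buf[4 * 1024 * 1024];
    struct perf_event_attr attr;
    long long misses = 0;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;

    /* this task, any cpu, no group, no flags */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    memset(buf, 0xaa, sizeof(buf));
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    read(fd, &misses, sizeof(misses));
    printf("cache misses: %lld\n", misses);
    close(fd);
    return 0;
}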
> I read an article about a similar issue on Intel CPUs; the solution in that case is a scalable memory allocator library:
> http://software.intel.com/en-us/blogs/2009/08/21/is-your-memory-management-m...
>
> More academic reading on the phenomenon:
> http://books.google.com/books?id=NF-C2ZQZXekC&pg=PA353&lpg=PA353&...
Dave