Hello Sumit,

Thanks for the reply.

I am using the PandaBoard. I am trying to partition data-parallel programs and run them in parallel on the ARM A9s, the ARM M3s and the DSP. All the processors have to work on the same arrays, so I allocate memory for these arrays using GEM buffers and pass the physical addresses of the arrays to the remote processors using rpmsg. Since there might be cross-processor dependencies, I need some form of cache coherency, which I currently get through the attachment-detachment mechanism. But this seems to be expensive.
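
For illustration, the row split behind start_indx and end_indx in the loop quoted further down could be computed as in the sketch below. The names row_range and partition_rows and the even-split policy are assumptions made for this example; a real split would probably weight the A9s, M3s and DSP differently.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open row range [start, end) owned by one worker. */
struct row_range {
    size_t start;
    size_t end;
};

/*
 * Hypothetical helper: split n rows as evenly as possible among
 * nworkers workers and return the range owned by `worker`.
 * The first (n % nworkers) workers each get one extra row.
 */
static struct row_range partition_rows(size_t n, size_t nworkers, size_t worker)
{
    assert(nworkers > 0 && worker < nworkers);

    size_t base = n / nworkers;
    size_t extra = n % nworkers;
    struct row_range r;

    r.start = worker * base + (worker < extra ? worker : extra);
    r.end = r.start + base + (worker < extra ? 1 : 0);
    return r;
}
```

Each processor would then run the inner loops over rows r.start to r.end - 1 only, while row k is read by everyone in iteration k, which is where the cross-processor dependency comes from.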

Documentation/dma-buf-sharing.txt says:

"Because existing importing subsystems might presume coherent mappings for userspace, the exporter needs to set up a coherent mapping. If that's not possible, it needs to fake coherency by manually shooting down ptes when leaving the cpu domain and flushing caches at fault time. ..... If the above shootdown dance turns out to be too expensive in certain scenarios, we can extend dma-buf with a more explicit cache tracking scheme for userspace mappings. But the current assumption is that using mmap is always a slower path, so some inefficiencies should be acceptable".

I believe the attachment-detachment mechanism shoots down PTEs and flushes caches to fake coherency. But I guess the PTEs do not have to be shot down, since the mappings don't change during program execution. Only the cache needs to be flushed.
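
To make the last point concrete, explicit cache maintenance driven from userspace around CPU access could look like the sketch below. It assumes a kernel that exposes the DMA_BUF_IOCTL_SYNC ioctl in <linux/dma-buf.h> (later mainline kernels do; whether a given PandaBoard kernel does is an assumption) and an exporter that implements the corresponding cache maintenance. The helper names dmabuf_sync, cpu_access_begin and cpu_access_end are hypothetical.

```c
#include <errno.h>
#include <linux/dma-buf.h>   /* struct dma_buf_sync, DMA_BUF_IOCTL_SYNC */
#include <sys/ioctl.h>

/*
 * Hypothetical helper: issue one DMA_BUF_IOCTL_SYNC on a dma-buf fd.
 * Returns 0 on success, -1 with errno set on failure.
 * The call is retried if it is interrupted.
 */
static int dmabuf_sync(int dmabuf_fd, unsigned long long flags)
{
    struct dma_buf_sync sync = { .flags = flags };
    int ret;

    do {
        ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

    return ret;
}

/* Call before the CPU reads or writes the mmap'ed buffer. */
static int cpu_access_begin(int dmabuf_fd)
{
    return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW);
}

/* Call after the CPU is done, before other processors use the buffer. */
static int cpu_access_end(int dmabuf_fd)
{
    return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW);
}
```

Each outer iteration would then bracket its accesses with cpu_access_begin(fd) and cpu_access_end(fd) instead of a detach/attach pair. Whether this is cheaper than a full attach/detach cycle depends on what the exporter does in its begin/end CPU access callbacks.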

You say that, "To me, using attachment detachment mechanism to 'simulate' cache coherency sounds fairly wrong - Ideally, cache coherency management is to be done by the exporter in your system".

Could you give some details on how the exporter (the A9 here) can simulate cache coherency, and which Linux APIs or functions to use? If you can point me to a similar implementation in the Linux kernel, that would be very helpful.

Thanks in advance.

--Kiran

On Thu, Mar 27, 2014 at 5:23 AM, Sumit Semwal <sumit.semwal@linaro.org> wrote:

Hi Kiran,

On 24 March 2014 02:35, kiran <kiranchandramohan@gmail.com> wrote:
> Hi,
>
> I was looking at some code (given below) which seems to perform very badly
> when attachments and detachments are used to simulate cache coherency.

Care to give more details on your use case, which processor(s) are you
targeting etc? To me, using attachment detachment mechanism to
'simulate' cache coherency sounds fairly wrong - Ideally, cache
coherency management is to be done by the exporter in your system. I
presume you have had a chance to read through the
Documentation/dma-buf-sharing.txt file, which gives some good details
on what to expect from dma-bufs etc?

Best regards,
~Sumit.

> In the code below, when remote_attach is false (i.e. no remote processors),
> using just the two A9 cores, the following code runs in 8.8 seconds. But when
> remote_attach is true, even though other cores are also executing and
> sharing the workload, the following code takes 52.7 seconds. This shows
> that detach and attach are very heavy for this kind of code. (The system call
> detach performs dma_buf_unmap_attachment and dma_buf_detach; the system call
> attach performs dma_buf_attach and dma_buf_map_attachment.)
>
> for (k = 0; k < N; k++) {
>     if (remote_attach) {
>         detach(path);
>         attach(path);
>     }
>
>     for (i = start_indx; i < end_indx; i++) {
>         for (j = 0; j < N; j++) {
>             if (path[i][j] < (path[i][k] + path[k][j])) {
>                 path[i][j] = path[i][k] + path[k][j];
>             }
>         }
>     }
> }
>
> I would like to manage the cache explicitly and flush cache lines rather
> than pages to reduce overhead. I also want to access these buffers from
> userspace. I can change some kernel code for this. Where should I start?
>
> Thanks in advance.
>
> --Kiran

--
Thanks and regards,

Sumit Semwal
Graphics Engineer - Graphics working group
Linaro.org │ Open source software for ARM SoCs

--
regards,
Kiran C