Thanks Rob. I will try performing cache maintenance with dma_{map,unmap}_page(). 
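
Roughly, the plan would be to replace the per-iteration detach()/attach()
with explicit cache maintenance around the compute, something like the
sketch below (cache_sync(), SYNC_FROM_DEVICE and SYNC_TO_DEVICE are
hypothetical placeholders, not existing APIs, and the per-iteration
cross-core synchronization is left out):

for (k = 0; k < N; k++) {
    if (remote_attach) {
        /* Invalidate before reading rows that the remote cores
         * updated in the previous iteration (hypothetical call). */
        cache_sync(path, SYNC_FROM_DEVICE);
    }

    for (i = start_indx; i < end_indx; i++) {
        for (j = 0; j < N; j++) {
            if (path[i][j] < (path[i][k] + path[k][j]))
                path[i][j] = path[i][k] + path[k][j];
        }
    }

    if (remote_attach) {
        /* Clean/flush after writing, before the remote cores read
         * this core's rows (hypothetical call). */
        cache_sync(path, SYNC_TO_DEVICE);
    }
}

Whether this works on whole buffers or only on the rows actually touched
would depend on what the flush interface ends up exposing.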

--Kiran


On Fri, Mar 28, 2014 at 12:08 PM, Rob Clark <robdclark@gmail.com> wrote:
On Fri, Mar 28, 2014 at 7:16 AM, kiran <kiranchandramohan@gmail.com> wrote:
> Hello Sumit,
>
> Sorry, I might have misled you; my mistake was saying that the A9 is
> the exporter.
>
> Yes, I am using dri gembuffers here, and it was Rob Clark's suggestion
> that I do so. He suggested using dri gembuffers probably because that
> was the easiest way to get started; this was over a year ago, when my
> use case was much simpler. Using the dri gembuffers, a lot of things
> came for free: (contiguous?) memory allocation from the kernel, a 1:1
> mapping of virtual to physical addresses on the remote processors,
> mmap for user space, the dmabuf implementation, etc.
>
> Yes, I am using dri's implementation as an attachment-detachment
> mechanism. I understand that DRI didn't have a need to implement this,
> but what I was saying is that I need to implement it. I don't know
> whether it is easier to implement another dma-buf exporter or to hack
> dri's implementation for the same purpose.
>
> One of the reasons I asked this question is that you say in
> Documentation/dma-buf-sharing.txt that dma-buf could be extended with
> a more explicit cache tracking scheme for userspace mappings. It is
> this explicit cache tracking scheme for userspace mappings that I am
> asking about. How would you go about implementing it?
> "If the above shootdown dance turns out to be too expensive in certain
> scenarios, we can extend dma-buf with a more explicit cache tracking scheme
> for userspace mappings. But the current assumption is that using mmap is
> always a slower path, so some inefficiencies should be acceptable."
>
> I think what currently happens is that, on an mmap page fault, the
> driver marks the pages that are accessed, and during map attachment it
> calls unmap_mapping_range() to flush the PTEs and the cache(?). The
> sources are in the files given below.
> http://lxr.free-electrons.com/source/drivers/gpu/drm/omapdrm/omap_gem_dmabuf.c
> http://lxr.free-electrons.com/source/drivers/gpu/drm/omapdrm/omap_gem.c
>

The PTE shootdown is there so that we hit the fault handler and
therefore know when the CPU touches the buffer again.  I suspect what
you want is some API exposed to userspace to explicitly flush buffers
(or portions of buffers), to avoid the performance overhead of
page-table shootdown and fault handling.
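
Purely as an illustration of the shape such an interface could take
(nothing below exists in the kernel; the structure, flags and ioctl
number are invented for this sketch):

/* Hypothetical explicit-flush API on a dma-buf fd -- illustration
 * only; these names and numbers are not real UAPI. */
#include <linux/ioctl.h>
#include <linux/types.h>

struct dma_buf_cache_sync {
        __u64 offset;   /* byte offset into the dma-buf */
        __u64 length;   /* number of bytes to clean/invalidate */
        __u32 flags;    /* direction of the access being bracketed */
};

#define DMA_BUF_CACHE_SYNC_TO_DEVICE    (1 << 0) /* clean after CPU writes */
#define DMA_BUF_CACHE_SYNC_FROM_DEVICE  (1 << 1) /* invalidate before CPU reads */

/* 'b' and 0 are placeholder magic/sequence numbers for this sketch. */
#define DMA_BUF_IOCTL_CACHE_SYNC _IOW('b', 0, struct dma_buf_cache_sync)

Userspace would then bracket its CPU accesses to the mmap'd buffer with
this ioctl on the dma-buf fd, so only the ranges actually touched get
maintained and no fault handling is involved.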

From the kernel side, dma_{map,unmap}_page() are what omapdrm uses for
cache maintenance.
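
Not omapdrm's actual code, but a minimal sketch of that idea using the
streaming DMA API; 'dev' and 'page' are assumed to come from whatever
driver exports the buffer:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* CPU is done writing: clean the CPU caches so the device (or remote
 * core) sees the data, and hand ownership of the page to the device. */
static int sketch_cpu_to_device(struct device *dev, struct page *page,
                                dma_addr_t *dma_addr)
{
        *dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        return dma_mapping_error(dev, *dma_addr) ? -ENOMEM : 0;
}

/* Device is done: invalidate the CPU caches so the CPU reads fresh
 * data rather than stale lines, and hand ownership back to the CPU. */
static void sketch_device_to_cpu(struct device *dev, dma_addr_t dma_addr)
{
        dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
}

If the same pages bounce back and forth every iteration,
dma_sync_single_for_device()/dma_sync_single_for_cpu() on an
already-mapped page avoid tearing the mapping down each time.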

BR,
-R

> Which Linux kernel function/API should I use to flush the cache for
> addresses modified by the A9, and to invalidate the addresses modified
> by the remote processors (M3 and DSP)?
>
> Thanks for your reply.
>
> --Kiran
>
>
> On Fri, Mar 28, 2014 at 6:06 AM, Sumit Semwal <sumit.semwal@linaro.org>
> wrote:
>>
>> Hi Kiran,
>>
>> On 27 March 2014 17:45, kiran <kiranchandramohan@gmail.com> wrote:
>> > Hello Sumit,
>> >
>> > Thanks for the reply.
>> >
>> > I am using the pandaboard. I am trying to partition data-parallel
>> > programs and run them in parallel on the ARM A9s, ARM M3s and DSP.
>> > All the processors have to work on the same arrays. Hence I allocate
>> > memory for these arrays using gembuffers and pass the physical
>> > address of these arrays to the remote
>> ^^^ Do you mean you use dri GEM buffers for non-dri usage? Again, I
>> think that's probably not recommended, but I'll let the DRI experts
>> comment on it (Daniel, Rob: care to comment?)
>> > processors using rpmsg. Since there might be cross-processor
>> > dependencies, I have to do some sort of cache coherency maintenance,
>> > which I am currently doing through the attachment-detachment
>> > mechanism. But this seems to be expensive.
>> >
>> > Documentation/dma-buf-sharing.txt says that
>> > "Because existing importing subsystems might presume coherent mappings
>> > for userspace, the exporter needs to set up a coherent mapping. If
>> > that's not possible, it needs to fake coherency by manually shooting
>> > down ptes when leaving the cpu domain and flushing caches at fault
>> > time. ..... If the above shootdown dance turns out to be too expensive
>> > in certain scenarios, we can extend dma-buf with a more explicit cache
>> > tracking scheme for userspace mappings. But the current assumption is
>> > that using mmap is always a slower path, so some inefficiencies should
>> > be acceptable".
>> >
>> > I believe the attachment-detachment mechanism shoots down PTEs and
>> > flushes caches to fake coherency. But I guess the PTEs do not have to
>> > be shot down
>> Are you relying on DRI PRIME's implementation as an exporter device
>> and assuming that its attachment-detachment mechanism actually does
>> this? AFAIK, DRI hasn't had a need to implement it, but I'd again let
>> Rob / Daniel comment on it.
>> > since they don't change during the program execution. Only the cache
>> > needs to be flushed.
>> >
>> > You say that, "To me, using attachment detachment mechanism to
>> > 'simulate' cache coherency sounds fairly wrong - Ideally, cache
>> > coherency management is to be done by the exporter in your system".
>> > Could you give some details on how the exporter (the A9 here) can
>> > simulate cache coherency, i.e. which Linux APIs or functions to use?
>> > If you can point me to a similar implementation in the Linux kernel,
>> > that would be very helpful.
>> I might be misunderstanding it, but from your statement above, it
>> seems you feel that the A9 is the 'exporter' referred to in
>> dma-buf-sharing.txt. That's not so; in fact, an exporter is any Linux
>> driver / device subsystem (currently DRI for gem buffers, V4L2, etc.)
>> that knows / wants to handle the allocation, coherency and mapping
>> aspects of the buffers. gembuffers are a part of the DRI framework,
>> which is intended for graphics rendering and display use cases.
>>
>> For the needs you stated above, which seem to be fairly non-graphics
>> in nature, I'd think implementing a dma-buf exporter is a good idea -
>> you could then decide when, or whether, you want to shoot down PTEs,
>> and handle a lot of other use-case-dependent things. You might also
>> want to check with the TI guys whether they already have some
>> implementation in place.
>>
>> Hope this helps,
>> Best regards,
>> ~Sumit.
>> > Thanks in advance.
>> >
>> > --Kiran
>> >
>> >
>> >
>> >
>> > On Thu, Mar 27, 2014 at 5:23 AM, Sumit Semwal <sumit.semwal@linaro.org>
>> > wrote:
>> >>
>> >> Hi Kiran,
>> >>
>> >> On 24 March 2014 02:35, kiran <kiranchandramohan@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I was looking at some code (given below) which seems to perform
>> >> > very badly when attachments and detachments are used to simulate
>> >> > cache coherency.
>> >> Care to give more details on your use case, which processor(s) you
>> >> are targeting, etc.? To me, using attachment detachment mechanism to
>> >> 'simulate' cache coherency sounds fairly wrong - Ideally, cache
>> >> coherency management is to be done by the exporter in your system. I
>> >> presume you have had a chance to read through the
>> >> Documentation/dma-buf-sharing.txt file, which gives some good details
>> >> on what to expect from dma-bufs etc?
>> >>
>> >> Best regards,
>> >> ~Sumit.
>> >>
>> >> > In the code below, when remote_attach is false (i.e. no remote
>> >> > processors), using just the two A9 cores, the following code runs
>> >> > in 8.8 seconds. But when remote_attach is true, even though other
>> >> > cores are also executing and sharing the workload, the following
>> >> > code takes 52.7 seconds. This shows that detach and attach are very
>> >> > heavy for this kind of code. (The detach system call performs
>> >> > dma_buf_unmap_attachment and dma_buf_detach; the attach system call
>> >> > performs dma_buf_attach and dma_buf_map_attachment.)
>> >> >
>> >> > for (k = 0; k < N; k++) {
>> >> >     if (remote_attach) {
>> >> >         detach(path);
>> >> >         attach(path);
>> >> >     }
>> >> >
>> >> >     for (i = start_indx; i < end_indx; i++) {
>> >> >         for (j = 0; j < N; j++) {
>> >> >             if (path[i][j] < (path[i][k] + path[k][j])) {
>> >> >                 path[i][j] = path[i][k] + path[k][j];
>> >> >             }
>> >> >         }
>> >> >     }
>> >> > }
>> >> >
>> >> > I would like to manage the cache explicitly and flush cache lines
>> >> > rather than pages to reduce overhead. I also want to access these
>> >> > buffers from userspace. I can change some kernel code for this.
>> >> > Where should I start?
>> >> >
>> >> > Thanks in advance.
>> >> >
>> >> > --Kiran
>> >> >
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > Linaro-mm-sig mailing list
>> >> > Linaro-mm-sig@lists.linaro.org
>> >> > http://lists.linaro.org/mailman/listinfo/linaro-mm-sig
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks and regards,
>> >>
>> >> Sumit Semwal
>> >> Graphics Engineer - Graphics working group
>> >> Linaro.org │ Open source software for ARM SoCs
>> >
>> >
>> >
>> >
>> > --
>> > regards,
>> > Kiran C
>>
>>
>>
>> --
>> Thanks and regards,
>>
>> Sumit Semwal
>> Graphics Engineer - Graphics working group
>> Linaro.org │ Open source software for ARM SoCs
>
>
>
>
> --
> regards,
> Kiran C



--
regards,
Kiran C