Hi,
I was looking at some code (given below) which seems to perform very badly when attachments and detachments are used to simulate cache coherency. In the code below, when remote_attach is false (i.e., no remote processors), using just the two A9 cores the code runs in 8.8 seconds. But when remote_attach is true, then even though other cores are also executing and sharing the workload, the code takes 52.7 seconds. This shows that detach and attach are very heavy for this kind of code. (The detach system call performs dma_buf_unmap_attachment and dma_buf_detach; the attach system call performs dma_buf_attach and dma_buf_map_attachment.)
for (k = 0; k < N; k++) {
    if (remote_attach) {
        detach(path);
        attach(path);
    }
    for (i = start_indx; i < end_indx; i++) {
        for (j = 0; j < N; j++) {
            if (path[i][j] < (path[i][k] + path[k][j])) {
                path[i][j] = path[i][k] + path[k][j];
            }
        }
    }
}
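(For reference, the kernel side of those two system calls boils down to the sequence below. This is only a sketch: the wrapper names and the small state struct are mine, only the dma_buf_* calls are the real API.)

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

/* Hypothetical per-buffer state kept by the syscall implementation. */
struct shared_buf {
    struct dma_buf *buf;
    struct dma_buf_attachment *attach;
    struct sg_table *sgt;
};

/* "attach": dma_buf_attach + dma_buf_map_attachment */
static int shared_buf_attach(struct shared_buf *sb, struct device *dev)
{
    sb->attach = dma_buf_attach(sb->buf, dev);
    if (IS_ERR(sb->attach))
        return PTR_ERR(sb->attach);

    sb->sgt = dma_buf_map_attachment(sb->attach, DMA_BIDIRECTIONAL);
    if (IS_ERR(sb->sgt)) {
        dma_buf_detach(sb->buf, sb->attach);
        return PTR_ERR(sb->sgt);
    }
    return 0;
}

/* "detach": dma_buf_unmap_attachment + dma_buf_detach */
static void shared_buf_detach(struct shared_buf *sb)
{
    dma_buf_unmap_attachment(sb->attach, sb->sgt, DMA_BIDIRECTIONAL);
    dma_buf_detach(sb->buf, sb->attach);
}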
I would like to manage the cache explicitly and flush cache lines rather than whole pages, to reduce the overhead. I also want to access these buffers from userspace. I can change some kernel code for this. Where should I start?
Thanks in advance.
--Kiran
Hi Kiran,
On 24 March 2014 02:35, kiran kiranchandramohan@gmail.com wrote:
> Hi,
> I was looking at some code (given below) which seems to perform very badly when attachments and detachments are used to simulate cache coherency.
Care to give more details on your use case, which processor(s) you are targeting, etc.? To me, using attachment detachment mechanism to 'simulate' cache coherency sounds fairly wrong - Ideally, cache coherency management is to be done by the exporter in your system. I presume you have had a chance to read through the Documentation/dma-buf-sharing.txt file, which gives some good details on what to expect from dma-bufs?
Best regards, ~Sumit.
Hello Sumit,
Thanks for the reply.
I am using the PandaBoard. I am trying to partition data-parallel programs and run them in parallel on the ARM A9s, the ARM M3s and the DSP. All the processors have to work on the same arrays, so I allocate memory for these arrays using GEM buffers and pass the physical addresses of the arrays to the remote processors using rpmsg. Since there can be cross-processor dependencies, I have to do some sort of cache coherency management, which I am currently doing through the attachment-detachment mechanism. But this seems to be expensive.
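(Roughly what the announcement to a remote core looks like; a sketch only. The message layout is something my kernel side and the M3/DSP firmware have to agree on, how the physical/DMA address of the array is obtained is exporter-specific, and the type of rpmsg_send()'s first argument depends on the kernel version - older kernels take a struct rpmsg_channel, newer ones a struct rpmsg_endpoint.)

#include <linux/rpmsg.h>
#include <linux/types.h>

/* Hypothetical message layout agreed with the M3/DSP firmware. */
struct buf_announce {
    u32 paddr;   /* physical address of the shared array */
    u32 size;    /* size of the shared array in bytes */
};

static int announce_buffer(struct rpmsg_channel *rpdev, u32 paddr, u32 size)
{
    struct buf_announce msg = {
        .paddr = paddr,
        .size  = size,
    };

    /* Blocks until the remote processor has a free rpmsg buffer. */
    return rpmsg_send(rpdev, &msg, sizeof(msg));
}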
Documentation/dma-buf-sharing.txt says that "Because existing importing subsystems might presume coherent mappings for userspace, the exporter needs to set up a coherent mapping. If that's not possible, it needs to fake coherency by manually shooting down ptes when leaving the cpu domain and flushing caches at fault time. [...] If the above shootdown dance turns out to be too expensive in certain scenarios, we can extend dma-buf with a more explicit cache tracking scheme for userspace mappings. But the current assumption is that using mmap is always a slower path, so some inefficiencies should be acceptable".
I believe the attachment-detachment mechanism shoots down PTEs and flushes caches to fake coherency. But I guess the PTEs do not have to be shot down, since they don't change during program execution; only the cache needs to be flushed.
You say that, "To me, using attachment detachment mechanism to 'simulate' cache coherency sounds fairly wrong - Ideally, cache coherency management is to be done by the exporter in your system". Could you give some details on how the exporter(A9 here) can simulate cache coherency. The linux API or functions to use. If you can point me to some similar implementation in the linux kernel that would be very helpful.
Thanks in advance.
--Kiran
Hi Kiran,
On 27 March 2014 17:45, kiran kiranchandramohan@gmail.com wrote:
> I am using the PandaBoard. I am trying to partition data-parallel programs and run them in parallel on the ARM A9s, the ARM M3s and the DSP. All the processors have to work on the same arrays, so I allocate memory for these arrays using GEM buffers and pass the physical addresses of the arrays to the remote
^^^ Do you mean you use dri GEM buffers for non-dri usage? Again, I think that's probably not recommended, but I'll let the DRI experts comment on it (Daniel, Rob: care to comment?)
> processors using rpmsg. Since there can be cross-processor dependencies, I have to do some sort of cache coherency management, which I am currently doing through the attachment-detachment mechanism. But this seems to be expensive.
> [...]
> I believe the attachment-detachment mechanism shoots down PTEs and flushes caches to fake coherency. But I guess the PTEs do not have to be shot down
Are you relying on the DRI PRIME implementation as the exporter device, and assuming that its attachment-detachment mechanism is actually doing this? AFAIK, DRI hasn't had a need to implement this, but I'd again let Rob / Daniel comment on it.
> since they don't change during program execution; only the cache needs to be flushed.
> You say that, "To me, using attachment detachment mechanism to 'simulate' cache coherency sounds fairly wrong - Ideally, cache coherency management is to be done by the exporter in your system". Could you give some details on how the exporter (the A9 here) can manage cache coherency, i.e., which Linux APIs or functions to use? If you can point me to a similar implementation in the Linux kernel, that would be very helpful.
I might be misunderstanding it, but from your statement above it seems you feel that the A9 is the 'exporter' referred to in dma-buf-sharing.txt. That's not so; in fact, an exporter is any Linux driver / device subsystem (currently DRI for GEM buffers, V4L2, ...) that knows / wants to handle the allocation, coherency and mapping aspects of the buffers. GEM buffers are part of the DRI framework, which is intended for graphics rendering and display use cases.
For the needs you stated above, which seem fairly non-graphics in nature, I'd think implementing a dma-buf exporter is a good idea - you could then handle when (or whether) you want to shoot down ptes, and a lot of other use-case-dependent things. You might also want to check with the TI guys whether they have some implementation already in place?
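(To make that concrete, a bare-bones sketch of the exporter side. Everything here is hypothetical scaffolding except the dma_buf_ops callback names and the DMA API calls; the set of mandatory callbacks and the dma_buf_export() signature vary between kernel versions, so treat this as the shape of an exporter rather than working code.)

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

/* Hypothetical per-buffer bookkeeping kept by the exporter; the sg_table is
 * assumed to have been built from the backing pages at allocation time, and
 * for simplicity a single attachment is assumed (a real exporter would
 * normally hand out a copy of the table per attachment).
 */
struct my_buffer {
    struct sg_table sgt;
};

static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                       enum dma_data_direction dir)
{
    struct my_buffer *mb = attach->dmabuf->priv;

    /* On ARM, dma_map_sg() does the CPU cache maintenance needed before
     * the attached device (M3/DSP) touches the memory. */
    if (!dma_map_sg(attach->dev, mb->sgt.sgl, mb->sgt.nents, dir))
        return ERR_PTR(-ENOMEM);

    return &mb->sgt;
}

static void my_unmap_dma_buf(struct dma_buf_attachment *attach,
                             struct sg_table *sgt,
                             enum dma_data_direction dir)
{
    /* ...and dma_unmap_sg() the maintenance needed before the CPU looks at
     * data the device may have written. */
    dma_unmap_sg(attach->dev, sgt->sgl, sgt->nents, dir);
}

static void my_release(struct dma_buf *dmabuf)
{
    /* Free the backing pages and the sg_table here. */
}

static const struct dma_buf_ops my_dmabuf_ops = {
    .map_dma_buf   = my_map_dma_buf,
    .unmap_dma_buf = my_unmap_dma_buf,
    .release       = my_release,
    /* Depending on the kernel version, mmap / kmap / begin_cpu_access
     * callbacks are also expected; omitted in this sketch. */
};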
Hope this helps, Best regards, ~Sumit.
Hello Sumit,
Sorry, I might have misled you; my mistake saying the A9 is the exporter.
Yes, I am using DRI GEM buffers here, and it was Rob Clark's suggestion that I use them. He suggested DRI GEM buffers probably because that was the easiest way to get started, and this was over a year ago when my use case was much simpler. With the DRI GEM buffers a lot of things came for free, like (contiguous?) memory allocation from the kernel, a 1:1 mapping of virtual to physical addresses on the remote processors, mmap for user space, the dma-buf implementation, etc.
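(For what it's worth, a generic userspace sketch of that "for free" path using the standard dumb-buffer and PRIME ioctls; the actual omapdrm path goes through driver-specific GEM ioctls, and the buffer dimensions, device node and include paths here are assumptions. On 32-bit, build with -D_FILE_OFFSET_BITS=64 so the mmap offset fits.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <drm/drm.h>

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Allocate a CPU-mappable "dumb" GEM buffer. */
    struct drm_mode_create_dumb create = { .width = 1024, .height = 1024, .bpp = 32 };
    if (ioctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &create)) { perror("create"); return 1; }

    /* Ask for an mmap offset and map the buffer into userspace. */
    struct drm_mode_map_dumb map = { .handle = create.handle };
    if (ioctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &map)) { perror("map"); return 1; }

    void *ptr = mmap(NULL, create.size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, map.offset);
    if (ptr == MAP_FAILED) { perror("mmap"); return 1; }
    memset(ptr, 0, create.size);

    /* Export the same buffer as a dma-buf fd that other devices can import. */
    struct drm_prime_handle prime = { .handle = create.handle, .flags = DRM_CLOEXEC };
    if (ioctl(fd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &prime)) { perror("prime"); return 1; }
    printf("dma-buf fd = %d\n", prime.fd);

    munmap(ptr, create.size);
    close(fd);
    return 0;
}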
Yes, I am using DRI's implementation as the attachment-detachment mechanism. I understand that DRI didn't have a need to implement this, but what I was saying is that I need to implement it. I don't know whether it is easier to implement another dma-buf exporter or to hack DRI's implementation for the same.
One of the reasons I asked this question is that you say in Documentation/dma-buf-sharing.txt that dma-buf could be extended with a more explicit cache tracking scheme for userspace mappings. I am asking for this explicit cache tracking scheme for userspace mappings. How would you have gone about implementing it? "If the above shootdown dance turns out to be too expensive in certain scenarios, we can extend dma-buf with a more explicit cache tracking scheme for userspace mappings. But the current assumption is that using mmap is always a slower path, so some inefficiencies should be acceptable."
I think what currently happens is that, on an mmap page fault, the driver marks the pages that are accessed, and during map attachment it calls unmap_mapping_range to shoot down the PTEs and flush the cache(?). Sources are in the files given below. http://lxr.free-electrons.com/source/drivers/gpu/drm/omapdrm/omap_gem_dmabuf... http://lxr.free-electrons.com/source/drivers/gpu/drm/omapdrm/omap_gem.c
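(Illustrative only, not the actual omapdrm code: the shoot-down half of that dance is essentially one call, after which the next CPU access faults back into the driver.)

#include <linux/mm.h>

/* Zap every userspace PTE covering [0, size) of this file mapping; the
 * final '1' means private COW mappings are zapped too. The next CPU access
 * then takes a fault, which is how the driver learns the CPU is touching
 * the buffer again.
 */
static void zap_user_mappings(struct address_space *mapping, size_t size)
{
    unmap_mapping_range(mapping, 0, size, 1);
}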
Which Linux kernel function/API should I use to flush the cache for addresses modified by the A9, and to invalidate the cache for addresses modified by the remote processors (M3 and DSP)?
Thanks for your reply.
--Kiran
On Fri, Mar 28, 2014 at 7:16 AM, kiran kiranchandramohan@gmail.com wrote:
> [...]
> I am asking for this explicit cache tracking scheme for userspace mappings. How would you have gone about implementing it?
> I think what currently happens is that, on an mmap page fault, the driver marks the pages that are accessed, and during map attachment it calls unmap_mapping_range to shoot down the PTEs and flush the cache(?).
The PTE shootdown is so we hit the fault handler, so that we know when the CPU touches the buffer again. I suspect what you want is some API to userspace to explicitly flush buffers (or portions of buffers), to avoid the performance overhead of page-table shootdown and fault handling.
From the kernel side, dma_{map,unmap}_page() are what omapdrm uses for cache maintenance.
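(A rough sketch of how those two suggestions could fit together - a per-range flush hook, e.g. behind an ioctl, built on the DMA API. The struct, the function names and the ioctl idea are hypothetical; only the dma_* calls are the real API, and for simplicity the buffer is assumed physically contiguous and mapped once for its lifetime.)

#include <linux/dma-mapping.h>

/* Hypothetical per-buffer state. */
struct shared_region {
    struct device *dev;      /* the importing device (M3/DSP side) */
    struct page *page;       /* first page of the contiguous buffer */
    size_t size;
    dma_addr_t dma_addr;
};

static int shared_region_map(struct shared_region *r)
{
    /* Does the initial cache maintenance for the whole buffer. */
    r->dma_addr = dma_map_page(r->dev, r->page, 0, r->size, DMA_BIDIRECTIONAL);
    return dma_mapping_error(r->dev, r->dma_addr);
}

/* What a hypothetical "flush this range" ioctl could call before a sub-range
 * is handed to the remote core (clean + invalidate on ARM)... */
static void sync_range_for_device(struct shared_region *r, size_t off, size_t len)
{
    dma_sync_single_for_device(r->dev, r->dma_addr + off, len, DMA_BIDIRECTIONAL);
}

/* ...and before the CPU reads back a sub-range the remote core wrote
 * (invalidate on ARM). */
static void sync_range_for_cpu(struct shared_region *r, size_t off, size_t len)
{
    dma_sync_single_for_cpu(r->dev, r->dma_addr + off, len, DMA_BIDIRECTIONAL);
}

static void shared_region_unmap(struct shared_region *r)
{
    dma_unmap_page(r->dev, r->dma_addr, r->size, DMA_BIDIRECTIONAL);
}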
BR, -R
Thanks Rob. I will try performing cache maintenance with dma_{map,unmap}_page().
--Kiran