On Fri, Mar 16, 2012 at 12:24 PM, Tom Cooksey <tom.cooksey@arm.com> wrote:
From: Rob Clark <rob@ti.com>
Enable optional userspace access to dma-buf buffers via mmap() on the dma-buf file descriptor. Userspace access to the buffer should be bracketed with DMA_BUF_IOCTL_{PREPARE,FINISH}_ACCESS ioctl calls to give the exporting driver a chance to deal with cache synchronization and the like for cached userspace mappings, without resorting to page-faulting tricks. The reasoning behind this is that, while DRM drivers tend to have all the mechanisms in place for dealing with page-faulting tricks, other driver subsystems may not. In addition, while page-faulting tricks make userspace simpler, they carry some overhead.
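To make the intended usage concrete, here is a minimal userspace sketch of the bracketing described above. The ioctl numbers below are placeholders for illustration only, not the definitions from the patch:

    /*
     * Minimal sketch, assuming the prepare/finish ioctls end up as simple
     * no-argument ioctls on the dma-buf fd.  Placeholder ioctl numbers.
     */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/ioctl.h>

    #define DMA_BUF_IOCTL_PREPARE_ACCESS  _IO('b', 0)   /* placeholder */
    #define DMA_BUF_IOCTL_FINISH_ACCESS   _IO('b', 1)   /* placeholder */

    static int cpu_fill(int dmabuf_fd, size_t size)
    {
            void *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, dmabuf_fd, 0);
            if (vaddr == MAP_FAILED)
                    return -1;

            /* Let the exporter prepare caches for CPU access. */
            ioctl(dmabuf_fd, DMA_BUF_IOCTL_PREPARE_ACCESS);

            memset(vaddr, 0, size);         /* CPU touches the buffer */

            /* Hand the buffer back; the exporter flushes caches as needed. */
            ioctl(dmabuf_fd, DMA_BUF_IOCTL_FINISH_ACCESS);

            munmap(vaddr, size);
            return 0;
    }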
Speaking from the ARM Mali T6xx driver's point of view, this API looks good to us. Our use-case for mmap is glReadPixels and glTex[Sub]Image2D on buffers the driver has imported via dma_buf. In the case of glReadPixels, the finish ioctl isn't strictly necessary as the CPU won't have written to the buffer, so it doesn't need flushing. As such, we'd get an additional cache flush which isn't really necessary. But hey, it's glReadPixels - it's supposed to be slow. :-)
I think requiring the finish ioctl in the API contract is a good idea, even if the CPU has only done a read-only access, as it allows future enhancements*. To "fix" the unnecessary flush in glReadPixels, I think we'd like to keep the finish but see an "access type" parameter added to the prepare ioctl, indicating whether the access is read-only or read-write, so the cache flush in finish can be skipped if the access was read-only. As Rebecca says, a debug feature could even be added to re-map the pages as read-only in prepare(ro) to catch naughty accesses. I'd also go as far as to say the debug feature should completely unmap the pages after finish too. For us, though, both the access-type parameter and the debug features are "nice to haves" - we can make progress with the code as it currently stands (assuming exporters start using the API, that is).
Perhaps it isn't a bad idea to include an access-type bitmask in the first version. It would help optimize the cache operations a bit.
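A hypothetical sketch of what such a bitmask could look like (all names below are invented for illustration; nothing here comes from the patch):

    #include <sys/ioctl.h>
    #include <linux/ioctl.h>
    #include <linux/types.h>

    #define DMA_BUF_ACCESS_READ   (1u << 0)
    #define DMA_BUF_ACCESS_WRITE  (1u << 1)

    struct dma_buf_prepare_access {
            __u32 flags;    /* DMA_BUF_ACCESS_READ and/or _WRITE */
    };

    #define DMA_BUF_IOCTL_PREPARE_ACCESS \
            _IOW('b', 0, struct dma_buf_prepare_access)     /* placeholder */

    static int prepare_read_only(int dmabuf_fd)
    {
            struct dma_buf_prepare_access req = {
                    .flags = DMA_BUF_ACCESS_READ,
            };

            /* The exporter records that no CPU writes will happen, so the
             * matching finish ioctl can skip the flush (glReadPixels case). */
            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_PREPARE_ACCESS, &req);
    }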
Something which also came up when discussing internally is the topic of mmap APIs of the importing device driver. For example, I believe DRM has an mmap API on GEM buffer objects. If a new dma_buf import ioctl was added to GEM (maybe the PRIME patches already add this), how would GEM's bo mmap API work?
My first thought is that maybe we should just disallow this for now, until we have a chance to see if there are any possible issues with an importer mmap'ing the buffer to userspace. We could possibly have a helper dma_buf_mmap() function which in turn calls dmabuf->ops->mmap(), so the mmap'ing is actually performed by the exporter on behalf of the importer.
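A rough sketch of what such a helper could look like, with the argument list and the ops->mmap callback assumed from the description above rather than taken from any actual interface:

    #include <linux/dma-buf.h>
    #include <linux/errno.h>
    #include <linux/mm.h>

    /* Importer's mmap handler just forwards to the exporter, so the mapping
     * is always set up by the exporting driver. */
    int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
                     unsigned long pgoff)
    {
            if (!dmabuf || !dmabuf->ops || !dmabuf->ops->mmap)
                    return -EINVAL;

            /* Re-base the mapping onto the dma-buf and let the exporter
             * install the actual pages. */
            vma->vm_pgoff = pgoff;
            return dmabuf->ops->mmap(dmabuf, vma);
    }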
* Future enhancements: The prepare/finish bracketing could be used
as part of a wider synchronization scheme with other devices. E.g. if another device was writing to the buffer, the prepare ioctl could block until that device had finished accessing that buffer. In the same way, another device could be blocked from accessing that buffer until the client process called finish. We have already started playing with such a scheme in the T6xx driver stack, which we're terming the "kernel dependency system". In this scheme each buffer has a FIFO of "buffer consumers" waiting to access it. The idea is that a "buffer consumer" is fairly abstract and could be any device or userspace process participating in the synchronization scheme. Examples would be GPU jobs, display controller "scan-out" jobs, etc.
So for example, a userspace application could dispatch into the GPU's kernel driver a fragment shading job which will write to a KMS scan-out buffer. The application then immediately issues a drm_mode_crtc_page_flip ioctl on the display controller's DRM driver to display the soon-to-be-rendered buffer. Inside the kernel, the GPU driver adds the fragment job to the dma_buf's FIFO. As the FIFO was empty, dma_buf calls into the GPU kernel driver to tell it that it now "owns" access to the buffer, and the GPU driver schedules the job to run on the GPU. Upon receiving the drm_mode_crtc_page_flip ioctl, the DRM/KMS driver adds a scan-out job to the buffer's FIFO. However, the FIFO already has the GPU's fragment shading job in it, so nothing happens until the GPU job completes. When it does, the GPU driver calls into dma_buf to mark its job complete. dma_buf then takes the next job in its FIFO, which is the KMS driver's scan-out job, and calls into the KMS driver to schedule the page flip. The result? A buffer gets scanned out as soon as it has finished being rendered, without needing a round-trip to userspace. Sure, there are easier ways to achieve that goal, but the idea is that the mechanism can be used to synchronize access across multiple devices, which makes it useful for lots of other use-cases too.
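To make the FIFO idea a bit more concrete, here is a very rough sketch of the kind of per-buffer queue described above. All names are invented for illustration; the real T6xx code is not public and any eventual mainline design may differ completely:

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct buffer_consumer {
            struct list_head link;
            /* Called when this consumer reaches the head of the FIFO and may
             * start using the buffer (e.g. schedule a GPU job or a page flip). */
            void (*start)(struct buffer_consumer *c);
    };

    struct buffer_sync_queue {
            spinlock_t lock;                /* spin_lock_init() by the owner */
            struct list_head fifo;          /* INIT_LIST_HEAD() by the owner */
    };

    /* Queue a consumer; if the FIFO was empty it owns the buffer at once. */
    static void sync_queue_add(struct buffer_sync_queue *q,
                               struct buffer_consumer *c)
    {
            bool run;

            spin_lock(&q->lock);
            run = list_empty(&q->fifo);
            list_add_tail(&c->link, &q->fifo);
            spin_unlock(&q->lock);

            if (run)
                    c->start(c);
    }

    /* Called by the current owner when its job completes; hands the buffer to
     * the next waiter, e.g. the KMS scan-out job queued behind a GPU job. */
    static void sync_queue_done(struct buffer_sync_queue *q,
                                struct buffer_consumer *c)
    {
            struct buffer_consumer *next = NULL;

            spin_lock(&q->lock);
            list_del(&c->link);
            if (!list_empty(&q->fifo))
                    next = list_first_entry(&q->fifo,
                                            struct buffer_consumer, link);
            spin_unlock(&q->lock);

            if (next)
                    next->start(next);
    }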
As I say, we have already implemented something which works as I describe, but where the buffers are abstract resources not linked to dma_buf. I'd like to discuss the finer points of the mechanisms further, but if it's looking like there's interest in this approach we'll start re-writing the code we have to sit on top of dma_buf and posting it as RFCs to the various lists. The intention is to get this to mainline, if mainline wants it. :-)
I think we do need some sort of 'sync object' (which might really just be a 'synchronization queue' object) in the kernel. It probably shouldn't be built into dma-buf, but I expect we'd want the dma_buf struct to have a 'struct sync_queue *' (or whatever it ends up being called).
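As a sketch only, that suggestion might look something like this ('struct sync_queue' is hypothetical, and the existing dma_buf fields are elided):

    struct sync_queue;                      /* defined outside dma-buf */

    struct dma_buf {
            /* ... existing fields: file, ops, size, priv, attachments ... */
            struct sync_queue *sync_queue;  /* optional; NULL when unused */
    };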
The sync-queue seems like a reasonable approach for pure CPU/software-based synchronization. The only thing I'm not sure about is how to also deal with hardware that supports some sort of automatic synchronization without CPU/software involvement.
BR, -R
Personally, what I particularly like about this approach to synchronization is that it doesn't require any interfaces to be modified. I think/hope that makes it easier to port existing drivers and sub-systems to take advantage of it. The buffer itself is the synchronization object, and interfaces already pass buffers around, so they don't need modification. There are of course some limitations with this approach, the main one we can think of being that it can't really be used for A/V sync. It kinda assumes "jobs" in the FIFO should be run as soon as the preceding job completes, which isn't the case when streaming real-time video. Though nothing precludes more explicit sync objects being used in conjunction with this approach.
Cheers,
Tom