On Sat, Sep 10, 2011 at 6:45 AM, Daniel Vetter daniel@ffwll.ch wrote:
On Fri, Sep 09, 2011 at 06:36:23PM -0500, Clark, Rob wrote:
On Fri, Sep 9, 2011 at 1:43 PM, Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Sep 08, 2011 at 05:56:19PM -0500, Clark, Rob wrote:
On Thu, Sep 8, 2011 at 3:28 PM, Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Sep 08, 2011 at 12:39:41PM -0500, Clark, Rob wrote:
On Thu, Sep 8, 2011 at 10:55 AM, Daniel Vetter daniel@ffwll.ch wrote:
> On Thu, Sep 08, 2011 at 08:45:43PM +0530, Sumit Semwal wrote:
>> On 8 September 2011 19:48, Clark, Rob rob@ti.com wrote:
>> > actually, all I think we needed was to add 'struct device *' to
>> > get_scatterlist (and maybe doesn't hurt for put).. I'm not entirely
>> > sure why we need create/attach_device? The create fxn especially
>> > seems wrong.. what if you are the 2nd device to import the buffer?
>> >
>> Ah... I guess I misunderstood your 'it seems like a sane approach' to mean
>> you agree with the 2-step process suggested by Daniel :) - my bad!
>
> On re-reading Rob's response I think there's a misunderstanding. I don't
> want a create function in the dma_buf_ops. The create would be a generic
> dma_buf subsystem function that (like in the patch) allocates the struct
> and properly registers the fd and what not else. The attach_device would
> then be the only new function in dma_buf_ops.
>
> The confusion probably stems from my other proposal to hide these function
> pointers behind small inline helpers for the in-kernel api.
I guess/assume that what you are trying to get to is a way to differentiate between "I am done DMA'ing to the buffer now, but will likely use it again (ie. feel free to cache the mapping)", and "I am done forever w/ the buffer (don't bother caching)"?
Hm, no, I haven't really thought much about the release side. dri2/gem doesn't have any automatic caching of shared buffers, btw, that's all done manually (i.e. in the higher levels like the mesa/ddx drivers) by keeping track of recently shared buffers (and hanging onto them for as long as possible without creating havoc on the other side).
ok, then I've lost sync.. (ok, I admit to only having time to read email for 5 min here and there, which might be contributing to the problem)..
what was the purpose of attach() then (which couldn't be accomplished in the first get_scatterlist())?
Ok, I'll try to explain this with a use-case we've discussed: a video input device, some yuv plane scanout engine and an mpeg encoder, i.e. the video-shooting use-case. The mpeg encoder is optional and only engaged when the user presses the record button.
Now for live-view, userspace grabs some dma_buf ids (i.e. fds) from the v4l device and then creates a bunch of yuv planes at the scanout device with these. For this case you're right that we could arbitrate the memory layout at the get_scatterlist call of the scanout device (the other user, the frame-grabber, is already known because the dma_buf originated there).
But if we throw a third device into the mix, this won't work. At some point (and it doesn't matter how much we delay this till the complete pipeline is set up), either the scanout device or the encoder has to call get_scatterlist before the other one. And as soon as the allocator (respectively the originator) has handed out a scatterlist (either in system address space or device address space), it cannot move the backing storage anymore. If the other device then calls get_scatterlist and has more stringent buffer placement requirements, things stop working.
Now my idea is that when userspace establishes the shared buffer (i.e. subsystems convert the dma_buf fd number into a subsystem specific id), the devices taking part in this buffer sharing only call attach_device, but not get_scatterlist.
After this has taken place for all devices, userspace starts the pipeline. Kernel drivers only then call get_scatterlist and so force the allocation of the backing storage. And by then all participating devices are known, so the right placement can be chosen for the buffer.
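To make that ordering concrete, here's a rough sketch of what the exporter side could look like under this scheme; every name in it (backing_storage, attachments, device_placement_constraints(), alloc_backing(), map_to_device()) is made up purely for illustration:

/*
 * Illustrative sketch only: allocation of the backing storage is
 * deferred until the first get_scatterlist() call, by which point
 * every participating device has already attached, so the exporter
 * can pick a placement that satisfies all of them.
 */
struct scatterlist *example_get_scatterlist(struct dma_buf *buf,
                                            struct device *dev)
{
    struct dma_buf_attachment *a;
    unsigned long constraints = ~0UL;

    if (!buf->backing_storage) {
        /* intersect the placement constraints of every attached device */
        list_for_each_entry(a, &buf->attachments, node)
            constraints &= device_placement_constraints(a->dev);

        buf->backing_storage = alloc_backing(buf->size, constraints);
    }

    /* hand out the buffer in the requesting device's address space */
    return map_to_device(buf->backing_storage, dev);
}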
My one concern is that, depending on the app/framework above, it may not be possible to know all the involved devices at the start.. or at least this would require more extensive userspace changes, not just userspace framework (like GStreamer) changes..
But how about this.. I think this approach ensures that there will be no later buffer migration when all the devices are known up front, and at most one migration otherwise. (This is all assuming the devices can physically share the same buffer.)
struct dma_buf_attachment *
dma_buf_attach(struct dma_buf *buf, struct device *dev);
    --> establish intent to use the buffer w/ a particular device

for (... # of times buffer is used ...) {

    struct scatterlist *
    dma_buf_get_scatterlist(struct dma_buf_attachment *attach);
        --> indicate begin of DMA; the first call might actually back
            the buffer w/ pages or migrate the buffer

    void
    dma_buf_put_scatterlist(struct dma_buf_attachment *attach,
                            struct scatterlist *sg);
        --> indicate end of DMA

}

void
dma_buf_detach(struct dma_buf_attachment *attach);
    --> indicate the buffer is no longer to be used by the importing device
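Putting it together, roughly how an importing driver would drive this sequence (just a sketch; nr_frames and do_dma() are placeholders for whatever the driver actually does):

/* illustrative importer-side usage of the API above */
struct dma_buf_attachment *attach;
struct scatterlist *sg;
int i;

attach = dma_buf_attach(buf, dev);        /* declare intent; no backing yet */

for (i = 0; i < nr_frames; i++) {
    /* the first call may allocate or migrate the backing storage */
    sg = dma_buf_get_scatterlist(attach);

    do_dma(dev, sg);                      /* program the actual transfer */

    dma_buf_put_scatterlist(attach, sg);  /* DMA for this frame is done */
}

dma_buf_detach(attach);                   /* done with the buffer for good */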
I fear we are in violent agreement here ;-) ...
oh, damn, that is what I was afraid of.. j/k ;-)
this is essentially what I was aiming for, minus the rather nice polish you've added in the form of struct dma_buf_attachment.
with this sort of approach, if a new device is attached after the first get_scatterlist, the buffer can, if needed, be migrated using the union of all the devices' requirements at a point in time when no DMA is active to/from the buffer. But if all the devices are known up front, then you never need to migrate unnecessarily.
Well, the problem is with devices that hang onto mappings for way too long, so just waiting for all dma to finish to be able to fix up the buffer placement is a no-go. But I think we can postpone that issue a bit, especially since the drivers that tend to do this (gpus) can also evict objects willy-nilly, so that should be fixable with some explicit kill_your_mappings callback attached to dma_buf_attachment (or full-blown sync objects à la ttm).
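Something like this, perhaps, just to sketch the idea (the callback name and the struct layout here are completely made up at this point; nothing like it exists in the posted patches):

/*
 * Purely hypothetical: lets the exporter ask an importer to drop its
 * cached mapping so the backing storage can be migrated.
 */
struct dma_buf_attachment {
    struct dma_buf *buf;
    struct device *dev;

    /*
     * Called by the exporter before migrating the backing storage;
     * the importer must stop using its old scatterlist and call
     * get_scatterlist() again before the next DMA.
     */
    void (*kill_your_mappings)(struct dma_buf_attachment *attach);
};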
I'm ok if the weird fallback cases aren't fast.. I just don't want things to explode catastrophically in weird cases.
I guess in the GPU / deep pipeline case, you can at least set up to get an interrupt back when the GPU is done with some surface (ie. when it gets to a certain point in the command-stream)? I think it is ok if things stall in this case until the GPU pipeline is drained (and if you are targeting 60fps, that is probably still faster than the video, which is likely at 30fps). Again, this is just for the cases where userspace doesn't do what we want, to avoid complete failure..
If the GPU is the one importing the dmabuf, it just calls put_scatterlist() once it gets the interrupt back from the GPU. If the GPU is the one exporting the dmabuf, then get_scatterlist() just blocks until the exporting driver gets the interrupt from the GPU. (Well, I guess then do you need a get_scatterlist_interruptible()?)
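i.e. something along these lines on the exporter side, sketched with the stock wait_event_interruptible() helper; the busy flag, the wait queue and map_to_device() are invented for the example:

/*
 * Sketch of a blocking/interruptible variant: the exporting GPU
 * driver waits for its completion interrupt before handing out the
 * scatterlist.  'busy' and 'dma_wq' are made-up fields.
 */
struct scatterlist *
example_get_scatterlist_interruptible(struct dma_buf_attachment *attach)
{
    struct dma_buf *buf = attach->buf;
    int ret;

    /* sleep until the irq handler clears busy and wakes us up;
     * bail out with -ERESTARTSYS if a signal arrives first */
    ret = wait_event_interruptible(buf->dma_wq, !buf->busy);
    if (ret)
        return ERR_PTR(ret);

    return map_to_device(buf->backing_storage, attach->dev);
}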
This way, if userspace is able to ensure all devices are known upfront, everything is great. But if not, things don't explode and you aren't forced to reallocate buffers in userspace or anything ugly like this.
Optionally, perhaps attach could be combined w/ dma_buf_get and detach w/ dma_buf_put.. (although then they probably wouldn't return a 'struct dma_buf *'.. not sure if that is weird). But I guess get+attach and detach+put would always go together, and this saves an extra call. I don't have (yet) a strong opinion one way or another on that.
Hm, maybe a separate attach makes sense if e.g. the userspace abstraction doesn't align with the kernel struct device layout. E.g. you create a gem object out of a dma_buf (attaching it to the gpu struct device) and later a kms scanout buffer out of the gem object (attaching the same dma_buf to the display subsystem). But for the common case of a simple v4l device or a simple encoder, a combined get+attach and put+detach certainly makes sense.
ok, let's keep 'em separate then
BR, -R
Cheers, Daniel
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48