Hey,
Due to inertia, I thought I would take a shot at implicit synchronization as well. I have just barely enough to make it work for nouveau to synchronize with itself now using the cpu. Hopefully the general idea is correct but I feel the implementation wrong.
There are 2 ways to get deadlocks if no proper care is taken to avoid it, the first being 2 tasks taking each device's lock in a different order, the second 2 devices waiting on completion of each other before starting own work. The easiest way to avoid this is to introduce a global dma_buf_submit_mutex so in cases where synchronization is needed. This way only 1 submission involving dma-buffer synchronisation can be made simultaneously. This will make it impossible to deadlock because even if you take dma mutex->dev a mutex->dev b mutex, and swap a and b, the dmabuf mutex would prevent this from being done at the same time.
That leaves the real problem of the synchronization itself. I felt that because the code involved was already sharing dma-buf's, the easiest way to implement it would be.. another dma-buf. Some hardware might have specific requirements on them, so I haven't pinned down the exact details yet.
It's a bit of intermingling between drm and dma-buf namespace since it is an early wip, any comments are welcome though.
This is what I used so far:
#define DRM_PRIME_FENCE_MAX 2 struct drm_prime_fence { struct dma_buf *sync_buf; uint64_t offset; uint32_t value; enum { /* Nop is to allow preparations in case dma_buf * is different for release, so the call will * never fail at that point. */ DRM_PRIME_FENCE_NOP = 0, DRM_PRIME_FENCE_WAIT_EQ, // DRM_PRIME_FENCE_WAIT_GE, /* block while ((int)(cur - expected) < 0); */ DRM_PRIME_FENCE_SET } op; };
and added to struct dma_buf_ops: /* drm_prime_fence_ is written by function to indicate what is needed * to acquire this buffer, up to DRM_PRIME_FENCE_MAX buffers are allowed * sync_acquire returns a negative value on error, otherwise * amounts of fence ops that need to be executed. * * Release is not allowed to fail and merely returns number of * fence ops that needs to be executed after command stream is done. * Abort occurs when there's a failure between acquire and release, * for example because dma-buf's from multiple devices are involved * and the other one failed to acquire. */ int (*sync_acquire)(struct dma_buf *, struct drm_prime_fence fence[2], unsigned long align, unsigned long release_write); int (*sync_release)(struct dma_buf * struct drm_prime_fence fence[2]); void (*sync_abort)(struct dma_buf *);
I'm not completely sure about this part yet, align can be seen as minimum alignment requirement, ideally I would negotiate those earlier but I haven't found the correct place yet, maybe on attach? nouveau writes a 16 byte stamp as part of it's semaphore ops (4 bytes programmable, 4 bytes pad, 8 bytes timestamp) which is why I need to communicate those requirements somehow. Not all nouveau cards wold support DRM_PRIME_FENCE_WAIT_GE either.
I think there is a great power in making the sync object itself just another dma-buf that can be written to and/or read. Especially since all graphics card have some way to write an arbitrary 4-byte value to an arbitrary location (even the oldest intel cards have a blitter! :D). I'm hoping for more input into making the api better for other users too, which is why I'm posting this as early as I had something working (for some definition of working).
Thoughts? ~Maarten