-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: 19 March 2012 16:57
To: Tom Cooksey
Cc: 'Rob Clark'; linaro-mm-sig@lists.linaro.org; dri-devel@lists.freedesktop.org; linux-media@vger.kernel.org; rschultz@google.com; Rob Clark; sumit.semwal@linaro.org; patches@linaro.org
Subject: Re: [PATCH] RFC: dma-buf: userspace mmap support
If the API was to also be used for synchronization it would have to include an atomic "prepare multiple" ioctl which blocked until all the buffers listed by the application were available. In the same
Too slow already. You are now serializing stuff while what we want to do really is
    nobody_else_gets_buffers_next([list])
    on available(buffer)
        dispatch_work(buffer)
so that you can maximise parallelism without allowing deadlocks. If you've got a high memory bandwidth and 8+ cores the 'stop everything' model isn't great.
Yes, sorry, I wasn't clear here. By atomic I meant that a job starts using all buffers at the same time, once they are available. You are right, a job waiting for a list of buffers to become available should not prevent other jobs running or queuing new jobs (eughh).
We actually have the option of using asynchronous call-backs in KDS: a driver lists all the buffers it needs when adding a job, and that job gets added to the FIFO of each buffer as an atomic operation. However, once the job is added to all the FIFOs, that atomic operation is complete and another job can be "queued" up. When a job completes, it is removed from each buffer's FIFO. At that point, all the "next" jobs in each buffer's FIFO are evaluated to see if they can run. If they can run, the job's "start" call-back is called.
There's also a synchronous mode of operation where a blocked thread is "woken up" instead of calling a call-back function. It is this synchronous mode that I would imagine being used for user-space access.
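To make the FIFO scheme above concrete, here is a minimal single-threaded Python sketch. It is not KDS code: the names Buffer, Job, add_job and complete_job are made up for this example, it models exclusive access only, and the enqueue step that would need a lock in a real driver is simply a loop here.

```python
from collections import deque


class Buffer:
    def __init__(self, name):
        self.name = name
        self.fifo = deque()  # jobs using or waiting for this buffer, oldest first


class Job:
    def __init__(self, name, buffers, start):
        self.name = name
        self.buffers = buffers  # every buffer the job needs, listed up front
        self.start = start      # "start" call-back, called once all buffers are free
        self.running = False


def can_run(job):
    # A job may start only when it heads the FIFO of every buffer it needs.
    return all(buf.fifo[0] is job for buf in job.buffers)


def _try_start(job):
    if not job.running and can_run(job):
        job.running = True
        job.start(job)


def add_job(job):
    # Hypothetical stand-in for the atomic enqueue: a real implementation
    # would hold a lock across this loop so FIFO order is consistent.
    for buf in job.buffers:
        buf.fifo.append(job)
    _try_start(job)


def complete_job(job):
    for buf in job.buffers:
        buf.fifo.remove(job)
    # Evaluate the "next" job on each buffer the finished job released.
    for buf in job.buffers:
        if buf.fifo:
            _try_start(buf.fifo[0])


started = []


def note(job):
    started.append(job.name)


a, b = Buffer("a"), Buffer("b")
j1 = Job("j1", [a, b], note)
j2 = Job("j2", [b], note)
j3 = Job("j3", [a, b], note)
for j in (j1, j2, j3):
    add_job(j)          # only j1 starts; j2 and j3 queue behind it

complete_job(j1)        # j2 heads b's FIFO and starts; j3 still waits on b
complete_job(j2)        # j3 now heads both FIFOs and starts
print(started)          # ['j1', 'j2', 'j3']
```

Because every job is appended to all of its FIFOs in one step, two jobs can never each hold the head of a buffer the other needs, which is why queuing more jobs does not risk deadlock in this model.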
This might be a good argument for keeping synchronization and cache maintenance separate, though even ignoring synchronization I would think being able to issue cache maintenance operations for multiple buffers in a single ioctl might present some small efficiency gains. However, as Rob points out, CPU access is already in slow/legacy territory.
Dangerous assumption. I do think they should be separate. For one, it makes the case where synchronization is needed but cache management is done in hardware much easier to split cleanly. Assuming CPU access is slow/legacy reflects a certain model of relatively slow CPU and accelerators where falling off the acceleration path is bad. On a higher end processor falling off the acceleration path isn't a performance matter so much as a power concern.
On some GPU architectures, glReadPixels is a _very_ heavy-weight operation, so it is very much a performance issue and I think always will be. However, I think this might be a special case for certain GPUs: other GPU architectures or device types might be able to share data with the CPU without such a large impact on performance. Writing subtitles onto a video frame decoded by a v4l2 hardware codec seems a good example.
KDS we differentiated jobs which needed "exclusive access" to a buffer and jobs which needed "shared access" to a buffer. Multiple jobs could access a buffer at the same time if those jobs all
Makes sense as it's a reader/writer lock and it reflects MESI/MOESI caching and cache policy in some hardware/software assists.
Actually, this got me thinking... Several ARM implementations rely on CPU/NEON to perform X.Org's 2D operations and those tend to operate directly on the framebuffer. So in that case, both the CPU and display controller need to access the same buffer at the same time, even though one of them is writing to the buffer. This is the main reason we called it shared/exclusive access in KDS rather than read-only/read-write access. In such scenarios, you'd still want to do a CPU cache flush after CPU-based 2D drawing is complete to make sure the display controller "saw" those changes.
So yes, perhaps there's actually a use-case where synchronization must be kept separate from cache maintenance? In which case, is it worth making the proposed prepare/finish API more explicit, in that it is a CPU cache invalidate and CPU cache flush operation only? Or are there other things one might want to do in prepare/finish? Automatic cache domain tracking, for example?
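The shared/exclusive distinction can be sketched as an admission rule on a single buffer's FIFO. This assumes the usual reader/writer reading of it (consecutive shared jobs at the front run together, an exclusive job runs alone); the function and job names are invented for illustration and are not the KDS interface.

```python
SHARED, EXCLUSIVE = "shared", "exclusive"


def runnable(fifo):
    """Return the names of the jobs at the front of `fifo` that may use the buffer now.

    `fifo` is a list of (job_name, mode) pairs, oldest first.
    """
    if not fifo:
        return []
    name, mode = fifo[0]
    if mode == EXCLUSIVE:
        return [name]  # an exclusive job at the head runs alone
    granted = []
    for name, mode in fifo:
        if mode != SHARED:
            break      # the first exclusive job blocks everything queued behind it
        granted.append(name)
    return granted


# CPU 2D drawing and display scan-out both ask for shared access, so they
# overlap even though the CPU is writing; the exclusive GPU job waits.
framebuffer = [("cpu_2d", SHARED), ("display", SHARED), ("gpu_render", EXCLUSIVE)]
print(runnable(framebuffer))  # ['cpu_2d', 'display']
```

Note that "shared" here says nothing about reads versus writes, which matches the framebuffer case above: the mode only decides who may overlap, and any cache flush still has to happen separately.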
display controller will be reading the front buffer, but the GPU might also need to read that front buffer. So perhaps adding "read-only" & "read-write" access flags to prepare could also be interpreted as shared & exclusive accesses, if we went down this route for synchronization that is. :-)
mmap includes read/write info so probably using that works out. It also means that you have the stuff mapped in a way that will bus error or segfault anyone who goofs rather than give them the usual 'deep weirdness' behaviour you get with mishandling of caching bits.
I think it might be possible to make the case for caching user-space mappings. In which case, you might want to always mmap read-write but sometimes do a read operation and sometimes a write. So I think we'd prefer not to make the read-only/read-write decision at mmap time.
Cheers,
Tom