(Disclaimer: I come from a graphics background, so sorry if I use graphicsy terminology; please let me know if any of this isn't clear. I tried.)
There is a wide range of hardware capabilities that require different programming approaches in order to perform optimally. We need to define an interface that is flexible enough to handle each of them, or else it won't be used and we'll be right back where we are today: with vendors rolling their own support for the things they need.
I'm going to try to enumerate some of the more distinctive usage patterns as I see them here.
- Many or all engines may sit behind asynchronous command stream interfaces. Programming is done through "batch buffers"; a set of commands operating on a set of in-memory buffers is prepared and then submitted to the kernel to be queued. The kernel will first make sure all of the buffers are resident (which may require paging or mapping into an IOMMU/GART, a.k.a. "pinning"), then queue the batch of commands. The hardware will process the commands at its earliest convenience, and then interrupt the CPU to notify it that it's done with the buffers (i.e. they can now be "unpinned"). Those familiar with graphics may recognize this programming model as a classic GPU command stream. But it doesn't need to be used exclusively with GPUs; any number of devices may have such an on-demand paging mechanism. (A rough sketch of this pin/queue/unpin flow follows this list.)
- In contrast, some engines may also stream to or from memory continuously (e.g., video capture or scanout); such buffers need to be pinned for an extended period of time, not tied to the command streams described above.
- There can be multiple different command streams working at the same time on the same buffers. (There may be hardware synchronization primitives between the multiple command streams so the CPU doesn't have to babysit too much, for both performance and power reasons.)
- In some systems, IOMMU/GART may be much smaller than physical memory; older GPUs and SoCs have this. To support these, we need to be able to map and unmap pages into the IOMMU on demand in our host command stream flow. This model also requires patching up pending batch buffers before queueing them to the hardware, to update them to point to the newly-mapped location in the IOMMU.
- In other systems, IOMMU/GART may be much larger than physical memory; more modern GPUs and SoCs have this. With these, we can reserve virtual (IOMMU) address space for each buffer up front, so to userspace the buffers always appear "mapped". This is similar in concept to how a process's CPU virtual address space sticks around even when the underlying memory is paged out to disk. In this case, pinning is performed at the same point as in the small-IOMMU case above, but in the normal/fast case the pages are never paged out of the IOMMU, and the pin step just increments a refcount to prevent the pages from being evicted. It is desirable to keep the same IOMMU address for:
  a) implementing features such as http://www.opengl.org/registry/specs/NV/shader_buffer_load.txt (OpenGL client applications and shaders manipulate GPU vaddr pointers directly; a GPU virtual address is assumed to be valid forever), and
  b) performance: scanning through the command buffers to patch up pointers can be very expensive.
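To make the above concrete, here is a rough, purely illustrative C sketch of the pin -> patch -> queue -> complete -> unpin cycle. Every name in it (struct mem_buffer, iommu_map_pages(), queue_to_hardware(), and so on) is made up for this example and does not correspond to any real kernel or driver API; error unwinding is mostly omitted for brevity.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_buffer {
    void     *pages;        /* backing pages (opaque here) */
    uint64_t  iommu_addr;   /* address in the IOMMU/GART, 0 if unmapped */
    int       pin_count;    /* while > 0, the pages may not be evicted */
    bool      persistent;   /* large-IOMMU case: vaddr reserved up front */
};

struct batch {
    uint32_t           *cmds;          /* command words for the engine */
    size_t              nr_cmds;
    struct mem_buffer **bufs;          /* buffers the commands reference */
    size_t             *reloc_offsets; /* one patch site per buffer, for simplicity */
    size_t              nr_bufs;
};

/* Hypothetical low-level helpers, assumed to exist elsewhere. */
uint64_t iommu_map_pages(void *pages);
void     iommu_unmap_pages(uint64_t iommu_addr);
void     queue_to_hardware(const uint32_t *cmds, size_t nr_cmds);

/* Make a buffer resident and visible to the device ("pinning"). */
static int pin_buffer(struct mem_buffer *buf)
{
    if (buf->pin_count++ == 0 && !buf->persistent) {
        /* Small-IOMMU case: map on demand; may require evicting others. */
        buf->iommu_addr = iommu_map_pages(buf->pages);
        if (!buf->iommu_addr) {
            buf->pin_count--;
            return -1;
        }
    }
    /*
     * Large-IOMMU case: iommu_addr was reserved at allocation time and
     * never changes, so pinning is just the refcount bump above.
     */
    return 0;
}

/* Submission path, roughly the first bullet above. */
int submit_batch(struct batch *b)
{
    for (size_t i = 0; i < b->nr_bufs; i++) {
        if (pin_buffer(b->bufs[i]) != 0)
            return -1;  /* unpinning of earlier buffers omitted for brevity */

        /*
         * Small-IOMMU case only: patch the pending command stream so the
         * embedded pointer refers to wherever the buffer landed this time.
         * (Assume 32-bit device addresses for this toy example.)
         */
        if (!b->bufs[i]->persistent)
            b->cmds[b->reloc_offsets[i]] = (uint32_t)b->bufs[i]->iommu_addr;
    }
    queue_to_hardware(b->cmds, b->nr_cmds);
    return 0;
}

/* Completion interrupt: the engine is done with the buffers. */
void batch_complete(struct batch *b)
{
    for (size_t i = 0; i < b->nr_bufs; i++) {
        struct mem_buffer *buf = b->bufs[i];
        if (--buf->pin_count == 0 && !buf->persistent)
            iommu_unmap_pages(buf->iommu_addr);
    }
}

The "persistent" flag here just stands in for the large-IOMMU case, where the device address is reserved when the buffer is allocated and never changes, so neither command-stream patching nor unmapping happens on the fast path.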
One other important note: buffer format properties may be necessary to set up mappings (both CPU and IOMMU mappings). For example, both types of mappings may need to know tiling properties of the buffer. This may be a property of the mapping itself (consider it baked into the page table entries), not necessarily something a different driver or userspace can program later independently.
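As a purely hypothetical illustration of that point (none of these names are a real API), the tiling mode might be consumed while the device page table entries are written, so the resulting mapping is only meaningful for that one layout:

#include <stddef.h>
#include <stdint.h>

/* Invented PTE bits, for illustration only. */
#define PTE_VALID        (1ull << 0)
#define PTE_MODE_LINEAR  (0ull << 1)
#define PTE_MODE_TILED   (1ull << 1)

enum buf_layout {
    BUF_LAYOUT_LINEAR,
    BUF_LAYOUT_TILED,       /* exact tile geometry doesn't matter here */
};

struct buf_format {
    enum buf_layout layout;
    unsigned int    pitch;  /* bytes per row as the hardware sees it; a real
                               driver would fold this in too */
};

/*
 * Build the device page table entries for a buffer.  The layout is baked
 * into each PTE, so neither userspace nor another driver can meaningfully
 * change it later without tearing down and recreating the mapping.
 */
void device_map_buffer(uint64_t *ptes, const uint64_t *phys_pages,
                       size_t nr_pages, const struct buf_format *fmt)
{
    uint64_t mode = (fmt->layout == BUF_LAYOUT_TILED) ? PTE_MODE_TILED
                                                      : PTE_MODE_LINEAR;
    for (size_t i = 0; i < nr_pages; i++)
        ptes[i] = phys_pages[i] | mode | PTE_VALID;
}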
Some of the discussion I heard this morning tended towards being overly simplistic and didn't seem to cover each of these cases well. Hopefully this will help get everyone on the same page.
Thanks, Robert
On Tue, May 10, 2011 at 9:35 PM, rmorell@nvidia.com wrote:
> One other important note: buffer format properties may be necessary to set up mappings (both CPU and IOMMU mappings). For example, both types of mappings may need to know tiling properties of the buffer. This may be a property of the mapping itself (consider it baked into the page table entries), not necessarily something a different driver or userspace can program later independently.
I've been thinking a bit about this... we have something similar with TILER on OMAP4 (which is sort of a system-wide IOMMU that can also do tiling), where you wouldn't see the same data if you just remapped the same physical pages elsewhere. (I think GARTs can in some cases untile to provide the CPU with a sensible view of tiled surfaces, which would be somewhat similar... but disclaimer: I'm not coming from a desktop graphics background, so please correct me as needed ;-).) At least in the OMAP4 case, any other hw block in the system can read through the remapped TILER address and continue to see an untiled layout of the pixel data.
But you could still take the hit of copying the buffer contents to a tiled format the first time it is used somewhere that requires tiling, as long as everyone else keeps accessing it via the TILER address (so they see the untiled view). That's acceptable as long as you are not doing the copy every time you use the buffer, only the first time you pass it around. It does get complicated if some hw bits are already using the buffer. Maybe having a sync object associated with the buffer, so you have some way to wait until others are not using it, would help. And I think you need similar infrastructure if you wanted to be able to lazily unmap buffers only when you start running out of mapping space, i.e. if your IOMMU has a limited amount of memory that can be mapped at a time.
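A very rough sketch of that idea, just to make the ordering concrete; sync_wait(), alloc_tiled_copy() and blit_to_tiled() are made-up placeholders here, not TILER or DRM calls:

struct sync_obj;   /* opaque; signalled when no engine is using the buffer */

/* Hypothetical helpers assumed to exist elsewhere. */
void  sync_wait(struct sync_obj *sync);
void *alloc_tiled_copy(const void *linear);
void  blit_to_tiled(void *tiled, const void *linear);

struct shared_buf {
    void            *linear;   /* original, untiled backing store */
    void            *tiled;    /* lazily created tiled copy, or NULL */
    struct sync_obj *sync;
};

/*
 * Called the first time the buffer is handed to a block that needs the
 * tiled layout.  Everyone else keeps going through the TILER address and
 * sees the untiled view, so the copy is paid once, not per use.
 */
void *get_tiled_view(struct shared_buf *buf)
{
    if (!buf->tiled) {
        /* Don't convert underneath hardware that is still using the buffer. */
        sync_wait(buf->sync);
        buf->tiled = alloc_tiled_copy(buf->linear);
        blit_to_tiled(buf->tiled, buf->linear);
    }
    return buf->tiled;
}

The same kind of sync object is what you'd wait on before lazily unmapping buffers when IOMMU mapping space runs low.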
On the other hand, maybe it is easier just to rev the DRI2 protocol to pass more information to the xorg driver so it could make a better decision about how to allocate the buffer in the first place.
BR, -R
Hi,
On Wed, May 11, 2011 at 03:59:24AM -0500, Clark, Rob wrote:
> I've been thinking a bit about this... we have something similar with TILER on OMAP4 (which is sort of a system-wide IOMMU that can also do tiling), where you wouldn't see the same data if you just remapped the same physical pages elsewhere. (I think GARTs can in some cases untile to provide the CPU with a sensible view of tiled surfaces, which would be somewhat similar... but disclaimer: I'm not coming from a desktop graphics background, so please correct me as needed ;-).) At least in the OMAP4 case, any other hw block in the system can read through the remapped TILER address and continue to see an untiled layout of the pixel data.
Yes, you're correct: most desktop hardware that I know of can expose different views of surfaces, including untiled views of tiled surfaces. Hence how we've managed to this day to avoid introducing a tiled rasteriser in X, despite most target hardware requiring tiling internally. ;)
I believe new(er) NVIDIA hardware both enforces tiling (at least in some cases) and is not able to expose untiled views, though.
Cheers, Daniel