(Disclaimer: I come from a graphics background, so sorry if I use graphicsy terminology; please let me know if any of this isn't clear. I tried.)
There is a wide range of hardware capabilities that require different programming approaches in order to perform optimally. We need to define an interface that is flexible enough to handle each of them, or else it won't be used and we'll be right back where we are today: with vendors rolling their own support for the things they need.
I'm going to try to enumerate some of the more distinctive usage patterns as I see them here.
- Many or all engines may sit behind asynchronous command stream interfaces. Programming is done through "batch buffers"; a set of commands operating on a set of in-memory buffers is prepared and then submitted to the kernel to be queued. The kernel will first make sure all of the buffers are resident (which may require paging or mapping into an IOMMU/GART, a.k.a. "pinning"), then queue the batch of commands. The hardware will process the commands at its earliest convenience, and then interrupt the CPU to notify it that it's done with the buffers (i.e. they can now be "unpinned"). Those familiar with graphics may recognize this programming model as a classic GPU command stream. But it doesn't need to be used exclusively with GPUs; any number of devices may have such an on-demand paging mechanism. (A rough sketch of this pin/queue/unpin flow follows this list.)
- In contrast, some engines may also stream to or from memory continuously (e.g., video capture or scanout); such buffers need to be pinned for an extended period of time, not tied to the command streams described above.
- There can be multiple different command streams working at the same time on the same buffers. (There may be hardware synchronization primitives between the multiple command streams so the CPU doesn't have to babysit too much, for both performance and power reasons.)
- In some systems, IOMMU/GART may be much smaller than physical memory; older GPUs and SoCs have this. To support these, we need to be able to map and unmap pages into the IOMMU on demand in our host command stream flow. This model also requires patching up pending batch buffers before queueing them to the hardware, to update them to point to the newly-mapped location in the IOMMU.
- In other systems, IOMMU/GART may be much larger than physical memory; more modern GPUs and SoCs have this. With these, we can reserve virtual (IOMMU) address space for each buffer up front, so to userspace the buffers always appear "mapped". This is similar in concept to how a process's CPU virtual address space sticks around even when the underlying memory is paged out to disk. In this case, pinning is performed at the same point as in the small-IOMMU case above, but in the normal/fast case the pages are never paged out of the IOMMU, and the pin step just increments a refcount to prevent the pages from being evicted. It is desirable to keep the same IOMMU address for:
  a) implementing features such as http://www.opengl.org/registry/specs/NV/shader_buffer_load.txt (OpenGL client applications and shaders manipulate GPU vaddr pointers directly; a GPU virtual address is assumed to be valid forever), and
  b) performance: scanning through the command buffers to patch up pointers can be very expensive.
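To make the above concrete, here is a rough, purely illustrative C sketch of the pin -> patch -> queue -> complete -> unpin cycle. Every name in it (struct mem_buffer, iommu_map_pages(), queue_to_hardware(), and so on) is made up for this example and does not correspond to any real kernel or driver API; error unwinding is mostly omitted for brevity.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_buffer {
    void     *pages;        /* backing pages (opaque here) */
    uint64_t  iommu_addr;   /* address in the IOMMU/GART, 0 if unmapped */
    int       pin_count;    /* while > 0, the pages may not be evicted */
    bool      persistent;   /* large-IOMMU case: vaddr reserved up front */
};

struct batch {
    uint32_t           *cmds;          /* command words for the engine */
    size_t              nr_cmds;
    struct mem_buffer **bufs;          /* buffers the commands reference */
    size_t             *reloc_offsets; /* one patch site per buffer, for simplicity */
    size_t              nr_bufs;
};

/* Hypothetical low-level helpers, assumed to exist elsewhere. */
uint64_t iommu_map_pages(void *pages);
void     iommu_unmap_pages(uint64_t iommu_addr);
void     queue_to_hardware(const uint32_t *cmds, size_t nr_cmds);

/* Make a buffer resident and visible to the device ("pinning"). */
static int pin_buffer(struct mem_buffer *buf)
{
    if (buf->pin_count++ == 0 && !buf->persistent) {
        /* Small-IOMMU case: map on demand; may require evicting others. */
        buf->iommu_addr = iommu_map_pages(buf->pages);
        if (!buf->iommu_addr) {
            buf->pin_count--;
            return -1;
        }
    }
    /*
     * Large-IOMMU case: iommu_addr was reserved at allocation time and
     * never changes, so pinning is just the refcount bump above.
     */
    return 0;
}

/* Submission path, roughly the first bullet above. */
int submit_batch(struct batch *b)
{
    for (size_t i = 0; i < b->nr_bufs; i++) {
        if (pin_buffer(b->bufs[i]) != 0)
            return -1;  /* unpinning of earlier buffers omitted for brevity */

        /*
         * Small-IOMMU case only: patch the pending command stream so the
         * embedded pointer refers to wherever the buffer landed this time.
         * (Assume 32-bit device addresses for this toy example.)
         */
        if (!b->bufs[i]->persistent)
            b->cmds[b->reloc_offsets[i]] = (uint32_t)b->bufs[i]->iommu_addr;
    }
    queue_to_hardware(b->cmds, b->nr_cmds);
    return 0;
}

/* Completion interrupt: the engine is done with the buffers. */
void batch_complete(struct batch *b)
{
    for (size_t i = 0; i < b->nr_bufs; i++) {
        struct mem_buffer *buf = b->bufs[i];
        if (--buf->pin_count == 0 && !buf->persistent)
            iommu_unmap_pages(buf->iommu_addr);
    }
}

The "persistent" flag here just stands in for the large-IOMMU case, where the device address is reserved when the buffer is allocated and never changes, so neither command-stream patching nor unmapping happens on the fast path.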
One other important note: buffer format properties may be necessary to set up mappings (both CPU and IOMMU mappings). For example, both types of mappings may need to know tiling properties of the buffer. This may be a property of the mapping itself (consider it baked into the page table entries), not necessarily something a different driver or userspace can program later independently.
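As a purely hypothetical illustration of that point (none of these names are a real API), the tiling mode might be consumed while the device page table entries are written, so the resulting mapping is only meaningful for that one layout:

#include <stddef.h>
#include <stdint.h>

/* Invented PTE bits, for illustration only. */
#define PTE_VALID        (1ull << 0)
#define PTE_MODE_LINEAR  (0ull << 1)
#define PTE_MODE_TILED   (1ull << 1)

enum buf_layout {
    BUF_LAYOUT_LINEAR,
    BUF_LAYOUT_TILED,       /* exact tile geometry doesn't matter here */
};

struct buf_format {
    enum buf_layout layout;
    unsigned int    pitch;  /* bytes per row as the hardware sees it; a real
                               driver would fold this in too */
};

/*
 * Build the device page table entries for a buffer.  The layout is baked
 * into each PTE, so neither userspace nor another driver can meaningfully
 * change it later without tearing down and recreating the mapping.
 */
void device_map_buffer(uint64_t *ptes, const uint64_t *phys_pages,
                       size_t nr_pages, const struct buf_format *fmt)
{
    uint64_t mode = (fmt->layout == BUF_LAYOUT_TILED) ? PTE_MODE_TILED
                                                      : PTE_MODE_LINEAR;
    for (size_t i = 0; i < nr_pages; i++)
        ptes[i] = phys_pages[i] | mode | PTE_VALID;
}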
Some of the discussion I heard this morning tended towards being overly simplistic and didn't seem to cover each of these cases well. Hopefully this will help get everyone on the same page.
Thanks, Robert
On Tue, May 10, 2011 at 9:35 PM, rmorell@nvidia.com wrote:
> One other important note: buffer format properties may be necessary to set up mappings (both CPU and IOMMU mappings). For example, both types of mappings may need to know tiling properties of the buffer. This may be a property of the mapping itself (consider it baked into the page table entries), not necessarily something a different driver or userspace can program later independently.
I've been thinking a bit about this... we have something similar with TILER on OMAP4 (which is sort of a system-wide IOMMU that can also do tiling), where you wouldn't see the same data if you just remapped the same physical pages elsewhere. (I think GARTs can in some cases untile to provide the CPU with a sensible view of tiled surfaces, which would be somewhat similar... but disclaimer: I'm not coming from a desktop graphics background, so please correct me as needed ;-).) At least in the OMAP4 case, any other hw block in the system can read through the remapped TILER address and continue to see an untiled layout of the pixel data.
But you could still take the hit of copying the buffer contents to a tiled format the first time it is used somewhere that requires tiling, as long as everyone else keeps accessing it via the TILER address (so they see the untiled view). That's acceptable as long as you are not doing the copy every time you use the buffer, only the first time you pass it around. It does get complicated if some hw bits are already using the buffer. Maybe having a sync object associated with the buffer, so you have some way to wait until others are not using it, would help. And I think you need similar infrastructure if you wanted to be able to lazily unmap buffers only when you start running out of mapping space, i.e. if your IOMMU has a limited amount of memory that can be mapped at a time.
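A very rough sketch of that idea, just to make the ordering concrete; sync_wait(), alloc_tiled_copy() and blit_to_tiled() are made-up placeholders here, not TILER or DRM calls:

struct sync_obj;   /* opaque; signalled when no engine is using the buffer */

/* Hypothetical helpers assumed to exist elsewhere. */
void  sync_wait(struct sync_obj *sync);
void *alloc_tiled_copy(const void *linear);
void  blit_to_tiled(void *tiled, const void *linear);

struct shared_buf {
    void            *linear;   /* original, untiled backing store */
    void            *tiled;    /* lazily created tiled copy, or NULL */
    struct sync_obj *sync;
};

/*
 * Called the first time the buffer is handed to a block that needs the
 * tiled layout.  Everyone else keeps going through the TILER address and
 * sees the untiled view, so the copy is paid once, not per use.
 */
void *get_tiled_view(struct shared_buf *buf)
{
    if (!buf->tiled) {
        /* Don't convert underneath hardware that is still using the buffer. */
        sync_wait(buf->sync);
        buf->tiled = alloc_tiled_copy(buf->linear);
        blit_to_tiled(buf->tiled, buf->linear);
    }
    return buf->tiled;
}

The same kind of sync object is what you'd wait on before lazily unmapping buffers when IOMMU mapping space runs low.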
On the other hand, maybe it is easier just to rev the DRI2 protocol to pass more information to the xorg driver so it could make a better decision about how to allocate the buffer in the first place.
BR, -R
Hi,
On Wed, May 11, 2011 at 03:59:24AM -0500, Clark, Rob wrote:
> I've been thinking a bit about this... we have something similar with TILER on OMAP4 (which is sort of a system-wide IOMMU that can also do tiling), where you wouldn't see the same data if you just remapped the same physical pages elsewhere. (I think GARTs can in some cases untile to provide the CPU with a sensible view of tiled surfaces, which would be somewhat similar... but disclaimer: I'm not coming from a desktop graphics background, so please correct me as needed ;-).) At least in the OMAP4 case, any other hw block in the system can read through the remapped TILER address and continue to see an untiled layout of the pixel data.
Yes, you're correct: most desktop hardware that I know of can expose different views of surfaces, including untiled views of tiled surfaces. Hence how we've managed to this day to avoid introducing a tiled rasteriser in X, despite most target hardware requiring tiling internally. ;)
I believe new(er) NVIDIA hardware both enforces tiling (at least in some cases) and is not able to expose untiled views, though.
Cheers, Daniel