Hello,
Following the discussion about the driver for the IOMMU controller on the
Samsung Exynos4 platform and Arnd's suggestions, I've decided to start
working on a redesign of the dma-mapping implementation for the ARM
architecture. The goal is to add support for IOMMU in the way preferred
by the community :)
Some of the ideas about merging the dma-mapping API and the IOMMU API
come from the following threads:
http://www.spinics.net/lists/linux-media/msg31453.html
http://www.spinics.net/lists/arm-kernel/msg122552.html
http://www.spinics.net/lists/arm-kernel/msg124416.html
They were also discussed on Linaro memory management meeting at UDS
(Budapest 9-12 May).
I've finally managed to clean up my work a bit and present the initial,
very proof-of-concept version of the patches that were ready just before
the Linaro meeting.
What have been implemented:
1. Introduced arm_dma_ops
dma_map_ops from include/linux/dma-mapping.h suffers from the following
limitations:
- lack of a start address for sync operations
- lack of write-combine methods
- lack of methods to mmap buffers to user space
- lack of a map_single method
For this initial version I've decided to use custom arm_dma_ops.
Extending the common interface will take time; until then I wanted to
have something already working.
dma_{alloc,free,mmap}_{coherent,writecombine} have been consolidated
into dma_{alloc,free,mmap}_attrib, as suggested at the Linaro meeting.
A new attribute for WRITE_COMBINE memory has been introduced.
2. Moved all inline ARM dma-mapping related operations to
arch/arm/mm/dma-mapping.c and turned them into methods of the generic
arm_dma_ops structure. The dma-mapping.c code definitely needs cleanup,
but this is just a first step.
3. Added very initial IOMMU support. Right now it is limited to
dma_alloc_attrib, dma_free_attrib and dma_mmap_attrib. It has been
tested with the s5p-fimc driver on the Samsung Exynos4 platform.
4. Adapted the Samsung Exynos4 IOMMU driver to make use of the
introduced iommu_dma proposal.
This patch series contains only patches for common dma-mapping part.
There is also a patch that adds the driver for the Samsung IOMMU
controller on the Exynos4 platform. All required patches are available
on:
git://git.infradead.org/users/kmpark/linux-2.6-samsung dma-mapping branch
Git web interface:
http://git.infradead.org/users/kmpark/linux-2.6-samsung/shortlog/refs/heads…
Future:
1. Add all missing operations for IOMMU mappings (map_single/page/sg,
sync_*)
2. Move sync_* operations into separate functions for better code
sharing between the iommu and non-iommu dma-mapping code
3. Split out the dma-bounce code from the non-bounce code into a
separate set of dma methods. Right now the dma-bounce code is compiled
conditionally and spread over arch/arm/mm/dma-mapping.c and
arch/arm/common/dmabounce.c.
4. Merge dma_map_single with dma_map_page. I haven't investigated
deeply why they have separate implementations on ARM. If this is a
requirement then dma_map_ops needs to be extended with another method.
5. Fix dma_alloc to unmap buffers from the linear mapping.
6. Convert the IO address space management code from gen-alloc to a
simpler bitmap-based solution.
7. Resolve issues that arise during discussion & comments
Please note that this is a very early version of the patches,
definitely NOT intended for merging. I just wanted to make sure that
the direction is right, and to share the code with others who might
want to cooperate on dma-mapping improvements.
Best regards
--
Marek Szyprowski
Samsung Poland R&D Center
Patch summary:
Marek Szyprowski (2):
ARM: Move dma related inlines into arm_dma_ops methods
ARM: initial proof-of-concept IOMMU mapper for DMA-mapping
arch/arm/Kconfig | 1 +
arch/arm/include/asm/device.h | 3 +
arch/arm/include/asm/dma-iommu.h | 30 ++
arch/arm/include/asm/dma-mapping.h | 653 +++++++++++------------------
arch/arm/mm/dma-mapping.c | 817 +++++++++++++++++++++++++++++++++---
arch/arm/mm/vmregion.h | 2 +-
include/linux/dma-attrs.h | 1 +
7 files changed, 1033 insertions(+), 474 deletions(-)
create mode 100644 arch/arm/include/asm/dma-iommu.h
--
1.7.1.569.g6f426
After the mm panels, I had a few discussions with Hans, Rob and Daniel,
among others, during the V4L and KMS discussions and afterwards. Based
on those discussions, I'm pretty much convinced that the normal MMAP
way of streaming (the VIDIOC_[REQBUF|STREAMON|STREAMOFF|QBUF|DQBUF]
ioctls) is not the best way to share data with framebuffers. We
probably need something close to VIDIOC_FBUF/VIDIOC_OVERLAY, but still
not quite the same thing.
I suspect that working on such an API is somewhat orthogonal to the
decision between a file-pointer-based and a buffer-ID-based kABI for
passing the buffer parameters to the new V4L calls, but we cannot
decide on the type of buffer ID we'll use until we finish working on an
initial RFC for the V4L API, as the way the buffers will be passed into
it will depend on how we design that API.
It should also be noted that, while for the shared buffers some
definitions can be postponed until later (as it is basically
a kernelspace-only ABI - at least initially), the V4L API should be
designed with all possible scenarios in mind, as "diamonds and
userspace APIs are forever" (tm).
It seems to me that the proper way to develop such an API is to start
working with the Xorg V4L driver, changing it to work with KMS and with
the new API (probably porting some parts of it to kernelspace).
One of the problems with a shared framebuffer is that an overlaid V4L
stream may, in the worst case, be sent to up to 4 different GPUs and/or
displays, like:
===================+===================
| | |
| D1 +----|---+ D2 |
| | V4L| | |
+-------------|----+---|--------------|
| | | | |
| D3 +----+---+ D4 |
| | |
=======================================
Where D1, D2, D3 and D4 are 4 different displays, and the same V4L
framebuffer is partially shared between them (the above is an example
of a V4L input, although the reverse scenario of one framebuffer
divided into 4 V4L outputs also seems possible).
As the same image may be divided across 4 monitors, the buffer filling
should be synced with all of them in order to avoid flipping effects.
Also, the buffer can't be reused until all displays have finished
reading it.
Display APIs currently have similar issues. From what I understood from
Rob and Daniel, this is solved there by dynamically allocating buffers.
So we may need to do something similar in V4L (as a matter of fact,
there's currently a proposal to hack REQBUFS in order to extend the V4L
API to allow dynamically creating more buffers than are used by a
stream). It makes sense to me to discuss that proposal together with
the above topics, in order to keep the API consistent.
From my side, I'm expecting that those responsible for the API
proposals will also provide open source drivers and userspace
application(s) that allow testing and validating such an API RFC.
Thanks,
Mauro
Hi,
Here are my own notes from the Linaro memory management mini-summit in
Budapest. I've written them from my own point of view, which is mostly
V4L2 in embedded devices and camera related use cases. I attempted to
summarise the discussion mostly concentrating into parts which I've
considered important and ignored the rest.
So please do not consider this as the generic notes of the mini-summit.
:-) I still felt like sharing this since it might be found useful by
those who are working with similar systems with similar problems.
Memory buffer management --- the future
=======================================
The memory buffer management can be split into the following
sub-problems, which may have dependencies both in implementation and
possibly in the APIs as well:
- Fulfilling buffer allocation requirements
- API to allocate buffers
- Sharing buffers among kernel subsystems (e.g. V4L2, DRM, FB)
- Sharing buffers between processes
- Cache coherency
It was agreed that we need the kernel to recognise a DMA buffer which
may be passed between user processes and different kernel subsystems.
Fulfilling buffer allocation requirements
-----------------------------------------
APIs, as well as devices, have different requirements on the buffers.
It is difficult to come up with generic requirements for buffer
allocation, and to keep the solution future-proof is challenging as
well. In principle the user is interested in being able to share
buffers between subsystems without knowing the exact requirements of
the devices, which makes it possible to keep the requirement handling
internal to the kernel. Whether this is the way to go or not, will be
seen in the future. The buffer allocation remains a problem to be
resolved in the future.
The majority of the devices' requirements could be fulfilled using a
few allocators: one for physically contiguous memory and another for
physically non-contiguous memory built from single-page allocations.
Being able to allocate large pages would also be beneficial in many
cases.
API to allocate buffers
-----------------------
It was agreed there was a need to have a generic interface for buffer
object creation. This could be either a new system call which would be
supported by all devices supporting such buffers in subsystem APIs
(such as V4L2), or a new dedicated character device.
Different subsystems have different ways of describing the properties
of the buffers, such as how the data in the buffer should be
interpreted. V4L2 has width, height, bytesperline and pixel format,
for example. The generic buffers should not recognise such properties,
since this is very subsystem-specific information. Instead, the user,
which is aware of the different subsystems, must come up with a
matching set of buffer properties using the subsystem-specific
interfaces.
Sharing buffers among kernel subsystems
---------------------------------------
There was discussion on how to refer to generic DMA buffers, and the
audience was first mostly split between using buffer IDs to refer to
the buffers and using file handles for the purpose. Using file handles
has pros and cons compared to numeric IDs:
+ Easy life cycle management. Deallocation of buffers no longer in use
is trivial.
+ Access control for files already exists. Passing file descriptors
between processes is possible through Unix sockets.
- Allocating extremely large number of buffers would require as many
file descriptors. This is not likely to be an important issue.
Before the day ended, it was felt that the file handles are the right
way to go.
The generic DMA buffers further need to be associated to the subsystem
buffers. This is up to the subsystem APIs. In V4L2, this would most
likely mean that there will be a new buffer type for the generic DMA
buffers.
Sharing buffers between processes
---------------------------------
Numeric IDs can be easily shared between processes, while sharing file
handles is more difficult. However, it can be done using Unix sockets
between any two processes. This also automatically gives the same
access control mechanism as every other file. Access control
mechanisms are mandatory when making the buffer shareable between
processes.
Cache coherency
---------------
Cache coherency is seen as largely orthogonal to any other sub-problem
in memory buffer management. In a few cases this might have something in
common with buffer allocation. Some architectures, ARM in particular, do
not have coherent caches, meaning that the operating system must know
when to invalidate or clean various parts of the cache. There are two
ways to approach the issue, independently of the cache implementation:
1. Allocate non-cacheable memory, or
2. invalidate or clean (or flush) the cache when necessary.
Allocating non-cacheable memory is a valid solution to cache coherency
handling in some situations, but mostly only when the buffer is only
partially accessed by the CPU or at least not multiple times. In other
cases, invalidating or cleaning the cache is the way to go.
The exact circumstances in which using non-cacheable memory gives a
performance benefit over invalidating or cleaning the cache when
necessary are very system and use case dependent. This should be
selectable from the user space.
The cache invalidation or cleaning can be either on the whole (data)
cache or a particular memory area. Performing the operation on a
particular memory area may be difficult since it should be done to all
mappings of the memory in the system. Also, there is a limit beyond
which performing invalidation or cleaning on an area is always more
expensive than a full cache flush: on many machines the cache line
size is 64 bytes, and the invalidate/clean must be performed per cache
line over the whole buffer, which in cameras could be tens of megabytes
in size.
Mapping buffers to application memory is not always necessary --- the
buffers may only be used by the devices, in which case a scatterlist
of the pages in the buffer is necessary to map the buffer to the IOMMU.
More (impartial :-)) information can be found here:
<URL:http://summit.ubuntu.com/uds-o/meeting/linaro-graphics-memory-managemen…>
<URL:http://summit.ubuntu.com/uds-o/meeting/linaro-graphics-memory-managemen…>
<URL:http://summit.ubuntu.com/uds-o/meeting/linaro-graphics-memory-managemen…>
Regards,
--
Sakari Ailus
sakari.ailus(a)maxwell.research.nokia.com
Hi all,
During the Budapest meetings it was mentioned that you can pass an fd
between processes. How does that work? Does someone have a code example
or a link to code that does that? Just to satisfy my curiosity.
Regards,
Hans
Thanks Jesse for initiating the mailing list.
We need to address the requirements of Graphics and Multimedia Accelerators
(IPs).
What we really need is a permanent solution (at upstream) which accommodates
the following requirements and conforms to Graphics and Multimedia use
cases.
1. A mechanism to map/unmap memory. Some of the IPs have the ability to
address virtual memory and some can address only physically contiguous
address space. We need to address both these cases.
2. A mechanism to allocate and release memory.
3. A method to share memory (ZERO copy is a MUST for better
performance) between different device drivers (for example, the output
of a camera to a multimedia encoder).
4. A method to share memory with different processes in userspace. The
sharing mechanism should include built-in security features.
Are there any special requirements from V4L or DRM perspectives?
Thanks,
Sree
(Disclaimer: I come from a graphics background, so sorry if I use graphicsy
terminology; please let me know if any of this isn't clear. I tried.)
There is a wide range of hardware capabilities that require different
programming approaches in order to perform optimally. We need to define an
interface that is flexible enough to handle each of them, or else it won't be
used and we'll be right back where we are today: with vendors rolling their own
support for the things they need.
I'm going to try to enumerate some of the more unique usage patterns as
I see them here.
- Many or all engines may sit behind asynchronous command stream interfaces.
Programming is done through "batch buffers"; a set of commands operating on a
set of in-memory buffers is prepared and then submitted to the kernel to be
queued. The kernel will first make sure all of the buffers are resident
(which may require paging or mapping into an IOMMU/GART, a.k.a. "pinning"),
then queue the batch of commands. The hardware will process the commands at
its earliest convenience, and then interrupt the CPU to notify it that
it's done with the buffers (i.e. they can now be "unpinned").
Those familiar with graphics may recognize this programming model as a
classic GPU command stream. But it doesn't need to be used exclusively with
GPUs; any number of devices may have such an on-demand paging mechanism.
- In contrast, some engines may also stream to or from memory continuously
(e.g., video capture or scanout); such buffers need to be pinned for an
extended period of time, not tied to the command streams described above.
- There can be multiple different command streams working at the same time on
the same buffers. (There may be hardware synchronization primitives between
the multiple command streams so the CPU doesn't have to babysit too much, for
both performance and power reasons.)
- In some systems, IOMMU/GART may be much smaller than physical memory; older
GPUs and SoCs have this. To support these, we need to be able to map and
unmap pages into the IOMMU on demand in our host command stream flow. This
model also requires patching up pending batch buffers before queueing them to
the hardware, to update them to point to the newly-mapped location in the
IOMMU.
- In other systems, IOMMU/GART may be much larger than physical memory; more
modern GPUs and SoCs have this. With these, we can reserve virtual (IOMMU)
address space for each buffer up front. To userspace, the buffers always
appear "mapped". This is similar in concept to how the CPU virtual space in
userspace sticks around even when the underlying memory is paged out to disk.
In this case, pinning is performed at the same time as the small-IOMMU case
above, but in the normal/fast case, the pages are never paged out of the
IOMMU, and the pin step just increments a refcount to prevent the pages from
being evicted.
It is desirable to keep the same IOMMU address for:
a) implementing features such as
http://www.opengl.org/registry/specs/NV/shader_buffer_load.txt
(OpenGL client applications and shaders manipulate GPU vaddr pointers
directly; a GPU virtual address is assumed to be valid forever).
b) performance: scanning through the command buffers to patch up pointers can
be very expensive.
One other important note: buffer format properties may be necessary to set up
mappings (both CPU and iommu mappings). For example, both types of mappings
may need to know tiling properties of the buffer. This may be a property of
the mapping itself (consider it baked into the page table entries), not
necessarily something a different driver or userspace can program later
independently.
Some of the discussion I heard this morning tended towards being overly
simplistic and didn't seem to cover each of these cases well. Hopefully this
will help get everyone on the same page.
Thanks,
Robert
Hi all,
A bit later than what I've hoped for, but here we go [Jesse and Dave,
please correct/clarify/extend where you see fit]:
The core idea of GEM is to identify graphics buffer objects with 32bit
ids, the reason being that "X runs out of open fds" (KDE easily reaches
a few thousand).
The core design principle behind GEM is that the kernel is in full
control of the allocation of these buffer objects and is free to move
them around in any way it sees fit. This is to make concurrent
rendering by multiple processes possible while userspace can still
assume that it is in sole possession of the gpu - GEM means "graphics
execution manager".
Below some more details on what GEM is and does, what it does
_not_ do and how it relates to other graphic subsystems.
GEM does ...
------------
- lifecycle management. Userspace references are associated with the drm
fd and get reaped on close (in case userspace forgets about them).
- per-device global names to exchange buffers between processes (eg dri2).
These names are again 32bit ids. These global ids do not count as
userspace references and don't prevent a buffer from being reaped.
- it implements very few generic ioctls:
* flink for creating a global name for a buffer object
* open for getting a per-fd handle to a buffer object with a global name
* close for dropping a per-fd handle.
- a little bit of kernel-internal helpers to facilitate mmap (by blending
multiple buffer objects into the single drm device address space) and a
few other things.
That's it, i.e. GEM is very much meant to be as simple as possible.
Driver-specific GEM ioctls
--------------------------
The generic GEM stuff is obviously not very useful on its own, so
drivers implement quite a few driver-specific ioctls, like:
- buffer creation. In recent kernels there is some support to create
dumb scanout objects for KMS, but they're only really useful for
boot-splashes and unaccelerated dumb KMS drivers. Creating buffers
usable for rendering is only possible with driver-specific ioctls.
- command submission. An important part is mapping abstract buffer ids to
actual gpu address (and rewriting batchbuffers with these). In the
future, with support for virtual gpu address spaces this might change.
- tiling management. The kernel needs to know this to correctly
tile/detile buffers when moving them around (e.g. evicting from vram).
- command completion signalling and gpu/cpu synchronization.
There are currently two approaches for implementing a GEM driver:
- roll-your-own, used by drm/i915 (and sometimes getting flak for NIH).
- ttm-based: radeon & nouveau.
GEM does not ...
----------------
This still leaves out a few things that I've seen mentioned as
ideas/requirements here and elsewhere:
- cross-device buffer sharing and namespaces (see below) and
- buffer format handling and mediation between different users (except
tiling as mentioned above). The reason here is that gpus are a mess
and one of the worst parts is format handling. Better keep that out
of the kernel ...
KMS (kernel mode setting)
-------------------------
KMS is essentially just a port of the xrandr api to the kernel as an ioctl
interface:
- crtcs feed (possibly multiple) outputs and get their data from a
framebuffer object. A major part of KMS is also the support for
vsynced-pageflipping of framebuffers.
- Internally there's some support infrastructure to simplify drivers (all
the drm_*_helper.c code).
- framebuffers are created from an opaque driver-specific 32bit id and a
format description. For GEM drivers these ids name GEM objects, but that
need not be: The recently merged qemu kms driver does not implement gem
and has one unique buffer object with id 0.
- as mentioned above, there is now a generic ioctl to create an object
suitable as a dumb scanout (plus some support to mmap it).
- currently KMS has no generic support for overlays (there are
driver-specific ioctls in i915 and vmgfx, though). Jesse Barnes has
posted an RFC to remedy this:
http://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg10415.html
GEM and PRIME
-------------
PRIME is a proof-of-concept implementation from Dave Airlie for sharing
GEM objects between drivers/devices: Buffer sharing is done with a list of
struct page pointers. While being shared, buffers can't be moved anymore.
No further buffer description is passed along in the kernel, format/layout
mediation is to be handled in userspace.
Blog-post describing the initial design for sharing buffers between an
integrated Intel igd and a discrete ATI gpu:
http://airlied.livejournal.com/71734.html
Other code using the same framework to render on an Intel igd and display
the framebuffer on a usb-connected displayport:
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-testing.git;a=shortlo…
GEM/KMS and fbdev
-----------------
There's some minimal support to emulate an fbdev with a gem/kms driver.
Resolution can't be changed and it's unaccelerated. There's been some
muttering once in a while to better integrate this with either a kms
kernel console driver or by routing fbdev resolution changes to kms.
But the main use case is to display a kernel oops, which works. For
everything else there's X (or an EGL client that understands kms).
-Daniel
--
Daniel Vetter
Mail: daniel(a)ffwll.ch
Mobile: +41 (0)79 365 57 48
Hi!
I just want to clarify some buffer object operations and terminology
that seem confusing to people and that are used by most modern GPU
drivers.
I think it's useful to be aware of this, going forward in the memory manager discussions.
Terminology:
Scanout buffer: A buffer that is used for continuous access by a
device. Needs to be permanently pinned.
Pinned buffer: A pinned buffer may not move and may not change backing
pages. This allows it to be mapped to a device.
Synchronization object: An object that is either in a signaled or non-signaled state. Signaled means that the device is done with the buffer, and has flushed its caches. A synchronization object has a device-specific part that may, for example, contain flushing state.
Basic device use of a buffer:
Scanout buffers (and perhaps also capture buffers?) are typically pinned.
Other buffers that are temporarily used by a GPU, for example by a video decoding engine or image processor, are typically *not* pinned. The usage pattern for submitting any commands that affect the buffer is as follows:
1) Take a mutex that stops the buffer from being moved. This mutex could be global (stops all buffers from being moved) or per-buffer.
2) Wait on any previous synchronization objects attached to the buffer, if those sync objects would not be implicitly signaled when the device executes its work. This is where it becomes bad to have a global mutex under 1).
3) Validate the buffer. This means setting up any missing (contiguous) device mappings or moving to VRAM, flushing cpu caches if necessary.
4) Patch up the device commands to reflect any movement of the buffer in 3). New offsets, SG-lists etc.
5) Submit the device commands.
6) Create a new synchronization object and attach it to the buffer.
7) Release the mutex taken in 1).
The buffer will not be moved until the synchronization object has signaled, and mappings set up under 3) will not be torn down until the memory manager receives a request to free up mapping resources.
I'd call this "Generation 2" device buffer management. (Intel (uses busy lists, no sync objects), Radeon, Nouveau, vmwgfx, New VIA)
"Generation 1" was using a global memory manager for pinned buffers (SiS, old VIA DRM drivers)
Generation 3 would be page based device MMUs with programmable apertures to access VRAM.
What we were discussing today is basically creating a unified gen 1 manager, with a new user-space interface.
/Thomas
DRM support for platform devices landed last year and was drastically
improved earlier this year. Qualcomm uses it for a really weak DRM
driver
that handles memory for X but does GPU and display through a different
interface. Feel free to flame me for that.. :).
https://www.codeaurora.org/gitweb/quic/la/?p=kernel/msm.git;a=blob;f=driver…
And I believe OMAP also has a solution somewhere (sorry, I couldn't find a URL).
Jordan