Hello,
This is an updated version of the patch series introducing new
features to the DMA mapping subsystem that let drivers share allocated
buffers (preferably using the recently introduced dma_buf framework)
easily and efficiently.
The first extension is DMA_ATTR_NO_KERNEL_MAPPING attribute. It is
intended for use with dma_{alloc, mmap, free}_attrs functions. It can be
used to notify dma-mapping core that the driver will not use kernel
mapping for the allocated buffer at all, so the core can skip creating
it. This saves precious kernel virtual address space. Such a buffer can
be accessed from userspace after calling dma_mmap_attrs() for it (a
typical use case for multimedia buffers). The value returned by
dma_alloc_attrs() with this attribute should be treated as a DMA
cookie, which needs to be passed to the dma_mmap_attrs() and
dma_free_attrs() functions.
The second extension is required to let drivers share the buffers
allocated by the DMA-mapping subsystem. Right now the driver gets a DMA
address of the allocated buffer and the kernel virtual mapping for it.
If it wants to share it with another device (= map into its DMA address
space) it usually hacks around kernel virtual addresses to get pointers
to pages or assumes that both devices share the DMA address space. Both
solutions are just hacks for special cases, which should be avoided
in the final version of buffer sharing. To solve this issue in a generic
way, a new call to DMA mapping has been introduced - dma_get_sgtable().
It allocates a scatter-list which describes the allocated buffer and
lets the driver(s) use it with other device(s) by calling
dma_map_sg() on it.
The third extension solves the performance issues which we observed with
some advanced buffer sharing use cases, which require creating a DMA
mapping for the same memory buffer for more than one device. From the
DMA-mapping perspective this requires calling one of the
dma_map_{page,single,sg} functions for the given memory buffer several
times, once for each of the devices. Each dma_map_* call performs CPU cache
synchronization, which might be a time consuming operation, especially
when the buffers are large. We would like to avoid any useless and time
consuming operations, so that was the main reason for introducing
another attribute for the DMA-mapping subsystem: DMA_ATTR_SKIP_CPU_SYNC,
which lets the dma-mapping core skip CPU cache synchronization in certain
cases.
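To make the cost argument concrete, here is a rough userspace model of the idea. It is not kernel code: the flag, struct, and function names are made up for illustration, and the "sync" is only a counter. It assumes the driver lets the first mapping do the cache synchronization and passes the skip attribute for every further device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag standing in for DMA_ATTR_SKIP_CPU_SYNC. */
enum { MODEL_ATTR_SKIP_CPU_SYNC = 1u << 0 };

struct model_buffer {
	size_t size;
	unsigned int cache_syncs;	/* CPU cache syncs performed so far */
};

/* Toy stand-in for a dma_map_* call: syncs the cache unless told to skip. */
void model_dma_map(struct model_buffer *buf, unsigned int attrs)
{
	assert(buf != NULL);
	if (!(attrs & MODEL_ATTR_SKIP_CPU_SYNC))
		buf->cache_syncs++;
}

/*
 * Map one buffer for ndev devices and return the number of cache syncs.
 * With skip_redundant set, only the first mapping syncs the cache and
 * all later mappings pass the skip attribute.
 */
unsigned int model_map_for_devices(struct model_buffer *buf,
				   unsigned int ndev, int skip_redundant)
{
	unsigned int i;

	assert(buf != NULL);
	buf->cache_syncs = 0;
	for (i = 0; i < ndev; i++) {
		unsigned int attrs = 0;

		if (skip_redundant && i > 0)
			attrs |= MODEL_ATTR_SKIP_CPU_SYNC;
		model_dma_map(buf, attrs);
	}
	return buf->cache_syncs;
}
```

In this model, mapping a buffer for three devices costs three syncs without the attribute and one with it; for large buffers the saved syncs are where the time goes.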
The proposed patches have been rebased on the latest Linux kernel
v3.5-rc2 with 'ARM: replace custom consistent dma region with vmalloc'
patches applied (for more information, please refer to the
http://www.spinics.net/lists/arm-kernel/msg179202.html thread).
The patches together with all dependencies are also available on the
following GIT branch:
git://git.linaro.org/people/mszyprowski/linux-dma-mapping.git 3.5-rc2-dma-ext-v2
Best regards
Marek Szyprowski
Samsung Poland R&D Center
Changelog:
v2:
- rebased onto v3.5-rc2 and adapted for CMA and dma-mapping changes
- renamed dma_get_sgtable() to dma_get_sgtable_attrs() to match the convention
of the other dma-mapping calls with attributes
- added generic fallback function for dma_get_sgtable() for architectures with
simple dma-mapping implementations
v1: http://thread.gmane.org/gmane.linux.kernel.mm/78644
http://thread.gmane.org/gmane.linux.kernel.cross-arch/14435 (part 2)
- initial version
Patch summary:
Marek Szyprowski (6):
common: DMA-mapping: add DMA_ATTR_NO_KERNEL_MAPPING attribute
ARM: dma-mapping: add support for DMA_ATTR_NO_KERNEL_MAPPING
attribute
common: dma-mapping: introduce dma_get_sgtable() function
ARM: dma-mapping: add support for dma_get_sgtable()
common: DMA-mapping: add DMA_ATTR_SKIP_CPU_SYNC attribute
ARM: dma-mapping: add support for DMA_ATTR_SKIP_CPU_SYNC attribute
Documentation/DMA-attributes.txt | 42 ++++++++++++++++++
arch/arm/common/dmabounce.c | 1 +
arch/arm/include/asm/dma-mapping.h | 3 +
arch/arm/mm/dma-mapping.c | 69 ++++++++++++++++++++++++------
drivers/base/dma-mapping.c | 18 ++++++++
include/asm-generic/dma-mapping-common.h | 18 ++++++++
include/linux/dma-attrs.h | 2 +
include/linux/dma-mapping.h | 3 +
8 files changed, 142 insertions(+), 14 deletions(-)
--
1.7.1.569.g6f426
Currently, when freeing 0 order pages, CMA pages are treated
the same as regular movable pages, which means they end up
on the per-cpu page list. This means that the CMA pages are
likely to be allocated for something other than contiguous
memory. This increases the chance that the next alloc_contig_range
will fail because pages can't be migrated.
Given the size of the CMA region is typically limited, it is best to
optimize for success of alloc_contig_range as much as possible.
Do this by freeing CMA pages directly instead of putting them
on the per-cpu page lists.
Signed-off-by: Laura Abbott <lauraa(a)codeaurora.org>
---
mm/page_alloc.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e1c6f5..c9a6483 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1310,7 +1310,8 @@ void free_hot_cold_page(struct page *page, int cold)
 	 * excessively into the page allocator
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
-		if (unlikely(migratetype == MIGRATE_ISOLATE)) {
+		if (unlikely(migratetype == MIGRATE_ISOLATE)
+		    || is_migrate_cma(migratetype)) {
 			free_one_page(zone, page, 0, migratetype);
 			goto out;
 		}
--
1.7.8.3
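The effect of the one-line change can be sketched as a small userspace decision function. This is only a model under stated assumptions: the enum mirrors the ordering of the kernel's migrate types but the names are invented here, and the return value merely says where an order-0 page would end up.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical copy of the migrate types (illustrative only). */
enum model_migratetype {
	MODEL_UNMOVABLE,
	MODEL_RECLAIMABLE,
	MODEL_MOVABLE,
	MODEL_PCPTYPES,			/* number of types kept on per-cpu lists */
	MODEL_RESERVE = MODEL_PCPTYPES,
	MODEL_CMA,
	MODEL_ISOLATE,
};

enum model_free_path {
	MODEL_FREE_TO_PCP_LIST,		/* cached on the per-cpu page list */
	MODEL_FREE_DIRECT,		/* handed straight back to the allocator */
};

/*
 * Where does a freed order-0 page go? 'patched' selects the behaviour
 * with the extra is_migrate_cma() check applied.
 */
enum model_free_path model_free_order0(enum model_migratetype mt, bool patched)
{
	assert(mt <= MODEL_ISOLATE);
	if (mt >= MODEL_PCPTYPES) {
		if (mt == MODEL_ISOLATE || (patched && mt == MODEL_CMA))
			return MODEL_FREE_DIRECT;
		/* other non-pcp types are treated like movable pages */
	}
	return MODEL_FREE_TO_PCP_LIST;
}
```

Only the CMA row changes between the two variants: without the check a CMA page lands on the per-cpu list, with it the page is freed directly.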
Hello everyone,
These patches add support for DMABUF exporting to the V4L2 stack. The latest
support for DMABUF importing was posted in [1]. The exporter part is dependent
on the DMA mapping redesign [2], which is not yet merged into mainline. Therefore it
is posted as a separate patchset. Moreover, some patches depend on the vmap
extension for DMABUF by Dave Airlie [3] and the sg_alloc_table_from_pages function
[4].
Changelog:
v0: (RFC)
- updated setup of VIDIOC_EXPBUF ioctl
- doc updates
- introduced workaround to avoid using dma_get_pages,
- removed caching of exported dmabuf to avoid existence of circular reference
between dmabuf and vb2_dc_buf or resource leakage
- removed all 'change behaviour' patches
- initial support for exporting in s5p-mfc driver
- removal of vb2_mmap_pfn_range that is no longer used
- use sg_alloc_table_from_pages instead of creating sglist in vb2_dc code
- move attachment allocation to exporter's attach callback
[1] http://thread.gmane.org/gmane.linux.drivers.video-input-infrastructure/48730
[2] http://thread.gmane.org/gmane.linux.kernel.cross-arch/14098
[3] http://permalink.gmane.org/gmane.comp.video.dri.devel/69302
[4] This patchset is rebased on 3.4-rc1 plus the following patchsets:
Marek Szyprowski (1):
v4l: vb2-dma-contig: let mmap method to use dma_mmap_coherent call
Tomasz Stanislawski (11):
v4l: add buffer exporting via dmabuf
v4l: vb2: add buffer exporting via dmabuf
v4l: vb2-dma-contig: add setup of sglist for MMAP buffers
v4l: vb2-dma-contig: add support for DMABUF exporting
v4l: vb2-dma-contig: add vmap/kmap for dmabuf exporting
v4l: s5p-fimc: support for dmabuf exporting
v4l: s5p-tv: mixer: support for dmabuf exporting
v4l: s5p-mfc: support for dmabuf exporting
v4l: vb2: remove vb2_mmap_pfn_range function
v4l: vb2-dma-contig: use sg_alloc_table_from_pages function
v4l: vb2-dma-contig: Move allocation of dbuf attachment to attach cb
drivers/media/video/s5p-fimc/fimc-capture.c | 9 +
drivers/media/video/s5p-mfc/s5p_mfc_dec.c | 13 ++
drivers/media/video/s5p-mfc/s5p_mfc_enc.c | 13 ++
drivers/media/video/s5p-tv/mixer_video.c | 10 +
drivers/media/video/v4l2-compat-ioctl32.c | 1 +
drivers/media/video/v4l2-dev.c | 1 +
drivers/media/video/v4l2-ioctl.c | 6 +
drivers/media/video/videobuf2-core.c | 67 ++++++
drivers/media/video/videobuf2-dma-contig.c | 323 ++++++++++++++++++++++-----
drivers/media/video/videobuf2-memops.c | 40 ----
include/linux/videodev2.h | 26 +++
include/media/v4l2-ioctl.h | 2 +
include/media/videobuf2-core.h | 2 +
include/media/videobuf2-memops.h | 5 -
14 files changed, 411 insertions(+), 107 deletions(-)
--
1.7.9.5
On Thu, Jun 7, 2012 at 4:35 AM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
> The alternate is to not associate sync objects with buffers and
> have them be distinct entities, exposed to userspace. This gives
> userspace more power and flexibility and might allow for use-cases
> which an implicit synchronization mechanism can't satisfy - I'd
> be curious to know any specifics here.
Time and time again we've had problems with implicit synchronization
resulting in bugs where different drivers play by slightly different
implicit rules. We're convinced the best way to attack this problem
is to move as much of the command and control of synchronization as
possible into a single piece of code (the compositor in our case.) To
facilitate this we're going to be mandating this explicit approach in
the K release of Android.
> However, every driver which
> needs to participate in the synchronization mechanism will need
> to have its interface with userspace modified to allow the sync
> objects to be passed to the drivers. This seemed like a lot of
> work to me, which is why I prefer the implicit approach. However
> I don't actually know what work is needed and think it should be
> explored. I.e. How much work is it to add explicit sync object
> support to the DRM & v4l2 interfaces?
>
> E.g. I believe DRM/GEM's job dispatch API is "in-order"
> in which case it might be easy to just add "wait for this fence"
> and "signal this fence" ioctls. Seems like vmwgfx already has
> something similar to this already? Could this work over having
> to specify a list of sync objects to wait on and another list
> of sync objects to signal for every operation (exec buf/page
> flip)? What about for v4l2?
If I understand you right, a job submission with explicit sync would
become 3 submissions:
1) submit wait for pre-req fence job
2) submit render job
3) submit signal ready fence job
Does DRM provide a way to ensure these 3 jobs are submitted
atomically? I also expect GPU vendors would like to get clever about
GPU to GPU fence dependencies. That could probably be handled
entirely in the userspace GL driver.
> I guess my other thought is that implicit vs explicit is not
> mutually exclusive, though I'd guess there'd be interesting
> deadlocks to have to debug if both were in use _at the same
> time_. :-)
I think this is an approach worth investigating. I'd like a way to
either opt out of implicit sync or have a way to check if a dma-buf
has an attached fence and detach it. Actually, that could work really
well. Consider:
* Each dma_buf has a single fence "slot"
* on submission
  * the driver will extract the fence from the dma_buf and queue a wait on it.
  * the driver will replace that fence with its own completion
    fence before the job submission ioctl returns.
* dma_buf will have two userspace ioctls:
  * DETACH: will return the fence as an FD to userspace and clear the
    fence slot in the dma_buf
  * ATTACH: takes a fence FD from userspace and attaches it to the
    dma_buf fence slot. Returns an error if the fence slot is non-empty.
In the android case, we can do a detach after every submission and an
attach right before.
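The slot semantics above can be sketched as a tiny userspace model. Everything here is hypothetical: the struct and function names are invented, fences are plain pointers rather than FDs, and -EBUSY is just one plausible choice for the "slot is non-empty" error, which the proposal does not name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_fence {
	int id;
};

struct model_dma_buf {
	struct model_fence *fence;	/* the single fence "slot"; NULL if empty */
};

/* DETACH: hand the current fence to the caller and clear the slot. */
struct model_fence *model_detach(struct model_dma_buf *buf)
{
	struct model_fence *f;

	assert(buf != NULL);
	f = buf->fence;
	buf->fence = NULL;
	return f;
}

/* ATTACH: fill an empty slot; fail if a fence is already attached. */
int model_attach(struct model_dma_buf *buf, struct model_fence *f)
{
	assert(buf != NULL);
	if (buf->fence)
		return -EBUSY;		/* assumed error code */
	buf->fence = f;
	return 0;
}

/*
 * Job submission: return the fence the job has to wait on (NULL if none)
 * and install the job's own completion fence in the slot.
 */
struct model_fence *model_submit(struct model_dma_buf *buf,
				 struct model_fence *completion)
{
	struct model_fence *wait_on;

	assert(buf != NULL);
	wait_on = buf->fence;
	buf->fence = completion;
	return wait_on;
}
```

In the Android flow described above, userspace would call the ATTACH equivalent right before a submission and the DETACH equivalent right after it, so the slot is empty between jobs and all ordering is decided in the compositor.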
-Erik
Hey Erik,
On 07-06-12 19:35, Erik Gilling wrote:
> On Thu, Jun 7, 2012 at 1:55 AM, Maarten Lankhorst
> <m.b.lankhorst(a)gmail.com> wrote:
>> I haven't looked at intel and amd, but from a quick glance
>> it seems like they already implement fencing too, so just
>> some way to synch up the fences on shared buffers seems
>> like it could benefit all graphics drivers and the whole
>> userspace synching could be done away with entirely.
> It's important to have some level of userspace API so that GPU
> generated graphics can participate in the graphics pipeline. Think of
> the case where you have a software video codec streaming textures into
> the GPU. It needs to know when the GPU is done with those textures so
> it can reuse the buffer.
>
In the graphics case this problem already has to be handled without
dma-buf, so adding any extra synchronization api for userspace
that is only used when the bo is shared is a waste.
I do agree you need some way to synch userspace though, but I
think adding a new api for userspace is not the way to go.
Cheers,
Maarten
PS: re-added cc's that seem to have fallen off from your mail.
Tom,
Is there more planned for KDS? It seems to be lacking some
important features to be useful across many SoCs and graphics cards
and features needed by Android. Here's some general feedback on those
gaps.
There is no way to share information between a buffer provider and a
buffer consumer. This is important for architectures such as Tegra
which have several hardware blocks that share common hardware
synchronization.
There's no userspace API. There are several reasons this is
necessary. First, some userspace code (such as GL libs) might need to
get at the private data of the sync primitive in order to generate
command lists for a piece of hardware. Second, it does not let
userspace have control or even visibility into the graphics pipeline.
The direction we are moving in Android is to put more control over
synchronization into the compositor and move it out of being
implemented "behind the scenes" by every vendor. Third, there's no way
for a userspace process to wait on a sync primitive.
There's no debugging or timing information tracked with the sync
primitives. During development on new platforms and new OS versions
we often have cases where the graphics pipeline stops making forward
progress because one of the pieces (GPU, display, camera, dsp,
userspace) has, itself, stopped making forward progress. Finding the
root cause of the often hard to reproduce cases is difficult when you
have to instrument every single driver.
It's unclear how you would attach a dependency on an EGL fence to a
dma_buf. Maybe this would be an EGL extension where you pass in the
fence and the dma_buf.
At Android we've been working on our own approach to this problem.
I'll post those patches for discussion.
Cheers,
Erik
Hello everyone,
This patchset adds support for DMABUF [2] importing to V4L2 stack.
The support for DMABUF exporting was moved to separate patchset
due to dependency on patches for DMA mapping redesign by
Marek Szyprowski [4].
v6:
- fixed missing entry in v4l2_memory_names
- fixed a bug occurring after get_user_pages failure
- fixed a bug caused by using invalid vma for get_user_pages
- prepare/finish no longer call dma_sync for dmabuf buffers
v5:
- removed change of importer/exporter behaviour
- fixes to vb2_dc_pages_to_sgt based on Laurent's hints
- changed pin/unpin words to lock/unlock in Doc
v4:
- rebased on mainline 3.4-rc2
- included missing importing support for s5p-fimc and s5p-tv
- added patch for changing map/unmap for importers
- fixes to Documentation part
- coding style fixes
- pairing {map/unmap}_dmabuf in vb2-core
- fixing variable types and semantics of arguments in videobuf2-dma-contig.c
v3:
- rebased on mainline 3.4-rc1
- split 'code refactor' patch to multiple smaller patches
- squashed fixes to Sumit's patches
- patchset is no longer dependent on 'DMA mapping redesign'
- separated path for handling IO and non-IO mappings
- add documentation for DMABUF importing to V4L
- removed all DMABUF exporter related code
- removed usage of dma_get_pages extension
v2:
- extended VIDIOC_EXPBUF argument from integer memoffset to struct
v4l2_exportbuffer
- added patch that breaks DMABUF spec on (un)map_attachment callbacks but allows
to work with existing implementation of DMABUF prime in DRM
- all dma-contig code refactoring patches were squashed
- bugfixes
v1: List of changes since [1].
- support for DMA api extension dma_get_pages, the function is used to retrieve
pages used to create DMA mapping.
- small fixes/code cleanup to videobuf2
- added prepare and finish callbacks to vb2 allocators, they are used to keep
consistency between DMA and CPU access to the memory (by Marek Szyprowski)
- support for exporting of DMABUF buffer in V4L2 and Videobuf2, originated from
[3].
- support for dma-buf exporting in vb2-dma-contig allocator
- support for DMABUF for s5p-tv and s5p-fimc (capture interface) drivers,
originated from [3]
- changed handling for userptr buffers (by Marek Szyprowski, Andrzej
Pietrasiewicz)
- let the mmap method use the dma_mmap_writecombine call (by Marek Szyprowski)
[1] http://thread.gmane.org/gmane.linux.drivers.video-input-infrastructure/4296…
[2] https://lkml.org/lkml/2011/12/26/29
[3] http://thread.gmane.org/gmane.linux.drivers.video-input-infrastructure/3635…
[4] http://thread.gmane.org/gmane.linux.kernel.cross-arch/12819
Laurent Pinchart (2):
v4l: vb2-dma-contig: Shorten vb2_dma_contig prefix to vb2_dc
v4l: vb2-dma-contig: Reorder functions
Marek Szyprowski (2):
v4l: vb2: add prepare/finish callbacks to allocators
v4l: vb2-dma-contig: add prepare/finish to dma-contig allocator
Sumit Semwal (4):
v4l: Add DMABUF as a memory type
v4l: vb2: add support for shared buffer (dma_buf)
v4l: vb: remove warnings about MEMORY_DMABUF
v4l: vb2-dma-contig: add support for dma_buf importing
Tomasz Stanislawski (5):
Documentation: media: description of DMABUF importing in V4L2
v4l: vb2-dma-contig: Remove unneeded allocation context structure
v4l: vb2-dma-contig: add support for scatterlist in userptr mode
v4l: s5p-tv: mixer: support for dmabuf importing
v4l: s5p-fimc: support for dmabuf importing
Documentation/DocBook/media/v4l/compat.xml | 4 +
Documentation/DocBook/media/v4l/io.xml | 179 +++++++
.../DocBook/media/v4l/vidioc-create-bufs.xml | 1 +
Documentation/DocBook/media/v4l/vidioc-qbuf.xml | 15 +
Documentation/DocBook/media/v4l/vidioc-reqbufs.xml | 45 +-
drivers/media/video/s5p-fimc/Kconfig | 1 +
drivers/media/video/s5p-fimc/fimc-capture.c | 2 +-
drivers/media/video/s5p-tv/Kconfig | 1 +
drivers/media/video/s5p-tv/mixer_video.c | 2 +-
drivers/media/video/v4l2-ioctl.c | 1 +
drivers/media/video/videobuf-core.c | 4 +
drivers/media/video/videobuf2-core.c | 207 +++++++-
drivers/media/video/videobuf2-dma-contig.c | 520 +++++++++++++++++---
include/linux/videodev2.h | 7 +
include/media/videobuf2-core.h | 34 ++
15 files changed, 924 insertions(+), 99 deletions(-)
--
1.7.9.5
Hey,
For intel/nouveau hybrid graphics I'm interested in this since it
would allow me to synchronize between intel and nvidia cards
without waiting for rendering to complete.
I'm worried about the api though, nouveau and intel already
have existing infrastructure to deal with fencing so exposing
additional ioctl's will complicate the implementation. Would
it be possible to never expose this interface to userspace
but keep it inside the kernel only?
nouveau_gem_ioctl_pushbuf is what's used for nouveau.
If any dmabuf synch framework could hook into that then
userspace would never have to act differently on shared bo's.
I haven't looked at intel and amd, but from a quick glance
it seems like they already implement fencing too, so just
some way to synch up the fences on shared buffers seems
like it could benefit all graphics drivers and the whole
userspace synching could be done away with entirely.
Cheers,
Maarten
On Wed, Jun 6, 2012 at 6:33 AM, John Reitan <john.reitan(a)arm.com> wrote:
>> But maybe instead of inventing something new, we can just use 'struct
>> kthread_work' instead of 'struct kds_callback' plus the two 'void *'s?
>> If the user needs some extra args they can embed 'struct
>> kthread_work' in their own struct and use container_of() magic in the
>> cb.
>>
>> Plus this is a natural fit if you want to dispatch callbacks instead
>> on a kthread_worker, which seems like it would simplify a few things
>> when it comes to deadlock avoidance.. ie., not resource deadlock
>> avoidance, but dispatching callbacks when some lock is held.
>
> That sounds like a better approach.
> Will make a cleaner API, will look into it.
When Tom visited us for android graphics camp in the fall he argued
that there were cases where we would want to avoid an extra schedule.
Consider the case where the GPU is waiting for a render buffer that
the display controller is using. If that render can be kicked off w/o
acquiring locks, the display's vsync IRQ handler can call release,
which in turn calls the GPU callback, which in turn kicks off the
render very quickly w/o having to leave IRQ context.
One way around the locking issue with callbacks/async wait is to have
async wait return a value to indicate that the resource has been
acquired instead of calling the callback. This is the approach I
chose in our sync framework.
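A minimal userspace sketch of that return-value convention follows. It is not the actual sync framework API: all names are invented, the resource holds at most one queued waiter, and the return codes (1 = acquired now, 0 = queued, -1 = queue full) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*model_cb)(void *data);

struct model_resource {
	int busy;		/* non-zero while someone owns the resource */
	int has_pending;	/* toy model: at most one queued waiter */
	model_cb pending_cb;
	void *pending_data;
};

/*
 * Returns 1 if the resource was acquired immediately (the callback is
 * NOT invoked, so the caller can proceed under its own locking rules),
 * 0 if the waiter was queued, -1 if the single queue slot is taken.
 */
int model_async_wait(struct model_resource *res, model_cb cb, void *data)
{
	assert(res != NULL && cb != NULL);
	if (!res->busy) {
		res->busy = 1;
		return 1;
	}
	if (res->has_pending)
		return -1;
	res->pending_cb = cb;
	res->pending_data = data;
	res->has_pending = 1;
	return 0;
}

/* Release: pass ownership to a queued waiter via its callback, if any. */
void model_release(struct model_resource *res)
{
	assert(res != NULL);
	if (res->has_pending) {
		model_cb cb = res->pending_cb;
		void *data = res->pending_data;

		res->has_pending = 0;
		cb(data);	/* resource stays busy, now owned by the waiter */
		return;
	}
	res->busy = 0;
}

/* Example callback for the sketch: counts how often it was called. */
void model_count_cb(void *data)
{
	(*(int *)data)++;
}
```

The point of the convention is that the callback only ever runs from the release path, never from inside the wait call itself, so a caller holding a lock while calling the wait function does not have to worry about its own callback re-entering under that lock.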
-Erik