op 17-02-14 19:41, Christian König schreef:
> Am 17.02.2014 19:24, schrieb Rob Clark:
>> On Mon, Feb 17, 2014 at 12:36 PM, Christian König
>> <deathsimple(a)vodafone.de> wrote:
>>> Am 17.02.2014 18:27, schrieb Rob Clark:
>>>
>>>> On Mon, Feb 17, 2014 at 11:56 AM, Christian König
>>>> <deathsimple(a)vodafone.de> wrote:
>>>>> Am 17.02.2014 16:56, schrieb Maarten Lankhorst:
>>>>>
>>>>>> This type of fence can be used with hardware synchronization for simple
>>>>>> hardware that can block execution until the condition
>>>>>> (dma_buf[offset] - value) >= 0 has been met.
>>>>>
>>>>> Can't we make that just "dma_buf[offset] != 0" instead? As far as I know
>>>>> this way it would match the definition M$ uses in their WDDM
>>>>> specification
>>>>> and so make it much more likely that hardware supports it.
>>>> well 'buf[offset] >= value' at least means the same slot can be used
>>>> for multiple operations (with increasing values of 'value').. not sure
>>>> if that is something people care about.
>>>>
>>>>> =value seems to be possible with adreno and radeon. I'm not really sure
>>>>> about others (although I presume it as least supported for nv desktop
>>>>> stuff). For hw that cannot do >=value, we can either have a different fence
>>>>> implementation which uses the !=0 approach. Or change seqno-fence
>>>>> implementation later if needed. But if someone has hw that can do !=0 but
>>>>> not >=value, speak up now ;-)
>>>
>>> Here! Radeon can only do >=value on the DMA and 3D engine, but not with UVD
>>> or VCE. And for the 3D engine it means draining the pipe, which isn't really
>>> a good idea.
>> hmm, ok.. forgot you have a few extra rings compared to me. Is UVD
>> re-ordering from decode-order to display-order for you in hw? If not,
>> I guess you need sw intervention anyways when a frame is done for
>> frame re-ordering, so maybe hw->hw sync doesn't really matter as much
>> as compared to gpu/3d->display. For dma<->3d interactions, seems like
>> you would care more about hw<->hw sync, but I guess you aren't likely
>> to use GPU A to do a resolve blit for GPU B..
>
> No UVD isn't reordering, but since frame reordering is predictable you usually end up with pipelining everything to the hardware. E.g. you send the decode commands in decode order to the UVD block and if you have overlay active one of the frames are going to be the first to display and then you want to wait for it on the display side.
>
>> For 3D ring, I assume you probably want a CP_WAIT_FOR_IDLE before a
>> CP_MEM_WRITE to update fence value in memory (for the one signalling
>> the fence). But why would you need that before a CP_WAIT_REG_MEM (for
>> the one waiting for the fence)? I don't exactly have documentation
>> for adreno version of CP_WAIT_REG_{MEM,EQ,GTE}.. but PFP and ME
>> appear to be same instruction set as r600, so I'm pretty sure they
>> should have similar capabilities.. CP_WAIT_REG_MEM appears to be same
>> but with 32bit gpu addresses vs 64b.
>
> You shouldn't use any of the CP commands for engine synchronization (neither for wait nor for signal). The PFP and ME are just the top of a quite deep pipeline and when you use any of the CP_WAIT functions you block them for something and that's draining the pipeline.
>
> With the semaphore and fence commands the values are just attached as prerequisite to the draw command, e.g. the CP setups the draw environment and issues the command, but the actual execution of it is delayed until the "!= 0" condition hits. And in the meantime the CP already prepares the next draw operation.
>
> But at least for compute queues wait semaphore aren't the perfect solution either. What you need then is a GPU scheduler that uses a kernel task for setting up the command submission for you when all prerequisites are meet.
nouveau has sort of a scheduler in hardware. It can yield when waiting on a semaphore. And each process gets their own context and the timeslices can be adjusted. ;-) But I don't mind changing this patch when an actual user pops up. Nouveau can do a wait for (*sema & mask) != 0 only on nvc0 and newer, where mask can be chosen. But it can do == somevalue and >= somevalue on older relevant optimus hardware, so if we know that it was zero before and we know the sign of the new value that could work too.
Adding ops and a separate mask later on when users pop up is fine with me, the original design here was chosen so I could map the intel status page read-only into the process specific nvidia vm.
~Maarten
Hi,
We have a problem about how to manage cached dmabuf importer private
data, where to keep, how to reuse and how to clean up.
We want to keep some data in dmabuf importer side until an buffer is
free'ed actually since a buffer can be reused again later in that
importer subsystem so that that cache data doesn't have to be
regenerated. This can be considered as some kind of caching this data.
The scenario is:
(1) Exporter passes a dmabuf to Importer.
(2) Importer attaches a dev to a dmabuf.
(3) Importer generates some data for a buffer for its own use.
(4) Importer finishes its use of a buffer.
(5) Importer detaches a dev from a dmabuf.
(6) Again, Exporter passes a dmabuf fd to the same Importer.
(7) Again, Importer attaches a dev to a dmabuf.
(8) Importer wants to use the previously cached data from (2) without regenerating.
(9) Again, Importer detaches a dev from a dmabuf.
(10) Exporter free's a buffer along with a cached data from (2)/(8).
At first I considered to use attachmenet private data, but apparently
a life time of the attachment isn't equal to one of a buffer. A buffer
lives longer than an attachment. Also Neither private data from dmabuf
nor from attachment are for /Importer/. They are for Exporter's use
from the comment in the header file.
/**
* struct dma_buf - shared buffer object
....
* @priv: exporter specific private data for this buffer object.
*/
/**
* struct dma_buf_attachment - holds device-buffer attachment data
...
* @priv: exporter specific attachment data.
...
*/
This leads to the following 2 questions:
One question is how to clean up the cached data at (10) since there's
no way for Importer to trigger clean up at that time. I am
considering to embed an /notifier/ in dmabuf when it's called at
dmabuf release. Importer could register any callback in that
notifier. At least this requires a dmabuf to have an notifier to be
called at release. Does this sound acceptable? Or can we do the same
outside of dmabuf framework? If there's more appropriate way, please
let me know since I'm not so familier with drm side yet.
Another question is where to keep that cached data. Usually that data
is only valid within Impoter subsystem. So Imoporter could keep the
list of that data in it as a global list along with a dmabuf
pointer. When a dmabuf is imported, Importer can look up a global list
if it's already cached. This list needs to be kept till a buffer is
free'ed.
Those can be implemented in the dmabuf exporter backend but we want to
allow multiple allocators/exporters to do the same, and I want to
avoid having something related to importer in exporter side.
Any comment would be really appreciated.
Hi all!
Ok, I hope that this is the last update of the patches which add basic
support for dynamic allocation of memory reserved regions defined in
device tree.
This time I've mainly sliced the main patch into several smaller pieces
to make the changes easier to understand and fixes some minor coding
style issues.
The initial code for this feature were posted here [1], merged as commit
9d8eab7af79cb4ce2de5de39f82c455b1f796963 ("drivers: of: add
initialization code for dma reserved memory") and later reverted by
commit 1931ee143b0ab72924944bc06e363d837ba05063. For more information,
see [2]. Finally a new bindings has been proposed [3] and Josh
Cartwright a few days ago prepared some code which implements those
bindings [4]. This finally pushed me again to find some time to finish
this task and review the code. Josh agreed to give me the ownership of
this series to continue preparing them for mainline inclusion.
For more information please refer to the changlelog and links below.
[1]: http://lkml.kernel.org/g/1377527959-5080-1-git-send-email-m.szyprowski@sams…
[2]: http://lkml.kernel.org/g/1381476448-14548-1-git-send-email-m.szyprowski@sam…
[3]: http://lkml.kernel.org/g/20131030134702.19B57C402A0@trevor.secretlab.ca
[4]: http://thread.gmane.org/gmane.linux.documentation/19579
Changelog:
v5:
- sliced main patch into several smaller patches on Grant's request
- fixed coding style issues pointed by Grant
- use node->phandle value directly instead of parsing properties manually
v4: https://lkml.org/lkml/2014/2/20/150
- dynamic allocations are processed after all static reservations has been
done
- moved code for handling static reservations to drivers/of/fdt.c
- removed node matching by string comparison, now phandle values are used
directly
- moved code for DMA and CMA handling directly to
drivers/base/dma-{coherent,contiguous}.c
- added checks for proper #size-cells, #address-cells, ranges properties
in /reserved-memory node
- even more code cleanup
- added init code for ARM64 and PowerPC
v3: http://article.gmane.org/gmane.linux.documentation/20169/
- refactored memory reservation code, created common code to parse reg, size,
align, alloc-ranges properties
- added support for multiple tuples in 'reg' property
- memory is reserved regardless of presence of the driver for its compatible
- prepared arch specific hooks for memory reservation (defaults use memblock
calls)
- removed node matching by string during device initialization
- CMA init code: added checks for required region alignment
- more code cleanup here and there
v2: http://thread.gmane.org/gmane.linux.documentation/19870/
- removed copying of the node name
- split shared-dma-pool handling into separate files (one for CMA and one
for dma_declare_coherent based implementations) for making the code easier
to understand
- added support for AMBA devices, changed prototypes to use struct decice
instead of struct platform_device
- renamed some functions to better match other names used in drivers/of/
- restructured the rest of the code a bit for better readability
- added 'reusable' property to exmaple linux,cma node in documentation
- exclusive dma (dma_coherent) is used for only handling 'shared-dma-pool'
regions without 'reusable' property and CMA is used only for handling
'shared-dma-pool' regions with 'reusable' property.
v1: http://thread.gmane.org/gmane.linux.documentation/19579
- initial version prepared by Josh Cartwright
Summary:
Grant Likely (1):
of: document bindings for reserved-memory nodes
Marek Szyprowski (10):
drivers: of: add initialization code for static reserved memory
drivers: of: add initialization code for dynamic reserved memory
drivers: of: add support for custom reserved memory drivers
drivers: of: add automated assignment of reserved regions to client
devices
drivers: of: initialize and assign reserved memory to newly created
devices
drivers: dma-coherent: add initialization from device tree
drivers: dma-contiguous: add initialization from device tree
arm: add support for reserved memory defined by device tree
arm64: add support for reserved memory defined by device tree
powerpc: add support for reserved memory defined by device tree
.../bindings/reserved-memory/reserved-memory.txt | 138 ++++++++++
arch/arm/Kconfig | 1 +
arch/arm/mm/init.c | 2 +
arch/arm64/Kconfig | 1 +
arch/arm64/mm/init.c | 1 +
arch/powerpc/Kconfig | 1 +
arch/powerpc/kernel/prom.c | 3 +
drivers/base/dma-coherent.c | 41 +++
drivers/base/dma-contiguous.c | 130 +++++++--
drivers/of/Kconfig | 6 +
drivers/of/Makefile | 1 +
drivers/of/fdt.c | 134 +++++++++
drivers/of/of_reserved_mem.c | 291 ++++++++++++++++++++
drivers/of/platform.c | 7 +
include/asm-generic/vmlinux.lds.h | 11 +
include/linux/of_fdt.h | 3 +
include/linux/of_reserved_mem.h | 61 ++++
17 files changed, 810 insertions(+), 22 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
create mode 100644 drivers/of/of_reserved_mem.c
create mode 100644 include/linux/of_reserved_mem.h
--
1.7.9.5