[RFC] ARM DMA mapping TODO, v1

List overview All Threads
Download

newer

older

Requirements for Memory Management

Re: [Linaro-mm-sig] [RFC] ARM DMA...

Arnd Bergmann

21 Apr 2011 21 Apr '11

7:29 p.m.

I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

1. Fix the arm version of dma_alloc_coherent. It's in use today and is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

2. Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

3. Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

4. Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

5. Find a way to define per-device IOMMUs, if that is not actually possible already. We had conflicting statements for this.

6. Implement iommu_ops for each of the ARM platforms that has an IOMMU. Needs some modifications for MSM and a rewrite for OMAP. Implementation for Samsung is under work.

7. Extend the dma_map_ops to have a way for mapping a buffer from dma_alloc_{non,}coherent into user space. We have not discussed that yet, but after thinking this for some time, I believe this would be the right approach to map buffers into user space from code that doesn't care about the underlying hardware.

After all these are in place, building anything on top of dma_alloc_{non,}coherent should be much easier. The question of passing buffers between V4L and DRM is still completely unsolved as far as I can tell, but that discussion might become more focused if we can agree on the above points and assume that it will be done.

I expect that I will have to update the list above as people point out mistakes in my assumptions.

Arnd

Show replies by date

Jesse Barnes

21 Apr 21 Apr

8:09 p.m.

On Thu, 21 Apr 2011 21:29:16 +0200 Arnd Bergmann arnd@arndb.de wrote:

...

I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

Fix the arm version of dma_alloc_coherent. It's in use today and is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

I don't think the DMA mapping and allocation APIs are sufficient for high performance graphics at least. It's fairly common to allocate a bunch of buffers necessary to render a scene, build up a command buffer that references them, then hand the whole thing off to the kernel to execute at once on the GPU. That allows for a lot of extra efficiency, since it allows you to batch the MMU binding until execution occurs (or even put it off entirely until the page is referenced by the GPU in the case of faulting support). It's also necessary to avoid livelocks between two clients trying to render; if mapping is incremental on both sides, it's possible that neither will be able to make forward progress due to IOMMU space exhaustion.

So that argues for separating allocation from mapping both on the user side (which I think everyone agrees on) as well as on the kernel side, both for CPU access (which some drivers won't need) and for GPU access.

-- Jesse Barnes, Intel Open Source Technology Center

Zach Pfeffer

9:52 p.m.

...

Arnd Bergmann arnd@arndb.de wrote:

...
I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

Fix the arm version of dma_alloc_coherent. It's in use today and

is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

Implement dma_alloc_noncoherent on ARM. Marek pointed out

that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

Convert ARM to use asm-generic/dma-mapping-common.h. We need

both IOMMU and direct mapped DMA on some machines.

I don't think the DMA mapping and allocation APIs are sufficient for high performance graphics at least. It's fairly common to allocate a bunch of buffers necessary to render a scene, build up a command buffer that references them, then hand the whole thing off to the kernel to execute at once on the GPU. That allows for a lot of extra efficiency, since it allows you to batch the MMU binding until execution occurs (or even put it off entirely until the page is referenced by the GPU in the case of faulting support). It's also necessary to avoid livelocks between two clients trying to render; if mapping is incremental on both sides, it's possible that neither will be able to make forward progress due to IOMMU space exhaustion.

So that argues for separating allocation from mapping both on the user side (which I think everyone agrees on) as well as on the kernel side, both for CPU access (which some drivers won't need) and for GPU access.

I agree with Jesse that the separation of mapping from allocation is central to the current usage models. I realize most people didn't like VCMM, but it provided an abstraction for this - if software can handle the multiple mapper approach in a rational way across ARM than we can solve a lot of problems with all the map and unmap current solutions and we don't have to hack in coherency.

...

-- Jesse Barnes, Intel Open Source Technology Center

Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

KyongHo Cho

22 Apr 22 Apr

12:34 a.m.

On Fri, Apr 22, 2011 at 6:52 AM, Zach Pfeffer zach.pfeffer@linaro.org wrote:

...

I agree with Jesse that the separation of mapping from allocation is central to the current usage models. I realize most people didn't like VCMM, but it provided an abstraction for this - if software can handle the multiple mapper approach in a rational way across ARM than we can solve a lot of problems with all the map and unmap current solutions and we don't have to hack in coherency.

Hi. I've also noticed that VCMM is the reasonable idea for IOMMU mappings. We often deal with physical memory blocks to map multiple way. Allocation of physical memory itself is also important for some peripheral devices because it is beneficial to get larger page frame for their performance.

IOMMU api does not provide virtual memory management. DMA api is not flexible for all our use-cases.

Regards, KyongHo.

Arnd Bergmann

26 Apr 26 Apr

2:29 p.m.

On Friday 22 April 2011, KyongHo Cho wrote:

...

IOMMU api does not provide virtual memory management. DMA api is not flexible for all our use-cases.

We can fix either problem by changing the existing interfaces, which is much more maintainable in the long run than adding a third one.

Arnd

Arnd Bergmann

2:28 p.m.

On Thursday 21 April 2011, Zach Pfeffer wrote:

...

I agree with Jesse that the separation of mapping from allocation is central to the current usage models. I realize most people didn't like VCMM, but it provided an abstraction for this - if software can handle the multiple mapper approach in a rational way across ARM than we can solve a lot of problems with all the map and unmap current solutions and we don't have to hack in coherency.

Any solution we come up with needs to work on only across ARM, but also across other architectures. Fortunately, most have less weird constraints, for instance some architecture have no concept of uncached mappings (and don't need them), or there might be DMA ordering settings that we have not yet seen on ARM.

Arnd

Arnd Bergmann

2:26 p.m.

On Thursday 21 April 2011, Jesse Barnes wrote:

...

On Thu, 21 Apr 2011 21:29:16 +0200 Arnd Bergmann arnd@arndb.de wrote:

...
I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

Fix the arm version of dma_alloc_coherent. It's in use today and is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

I don't think the DMA mapping and allocation APIs are sufficient for high performance graphics at least. It's fairly common to allocate a bunch of buffers necessary to render a scene, build up a command buffer that references them, then hand the whole thing off to the kernel to execute at once on the GPU. That allows for a lot of extra efficiency, since it allows you to batch the MMU binding until execution occurs (or even put it off entirely until the page is referenced by the GPU in the case of faulting support). It's also necessary to avoid livelocks between two clients trying to render; if mapping is incremental on both sides, it's possible that neither will be able to make forward progress due to IOMMU space exhaustion.

So that argues for separating allocation from mapping both on the user side (which I think everyone agrees on) as well as on the kernel side, both for CPU access (which some drivers won't need) and for GPU access.

I don't thing that this argument has anything to do with what the underlying API should be, right? I can see this built on top of either the dma-mapping headers with extensions to map potentially uncached pages, and with the iommu API. Neither way would however save us from implementing the three items listed above.

It's certainly a good point to note that we should have a way to allocate pages for a device without mapping them into any address space right away. My feeling is still that the dma mapping API is the right place for this, because it is the only part of the kernel that has knowledge about whether a device needs uncached memory for coherent access, under what constraints it can map noncontiguous memory into its own address space, and what its addressing capabilities are (dma mask).

Arnd

Jesse Barnes

3:39 p.m.

On Tue, 26 Apr 2011 16:26:19 +0200 Arnd Bergmann arnd@arndb.de wrote:

...

I don't thing that this argument has anything to do with what the underlying API should be, right? I can see this built on top of either the dma-mapping headers with extensions to map potentially uncached pages, and with the iommu API. Neither way would however save us from implementing the three items listed above.

Or simply extending the DMA mapping API to allow for allocations without mapping. I was just worried you had a more traditional driver model in mind (e.g. coherent alloc on the ring buffer, single mappings for data buffers, all mapped in the kernel driver at allocation time).

The DMA API does have some advantages, in that arches already support it, there's some infrastructure for handling per-bus mapping, etc., so building on top of it is probably a good idea.

...

It's certainly a good point to note that we should have a way to allocate pages for a device without mapping them into any address space right away. My feeling is still that the dma mapping API is the right place for this, because it is the only part of the kernel that has knowledge about whether a device needs uncached memory for coherent access, under what constraints it can map noncontiguous memory into its own address space, and what its addressing capabilities are (dma mask).

Right. Sometimes a device or platform can handle either cached or uncached though, and we need userspace to decide on the best type for performance reasons.

-- Jesse Barnes, Intel Open Source Technology Center

Russell King - ARM Linux

27 Apr 27 Apr

7:35 a.m.

On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

Fix the arm version of dma_alloc_coherent. It's in use today and is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

I also suggested various approaches and produced patches, which I'm slowly feeding in. However, I think whatever we do, we'll end up breaking something along the line - especially as various places assume that dma_alloc_coherent() is ultimately backed by memory with a struct page.

...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

...

Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

dma_map_ops design is broken - we can't have the entire DMA API indirected through that structure. Whether you have an IOMMU or not is completely independent of whether you have to do DMA cache handling. Moreover, with dmabounce, having the DMA cache handling in place doesn't make sense.

So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't work like that.

I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable for ARM.

Arnd Bergmann

8:56 a.m.

On Wednesday 27 April 2011, Russell King - ARM Linux wrote:

...

...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

The main use case that I can see for dma_alloc_noncoherent is being able to allocate a large cacheable memory chunk that is mapped contiguous into both kernel virtual and bus virtual space, but not necessarily in contiguous in physical memory.

Without an IOMMU, I agree that it is pointless, because the only sensible imlpementation would be alloc_pages_exact + dma_map_single.

...

...

Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

dma_map_ops design is broken - we can't have the entire DMA API indirected through that structure. Whether you have an IOMMU or not is completely independent of whether you have to do DMA cache handling. Moreover, with dmabounce, having the DMA cache handling in place doesn't make sense.

So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't work like that.

I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable for ARM.

We probably still need to handle both the coherent and noncoherent case in each dma_map_ops implementation, at least for those combinations where they matter (definitely the linear mapping). However, I think that using dma_mapping_common.h would let us use an architecture-independent dma_map_ops for the generic iommu code that Marek wants to introduce now.

I still don't understand how dmabounce works, but if it's similar to swiotlb, we can have at least three different dma_map_ops: linear, dmabounce and iommu.

Without the common iommu abstraction, there would be a bigger incentive to go with dma_map_ops, because then we would need one operations structure per IOMMU implementation, as some other architectures (x86, powerpc, ia64, ...) have. If we only need to distinguish between the common linear mapping code and the common iommu code, then you are right and we are likely better off adding some more conditionals to the existing code to handle the iommu case in addition to the ones we handle today.

Arnd

Russell King - ARM Linux

9:09 a.m.

On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:

...

We probably still need to handle both the coherent and noncoherent case in each dma_map_ops implementation, at least for those combinations where they matter (definitely the linear mapping). However, I think that using dma_mapping_common.h would let us use an architecture-independent dma_map_ops for the generic iommu code that Marek wants to introduce now.

The 'do we have an iommu or not' question and the 'do we need to do cache coherency' question are two independent questions which are unrelated to each other. There are four unique but equally valid combinations.

Pushing the cache coherency question down into the iommu stuff will mean that we'll constantly be fighting against the 'but this iommu works on x86' shite that we've fought with over block device crap for years. I have no desire to go there.

What we need is a proper abstraction where the DMA ops can say whether they can avoid DMA cache handling (eg, swiotlb or dmabounce stuff) but default to DMA cache handling being the norm - and the DMA cache handling performed in the level above the DMA ops indirection.

Anything else is asking for an endless stream of shite iommu stuff getting DMA cache handling wrong.

Arnd Bergmann

11:02 a.m.

On Wednesday 27 April 2011, Russell King - ARM Linux wrote:

...

On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:

...
We probably still need to handle both the coherent and noncoherent case in each dma_map_ops implementation, at least for those combinations where they matter (definitely the linear mapping). However, I think that using dma_mapping_common.h would let us use an architecture-independent dma_map_ops for the generic iommu code that Marek wants to introduce now.

The 'do we have an iommu or not' question and the 'do we need to do cache coherency' question are two independent questions which are unrelated to each other. There are four unique but equally valid combinations.

Pushing the cache coherency question down into the iommu stuff will mean that we'll constantly be fighting against the 'but this iommu works on x86' shite that we've fought with over block device crap for years. I have no desire to go there.

Ok, I see. I believe we could avoid having to fight with the people that only care about coherent architectures if we just have two separate implementations of dma_map_ops in the iommu code, one for coherent and one for noncoherent DMA. Any architecture that only needs one of them would then only enable the Kconfig options for that implementation and not care about the other one.

...

What we need is a proper abstraction where the DMA ops can say whether they can avoid DMA cache handling (eg, swiotlb or dmabounce stuff) but default to DMA cache handling being the norm - and the DMA cache handling performed in the level above the DMA ops indirection.

Yes, that sounds definitely possible. I guess it could be as simple as having a flag somewhere in struct device if we want to make it architecture independent.

As for making the default being to do cache handling, I'm not completely sure how that would work on architectures where most devices are coherent. If I understood the DRM people correctly, some x86 machine have noncoherent DMA in their GPUs while everything else is coherent.

Maybe we can default to arch_is_coherent() and allow a device to override that when it knows better.

Arnd

Alex Deucher

4:16 p.m.

On Wed, Apr 27, 2011 at 7:02 AM, Arnd Bergmann arnd@arndb.de wrote:

...

On Wednesday 27 April 2011, Russell King - ARM Linux wrote:

...
On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:

...
We probably still need to handle both the coherent and noncoherent case in each dma_map_ops implementation, at least for those combinations where they matter (definitely the linear mapping). However, I think that using dma_mapping_common.h would let us use an architecture-independent dma_map_ops for the generic iommu code that Marek wants to introduce now.

The 'do we have an iommu or not' question and the 'do we need to do cache coherency' question are two independent questions which are unrelated to each other. There are four unique but equally valid combinations.

Pushing the cache coherency question down into the iommu stuff will mean that we'll constantly be fighting against the 'but this iommu works on x86' shite that we've fought with over block device crap for years. I have no desire to go there.

Ok, I see. I believe we could avoid having to fight with the people that only care about coherent architectures if we just have two separate implementations of dma_map_ops in the iommu code, one for coherent and one for noncoherent DMA. Any architecture that only needs one of them would then only enable the Kconfig options for that implementation and not care about the other one.

...
What we need is a proper abstraction where the DMA ops can say whether they can avoid DMA cache handling (eg, swiotlb or dmabounce stuff) but default to DMA cache handling being the norm - and the DMA cache handling performed in the level above the DMA ops indirection.

Yes, that sounds definitely possible. I guess it could be as simple as having a flag somewhere in struct device if we want to make it architecture independent.

As for making the default being to do cache handling, I'm not completely sure how that would work on architectures where most devices are coherent. If I understood the DRM people correctly, some x86 machine have noncoherent DMA in their GPUs while everything else is coherent.

On radeon hardware at least the on chip gart mechanism supports both snooped cache coherent pages and uncached, non-snooped pages.

Alex

...

Maybe we can default to arch_is_coherent() and allow a device to override that when it knows better.

Arnd

Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

Anca Emanuel

5:44 p.m.

@Russell: contact Linaro. They need you. Make sure their effort is the right thing to do.

Russell King - ARM Linux

8:27 p.m.

On Wed, Apr 27, 2011 at 08:44:38PM +0300, Anca Emanuel wrote:

...

@Russell: contact Linaro. They need you. Make sure their effort is the right thing to do.

Sorry, don't understand your message. I don't do brief twitter-like impossible to comprehend messages.

Russell King - ARM Linux

8:16 p.m.

On Wed, Apr 27, 2011 at 01:02:43PM +0200, Arnd Bergmann wrote:

...

On Wednesday 27 April 2011, Russell King - ARM Linux wrote:

...
On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:

...
We probably still need to handle both the coherent and noncoherent case in each dma_map_ops implementation, at least for those combinations where they matter (definitely the linear mapping). However, I think that using dma_mapping_common.h would let us use an architecture-independent dma_map_ops for the generic iommu code that Marek wants to introduce now.

The 'do we have an iommu or not' question and the 'do we need to do cache coherency' question are two independent questions which are unrelated to each other. There are four unique but equally valid combinations.

Pushing the cache coherency question down into the iommu stuff will mean that we'll constantly be fighting against the 'but this iommu works on x86' shite that we've fought with over block device crap for years. I have no desire to go there.

Ok, I see. I believe we could avoid having to fight with the people that only care about coherent architectures if we just have two separate implementations of dma_map_ops in the iommu code, one for coherent and one for noncoherent DMA. Any architecture that only needs one of them would then only enable the Kconfig options for that implementation and not care about the other one.

But then we have to invent yet another whole new API to deal with the cache coherency issues - which makes for more documentation, and eventually more abuse because it won't quite do what architectures want it to do, etc.

...

Yes, that sounds definitely possible. I guess it could be as simple as having a flag somewhere in struct device if we want to make it architecture independent.

I was referring to a flag in the dma_ops to say whether the DMA ops implementation requires DMA cache coherency. In the case of swiotlb, performing full DMA cache coherency is a pure waste of CPU cycles - and probably makes DMA much more expensive than merely switching back to using PIO.

I'm really not interested in producing "generic" interfaces which end up throwing the baby out with the bath water when we already have a better implementation in place - even if the hardware sucks. That's not forward progress as far as I'm concerned.

...

As for making the default being to do cache handling, I'm not completely sure how that would work on architectures where most devices are coherent. If I understood the DRM people correctly, some x86 machine have noncoherent DMA in their GPUs while everything else is coherent.

Well, it sounds like struct device needs a flag to indicate whether it is coherent or not - but exactly how this gets set seems to be architecture dependent. I don't see bus or driver code being able to make the necessary decisions - eg, tulip driver on x86 would be coherent, but tulip driver on ARM would be non-coherent.

Nevertheless, doing it on a per-device basis is definitely the right answer.

Arnd Bergmann

8:21 p.m.

On Wednesday 27 April 2011 22:16:05 Russell King - ARM Linux wrote:

...

...
As for making the default being to do cache handling, I'm not completely sure how that would work on architectures where most devices are coherent. If I understood the DRM people correctly, some x86 machine have noncoherent DMA in their GPUs while everything else is coherent.

Well, it sounds like struct device needs a flag to indicate whether it is coherent or not - but exactly how this gets set seems to be architecture dependent. I don't see bus or driver code being able to make the necessary decisions - eg, tulip driver on x86 would be coherent, but tulip driver on ARM would be non-coherent.

Nevertheless, doing it on a per-device basis is definitely the right answer.

The flag would not get set by the driver that uses the device but the driver that found it, e.g. the PCI bus or the platform code, which should know about these things and also install the appropriate iommu or mapping operations.

Arnd

Russell King - ARM Linux

8:26 p.m.

On Wed, Apr 27, 2011 at 10:21:48PM +0200, Arnd Bergmann wrote:

...

On Wednesday 27 April 2011 22:16:05 Russell King - ARM Linux wrote:

...
...
As for making the default being to do cache handling, I'm not completely sure how that would work on architectures where most devices are coherent. If I understood the DRM people correctly, some x86 machine have noncoherent DMA in their GPUs while everything else is coherent.

Well, it sounds like struct device needs a flag to indicate whether it is coherent or not - but exactly how this gets set seems to be architecture dependent. I don't see bus or driver code being able to make the necessary decisions - eg, tulip driver on x86 would be coherent, but tulip driver on ARM would be non-coherent.

Nevertheless, doing it on a per-device basis is definitely the right answer.

The flag would not get set by the driver that uses the device but the driver that found it, e.g. the PCI bus or the platform code, which should know about these things and also install the appropriate iommu or mapping operations.

As I said above, I don't think bus code can do it. Take my example above of a tulip pci device on x86 and a tulip pci device on ARM. Both use the same PCI code.

Maybe something in asm/pci.h - but that invites having lots of bus specific header files in asm/.

A better solution imho would be to have an architecture callback for struct device which gets registered, which can inspect the type of the device, and set the flag depending on where it appears in the tree.

Arnd Bergmann

8:48 p.m.

On Wednesday 27 April 2011 22:26:03 Russell King - ARM Linux wrote:

...

Maybe something in asm/pci.h - but that invites having lots of bus specific header files in asm/.

A better solution imho would be to have an architecture callback for struct device which gets registered, which can inspect the type of the device, and set the flag depending on where it appears in the tree.

Ah, I was under the assumption that there was already a callback for this. We have a dma_set_coherent_mask() implementation in some pci hosts (ixp4xx and it8152), but that's not a proper callback that can be override per host and it does not actually do what we were talking about here. I guess the callback should live in struct hw_pci in case of ARM, and set a new field in struct device_dma_parameters.

Maybe we don't even need a new flag if we just set device->coherent_dma_mask to zero.

Arnd

Benjamin Herrenschmidt

9:41 p.m.

...

As I said above, I don't think bus code can do it. Take my example above of a tulip pci device on x86 and a tulip pci device on ARM. Both use the same PCI code.

Maybe something in asm/pci.h - but that invites having lots of bus specific header files in asm/.

A better solution imho would be to have an architecture callback for struct device which gets registered, which can inspect the type of the device, and set the flag depending on where it appears in the tree.

Now -that's gross :-)

For PCI you can have the flag propagate from the PHB down, for busses without a bus type (platform) then whoever instanciate them (the platform code) can set that appropriately.

Ben.

Russell King - ARM Linux

28 Apr 28 Apr

9:30 a.m.

On Thu, Apr 28, 2011 at 07:41:07AM +1000, Benjamin Herrenschmidt wrote:

...

...
As I said above, I don't think bus code can do it. Take my example above of a tulip pci device on x86 and a tulip pci device on ARM. Both use the same PCI code.

Maybe something in asm/pci.h - but that invites having lots of bus specific header files in asm/.

A better solution imho would be to have an architecture callback for struct device which gets registered, which can inspect the type of the device, and set the flag depending on where it appears in the tree.

Now -that's gross :-)

For PCI you can have the flag propagate from the PHB down, for busses without a bus type (platform) then whoever instanciate them (the platform code) can set that appropriately.

How can you do that when it changes mid-bus heirarchy? I'm thinking of the situation where the DRM stuff is on a child bus below the root bus, and the root bus has DMA coherent devices on it but the DRM stuff doesn't.

Your solution doesn't allow that - and I believe that's what Arnd is talking about.

Benjamin Herrenschmidt

9:07 p.m.

On Thu, 2011-04-28 at 10:30 +0100, Russell King - ARM Linux wrote:

...

On Thu, Apr 28, 2011 at 07:41:07AM +1000, Benjamin Herrenschmidt wrote:

...
...
As I said above, I don't think bus code can do it. Take my example above of a tulip pci device on x86 and a tulip pci device on ARM. Both use the same PCI code.

Maybe something in asm/pci.h - but that invites having lots of bus specific header files in asm/.

A better solution imho would be to have an architecture callback for struct device which gets registered, which can inspect the type of the device, and set the flag depending on where it appears in the tree.

Now -that's gross :-)

For PCI you can have the flag propagate from the PHB down, for busses without a bus type (platform) then whoever instanciate them (the platform code) can set that appropriately.

How can you do that when it changes mid-bus heirarchy? I'm thinking of the situation where the DRM stuff is on a child bus below the root bus, and the root bus has DMA coherent devices on it but the DRM stuff doesn't.

But that's not PCI right ? IE. with PCI, coherency is a property of the PHB...

...

Your solution doesn't allow that - and I believe that's what Arnd is talking about.

Well, for the rest I'm thinking just bolt it into the platform until you can put the property in the DT :-)

Cheers, Ben.

Arnd Bergmann

29 Apr 29 Apr

11:26 a.m.

On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:

...

...
...
For PCI you can have the flag propagate from the PHB down, for busses without a bus type (platform) then whoever instanciate them (the platform code) can set that appropriately.

How can you do that when it changes mid-bus heirarchy? I'm thinking of the situation where the DRM stuff is on a child bus below the root bus, and the root bus has DMA coherent devices on it but the DRM stuff doesn't.

But that's not PCI right ? IE. with PCI, coherency is a property of the PHB...

That is my understanding at least, but I'd like to have a confirmation from the DRM folks.

I believe that the PC graphics cards that have noncoherent DMA mappings are all of the unified memory (integrated into the northbridge) kind, so they are not on the same host bridge as all regular PCI devices, even if they appear as a PCI device.

Arnd

Benjamin Herrenschmidt

11:47 a.m.

On Fri, 2011-04-29 at 13:26 +0200, Arnd Bergmann wrote:

...

On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:

...
...
...
For PCI you can have the flag propagate from the PHB down, for busses without a bus type (platform) then whoever instanciate them (the platform code) can set that appropriately.

How can you do that when it changes mid-bus heirarchy? I'm thinking of the situation where the DRM stuff is on a child bus below the root bus, and the root bus has DMA coherent devices on it but the DRM stuff doesn't.

But that's not PCI right ? IE. with PCI, coherency is a property of the PHB...

That is my understanding at least, but I'd like to have a confirmation from the DRM folks.

I believe that the PC graphics cards that have noncoherent DMA mappings are all of the unified memory (integrated into the northbridge) kind, so they are not on the same host bridge as all regular PCI devices, even if they appear as a PCI device.

Hrm... beware with x86 , they love playing tricks :-) Since too many BIOSes don't understand PCI domains, they make devices on separate bridges look like sibling on the same segment and that sort of thing.

Cheers, Ben.

Alan Cox

11:56 a.m.

...

I believe that the PC graphics cards that have noncoherent DMA mappings are all of the unified memory (integrated into the northbridge) kind, so they are not on the same host bridge as all regular PCI devices, even if they appear as a PCI device.

The AGP GART is not coherent on a lot of systems - not necessarily unified memory though, it can be a plug in AGP card too. The GART is basically an IOMMU (and indeed in the later AMD case used exactly as that)

Benjamin Herrenschmidt

10:51 p.m.

On Fri, 2011-04-29 at 12:56 +0100, Alan Cox wrote:

...

...
I believe that the PC graphics cards that have noncoherent DMA mappings are all of the unified memory (integrated into the northbridge) kind, so they are not on the same host bridge as all regular PCI devices, even if they appear as a PCI device.

The AGP GART is not coherent on a lot of systems - not necessarily unified memory though, it can be a plug in AGP card too. The GART is basically an IOMMU (and indeed in the later AMD case used exactly as that)

Right. Actually there's also the ability for PCIe devices to set a "no snoop" bit on transactions and thus behave in a non-coherent manner. Hopefully most sane PHBs ignore that bit ...

Cheers, Ben.

Thomas Hellstrom

12:06 p.m.

On 04/29/2011 01:26 PM, Arnd Bergmann wrote:

...

On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:

...
...
...
For PCI you can have the flag propagate from the PHB down, for busses without a bus type (platform) then whoever instanciate them (the platform code) can set that appropriately.

How can you do that when it changes mid-bus heirarchy? I'm thinking of the situation where the DRM stuff is on a child bus below the root bus, and the root bus has DMA coherent devices on it but the DRM stuff doesn't.

But that's not PCI right ? IE. with PCI, coherency is a property of the PHB...

That is my understanding at least, but I'd like to have a confirmation from the DRM folks.

I believe that the PC graphics cards that have noncoherent DMA mappings are all of the unified memory (integrated into the northbridge) kind, so they are not on the same host bridge as all regular PCI devices, even if they appear as a PCI device.

I think Jerome has mentioned at one point that the Radeon graphics cards support non-coherent mappings.

Fwiw, the PowerVR SGX MMU also supports this mode of operation, although it being functional I guess depends on the system implementation.

/Thomas

...

Arnd

Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

Jerome Glisse

1:34 p.m.

On Fri, Apr 29, 2011 at 8:06 AM, Thomas Hellstrom thellstrom@vmware.com wrote:

...

On 04/29/2011 01:26 PM, Arnd Bergmann wrote:

...
On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:

...
...
...
For PCI you can have the flag propagate from the PHB down, for busses without a bus type (platform) then whoever instanciate them (the platform code) can set that appropriately.

How can you do that when it changes mid-bus heirarchy? I'm thinking of the situation where the DRM stuff is on a child bus below the root bus, and the root bus has DMA coherent devices on it but the DRM stuff doesn't.

But that's not PCI right ? IE. with PCI, coherency is a property of the PHB...

That is my understanding at least, but I'd like to have a confirmation from the DRM folks.

I believe that the PC graphics cards that have noncoherent DMA mappings are all of the unified memory (integrated into the northbridge) kind, so they are not on the same host bridge as all regular PCI devices, even if they appear as a PCI device.

I think Jerome has mentioned at one point that the Radeon graphics cards support non-coherent mappings.

Fwiw, the PowerVR SGX MMU also supports this mode of operation, although it being functional I guess depends on the system implementation.

/Thomas

Radeon memory controller can do non snooped pci transaction, as far as i have tested most of the x86 pci bridge don't try to be coherent then ie they don't analyze pci dma and ask for cpu flush they just perform the request (and i guess it's what all bridge will do), so it endup being noncoherent. I haven't done any benchmark of how faster it's for the GPU when it's not snooping but i guess it can give 50% boost as it likely drastictly reduce pci transaction overhead.

I am talking here about device that you plug into any pci or pcie slot, so it's not igp integrated into northbridge or into the cpu.

Cheers, Jerome

Benjamin Herrenschmidt

10:55 p.m.

On Fri, 2011-04-29 at 09:34 -0400, Jerome Glisse wrote:

...

Radeon memory controller can do non snooped pci transaction, as far as i have tested most of the x86 pci bridge don't try to be coherent then ie they don't analyze pci dma and ask for cpu flush they just perform the request (and i guess it's what all bridge will do), so it endup being noncoherent. I haven't done any benchmark of how faster it's for the GPU when it's not snooping but i guess it can give 50% boost as it likely drastictly reduce pci transaction overhead.

I am talking here about device that you plug into any pci or pcie slot, so it's not igp integrated into northbridge or into the cpu.

Right, the card has nothing to do with the snooping process, it's purely a feature of the bridge, based on a flag optionally set by the card. As I said earlier, bridges have the freedom to ignore it, which we do on ppc, so that's a non issue.

Cheers, Ben.

Benjamin Herrenschmidt

10:53 p.m.

On Fri, 2011-04-29 at 14:06 +0200, Thomas Hellstrom wrote:

...

I think Jerome has mentioned at one point that the Radeon graphics cards support non-coherent mappings.

If the card is PCI/PCI-X/PCIe then coherency is not its business, it's the business of the host bridge. However, on PCIe at least, the card can indeed set a "no snoop" attribute on DMA transactions to request "no coherency". At least the systems have the latitude to just ignore that bit (like we do on all ppc afaik) :-)

...

Fwiw, the PowerVR SGX MMU also supports this mode of operation, although it being functional I guess depends on the system implementation.

Right, it's not a GPU thing, it's really a system design thing.

Cheers, Ben.

Marek Szyprowski

27 Apr 27 Apr

10:51 a.m.

Hello,

On Wednesday, April 27, 2011 10:57 AM Arnd Bergmann wrote:

...

On Wednesday 27 April 2011, Russell King - ARM Linux wrote:

...
...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

The main use case that I can see for dma_alloc_noncoherent is being able to allocate a large cacheable memory chunk that is mapped contiguous into both kernel virtual and bus virtual space, but not necessarily in contiguous in physical memory.

Without an IOMMU, I agree that it is pointless, because the only sensible imlpementation would be alloc_pages_exact + dma_map_single.

Still it might be reasonable to use it in the drivers that will work on different platforms - one with iommu and one without.

...

...
...

Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

dma_map_ops design is broken - we can't have the entire DMA API

indirected

...
through that structure. Whether you have an IOMMU or not is completely independent of whether you have to do DMA cache handling. Moreover, with dmabounce, having the DMA cache handling in place doesn't make sense.

So you can't have a dma_map_ops for the cache handling bits, a

dma_map_ops

...
for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't work like that.

I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable for ARM.

We probably still need to handle both the coherent and noncoherent case in each dma_map_ops implementation, at least for those combinations where they matter (definitely the linear mapping). However, I think that using dma_mapping_common.h would let us use an architecture-independent dma_map_ops for the generic iommu code that Marek wants to introduce now.

I still don't understand how dmabounce works, but if it's similar to swiotlb, we can have at least three different dma_map_ops: linear, dmabounce and iommu.

That's exactly what I want to make in the initial version of my patches.

Best regards

-- Marek Szyprowski Samsung Poland R&D Center

Benjamin Herrenschmidt

9:37 p.m.

On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:

...

On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

Fix the arm version of dma_alloc_coherent. It's in use today and is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

I also suggested various approaches and produced patches, which I'm slowly feeding in. However, I think whatever we do, we'll end up breaking something along the line - especially as various places assume that dma_alloc_coherent() is ultimately backed by memory with a struct page.

Our implementation for embedded ppc has a similar problem. It currently uses a pool of memory and does virtual mappings on it which means no struct page easy to get to. How do you do on your side ? A fixed size pool that you take out of the linear mapping ? Or you allocate pages in the linear mapping and "unmap" them ? The problem I have with some embedded ppc's is that the linear map is mapped in chunks of 256M or so....

...

...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

I was about to ask what the point is ... (what is the expected semantic ? Memory that is reachable but not necessarily cache coherent ?)

...

...

Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

dma_map_ops design is broken - we can't have the entire DMA API indirected through that structure.

Why not ? That's the only way we can deal in my experience with multiple type of different iommu's etc... at runtime in a single kernel. We used to more/less have global function pointers in a long past but we moved to per device ops instead to cope with multiple DMA path within a given system and it works fine.

...

Whether you have an IOMMU or not is completely independent of whether you have to do DMA cache handling. Moreover, with dmabounce, having the DMA cache handling in place doesn't make sense.

Right. For now I don't have that problem on ppc as my iommu archs are also fully coherent, so it's a bit more tricky that way but can be handled I suppose by having the cache mgmnt be lib functions based on flags added to the struct device.

...

So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't work like that.

Well, the dmabounce and cache handling is one implementation that's just on/off with parameters no ?. iommu is different implementations. So the ops should be for the iommu backends. The dmabounce & cache handling is then done by those backends based on flags you stick in struct device for example.

...

I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable for ARM.

I don't think it is :-)

Cheers, Ben.

...

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Arnd Bergmann

28 Apr 28 Apr

6:40 a.m.

On Wednesday 27 April 2011 23:37:51 Benjamin Herrenschmidt wrote:

...

On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:

...
On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

I was about to ask what the point is ... (what is the expected semantic ? Memory that is reachable but not necessarily cache coherent ?)

Drivers use this when they explicitly want to manage the caching themselves. I think this is most interesting on big NUMA systems, where you really want to use fast (local cached) memory and then flush it explicitly to do dma. Very few drivers use this:

arnd@wuerfel:~/linux-2.6$ git grep dma_alloc_noncoherent drivers/ drivers/base/dma-mapping.c: vaddr = dma_alloc_noncoherent(dev, size, dma_handle, gfp); drivers/net/au1000_eth.c: aup->vaddr = (u32)dma_alloc_noncoherent(NULL, MAX_BUF_SIZE * drivers/net/lasi_82596.c:#define DMA_ALLOC dma_alloc_noncoherent drivers/net/sgiseeq.c: sr = dma_alloc_noncoherent(&pdev->dev, sizeof(*sp->srings), drivers/scsi/53c700.c: memory = dma_alloc_noncoherent(hostdata->dev, TOTAL_MEM_SIZE, drivers/scsi/sgiwd93.c: hdata->cpu = dma_alloc_noncoherent(&pdev->dev, HPC_DMA_SIZE, drivers/tty/serial/mpsc.c: } else if ((pi->dma_region = dma_alloc_noncoherent(pi->port.dev, drivers/video/au1200fb.c: fbdev->fb_mem = dma_alloc_noncoherent(&dev->dev,

...

...
So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't work like that.

Well, the dmabounce and cache handling is one implementation that's just on/off with parameters no ?. iommu is different implementations. So the ops should be for the iommu backends. The dmabounce & cache handling is then done by those backends based on flags you stick in struct device for example.

Well, what we are currently discussing is to have a common implementation for IOMMUs that provide the generic iommu_ops that the KVM people introduced. Once we get there, we only need a single dma_map_ops structure for all IOMMUs.

Arnd

FUJITA Tomonori

6:46 a.m.

On Thu, 28 Apr 2011 08:40:08 +0200 Arnd Bergmann arnd@arndb.de wrote:

...

On Wednesday 27 April 2011 23:37:51 Benjamin Herrenschmidt wrote:

...
On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:

...
On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

I was about to ask what the point is ... (what is the expected semantic ? Memory that is reachable but not necessarily cache coherent ?)

Drivers use this when they explicitly want to manage the caching themselves.

Not "want to manage". The API is for drivers that "have to" manage the cache because of architectures that can't allocate coherent memory.

...

I think this is most interesting on big NUMA systems, where you really want to use fast (local cached) memory and then flush it explicitly to do dma. Very few drivers use this:

Russell King - ARM Linux

9:37 a.m.

On Thu, Apr 28, 2011 at 07:37:51AM +1000, Benjamin Herrenschmidt wrote:

...

On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:

...
On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

Fix the arm version of dma_alloc_coherent. It's in use today and is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

I also suggested various approaches and produced patches, which I'm slowly feeding in. However, I think whatever we do, we'll end up breaking something along the line - especially as various places assume that dma_alloc_coherent() is ultimately backed by memory with a struct page.

Our implementation for embedded ppc has a similar problem. It currently uses a pool of memory and does virtual mappings on it which means no struct page easy to get to. How do you do on your side ? A fixed size pool that you take out of the linear mapping ? Or you allocate pages in the linear mapping and "unmap" them ? The problem I have with some embedded ppc's is that the linear map is mapped in chunks of 256M or so....

We don't - what I was referring to was people taking the DMA cookie and treating it as a physical address, converting it to a PFN and then doing pfn_to_page() on that. (Yes, it's been tried.)

There have been some subsystems (eg ALSA) which also tried to use virt_to_page() on dma_alloc_coherent(), but I think those got fixed to use our dma_mmap_coherent() stuff when building on ARM.

...

...
...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

I was about to ask what the point is ... (what is the expected semantic ? Memory that is reachable but not necessarily cache coherent ?)

As far as I can see, dma_alloc_noncoherent() should just be a wrapper around the normal page allocation function. I don't see it ever needing to do anything special - and the advantage of just being the normal page allocation function is that its properties are well known and architecture independent.

...

...
...

Convert ARM to use asm-generic/dma-mapping-common.h. We need both IOMMU and direct mapped DMA on some machines.

Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

dma_map_ops design is broken - we can't have the entire DMA API indirected through that structure.

Why not ? That's the only way we can deal in my experience with multiple type of different iommu's etc... at runtime in a single kernel. We used to more/less have global function pointers in a long past but we moved to per device ops instead to cope with multiple DMA path within a given system and it works fine.

...
Whether you have an IOMMU or not is completely independent of whether you have to do DMA cache handling. Moreover, with dmabounce, having the DMA cache handling in place doesn't make sense.

Here I've answered your question above.

...

Right. For now I don't have that problem on ppc as my iommu archs are also fully coherent, so it's a bit more tricky that way but can be handled I suppose by having the cache mgmnt be lib functions based on flags added to the struct device.

Think about stuffing all the iommu drivers with DMA cache management for ARM, and think about the maintainability for that when other folk come along and change the iommu drivers. I've no desire to keep going to fix them each time someone breaks the DMA cache management because everyone elses cache is DMA coherent.

Keep that in the arch code, out of the dma_ops and it doesn't have to be thought about by each and every iommu driver.

...

...
So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't work like that.

Well, the dmabounce and cache handling is one implementation that's just on/off with parameters no ?. iommu is different implementations. So the ops should be for the iommu backends. The dmabounce & cache handling is then done by those backends based on flags you stick in struct device for example.

You've completely missed the point.

Marek Szyprowski

10:32 a.m.

Hello,

On Thursday, April 28, 2011 11:38 AM Russell King - ARM Linux wrote:

...

...
...
...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

I was about to ask what the point is ... (what is the expected semantic ? Memory that is reachable but not necessarily cache coherent ?)

As far as I can see, dma_alloc_noncoherent() should just be a wrapper around the normal page allocation function. I don't see it ever needing to do anything special - and the advantage of just being the normal page allocation function is that its properties are well known and architecture independent.

If there is IOMMU chip that supports pages larger than 4KiB then dma_alloc_noncoherent() might try to allocate such larger pages what will result in faster access to the buffer (lower iommu tlb miss ratio). For large buffers even 64KiB 'pages' gives a significant performance improvement.

Best regards

-- Marek Szyprowski Samsung Poland R&D Center

Russell King - ARM Linux

10:51 a.m.

On Thu, Apr 28, 2011 at 12:32:32PM +0200, Marek Szyprowski wrote:

...

On Thursday, April 28, 2011 11:38 AM Russell King - ARM Linux wrote:

...
...
...
...

Implement dma_alloc_noncoherent on ARM. Marek pointed out that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

I was about to ask what the point is ... (what is the expected semantic ? Memory that is reachable but not necessarily cache coherent ?)

As far as I can see, dma_alloc_noncoherent() should just be a wrapper around the normal page allocation function. I don't see it ever needing to do anything special - and the advantage of just being the normal page allocation function is that its properties are well known and architecture independent.

If there is IOMMU chip that supports pages larger than 4KiB then dma_alloc_noncoherent() might try to allocate such larger pages what will result in faster access to the buffer (lower iommu tlb miss ratio). For large buffers even 64KiB 'pages' gives a significant performance improvement.

The memory allocated by dma_alloc_noncoherent() (and dma_alloc_coherent()) has to be virtually contiguous, and DMA contiguous. It is assumed by all drivers that:

virt = dma_alloc_foo(size, &dma);

cpuaddr = virt + offset; dmaaddr = dma + offset;

results in the CPU and DMA seeing ultimately the same address for cpuaddr and dmaaddr for 0 <= offset < size.

The standard alloc_pages() also ensures that if you ask for an order-N page, you'll end up with that allocation being contiguous - so there's no difference there.

What I'd suggest is that dma_alloc_noncoherent() should be architecture independent, and should call into whatever iommu support the device has to setup an approprite iommu mapping. IOW, I don't see any need for every architecture to provide its own dma_alloc_noncoherent() allocation function - or indeed every iommu implementation to deal with the allocation issues either.

Arnd Bergmann

12:28 p.m.

On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...

What I'd suggest is that dma_alloc_noncoherent() should be architecture independent, and should call into whatever iommu support the device has to setup an approprite iommu mapping. IOW, I don't see any need for every architecture to provide its own dma_alloc_noncoherent() allocation function - or indeed every iommu implementation to deal with the allocation issues either.

Almost all architectures today define dma_alloc_noncoherent to dma_alloc_coherent, which is totally fine on architectures where cacheable coherent mappings are the default or where we don't need to flush individual cache lines for dma_sync_*.

The problem with backing either of the two with alloc_pages or alloc_pages_exact is that you cannot do large allocation when physical memory is fragmented, even if you have an IOMMU.

IMHO the allocation for both dma_alloc_coherent and dma_alloc_noncoherent should therefore depend on whether you have an IOMMU. If you do, you can easily allocate megabytes, e.g. for use as a frame buffer.

Arnd

Russell King - ARM Linux

1:15 p.m.

On Thu, Apr 28, 2011 at 02:28:56PM +0200, Arnd Bergmann wrote:

...

On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...
What I'd suggest is that dma_alloc_noncoherent() should be architecture independent, and should call into whatever iommu support the device has to setup an approprite iommu mapping. IOW, I don't see any need for every architecture to provide its own dma_alloc_noncoherent() allocation function - or indeed every iommu implementation to deal with the allocation issues either.

Almost all architectures today define dma_alloc_noncoherent to dma_alloc_coherent, which is totally fine on architectures where cacheable coherent mappings are the default or where we don't need to flush individual cache lines for dma_sync_*.

However, dma_alloc_coherent() memory can't be used with the dma_sync_* API as its return address (unlike other architectures) is not in the kernel direct mapped memory range.

The only thing valid for dma_sync_* are buffers which have been passed to the dma_map_* APIs.

Instead, I think what you're referring to is dma_cache_sync(), which is the API to be used with dma_alloc_noncoherent(), which we don't implement.

As we have problems with some SMP implementations, and the noncoherent API doesn't have the idea of buffer ownership, it's rather hard to deal with the DMA cache implications with the existing API, especially with the issues of speculative prefetching. The current usage (looking at drivers/scsi/53c700.c) doesn't cater for speculative prefetch as the dma_cache_sync(,,,DMA_FROM_DEVICE) is done well in advance of the DMA actually happening.

So all in all, I think the noncoherent API is broken as currently designed - and until we have devices on ARM which use it, I don't see much point into trying to fix the current thing especially as we'd be unable to test.

Arnd Bergmann

2:29 p.m.

On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...

On Thu, Apr 28, 2011 at 02:28:56PM +0200, Arnd Bergmann wrote:

...
On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...
What I'd suggest is that dma_alloc_noncoherent() should be architecture independent, and should call into whatever iommu support the device has to setup an approprite iommu mapping. IOW, I don't see any need for every architecture to provide its own dma_alloc_noncoherent() allocation function - or indeed every iommu implementation to deal with the allocation issues either.

Almost all architectures today define dma_alloc_noncoherent to dma_alloc_coherent, which is totally fine on architectures where cacheable coherent mappings are the default or where we don't need to flush individual cache lines for dma_sync_*.

However, dma_alloc_coherent() memory can't be used with the dma_sync_* API as its return address (unlike other architectures) is not in the kernel direct mapped memory range.

Right, because ARM does not fit in the two categories I listed above: the regular DMA is not cache coherent and we need to flush the cache lines for the data we want to access in dma_sync_*.

...

The only thing valid for dma_sync_* are buffers which have been passed to the dma_map_* APIs.

Instead, I think what you're referring to is dma_cache_sync(), which is the API to be used with dma_alloc_noncoherent(), which we don't implement.

As we have problems with some SMP implementations, and the noncoherent API doesn't have the idea of buffer ownership, it's rather hard to deal with the DMA cache implications with the existing API, especially with the issues of speculative prefetching. The current usage (looking at drivers/scsi/53c700.c) doesn't cater for speculative prefetch as the dma_cache_sync(,,,DMA_FROM_DEVICE) is done well in advance of the DMA actually happening.

So all in all, I think the noncoherent API is broken as currently designed - and until we have devices on ARM which use it, I don't see much point into trying to fix the current thing especially as we'd be unable to test.

I agree that dma_cache_sync() is totally unusable on ARM, I thought we had killed that off and replaced it with dma_sync_*. Unfortunately, I was mistaken there: all drivers that use dma_alloc_noncoherent either use dma_cache_sync() or they do something that is more broken, but they don't do dma_sync_*.

Given that people still want to have an interface that does what I though this one did, I guess we have two options:

* Kill off dma_cache_sync and replace it with calls to dma_sync_* so we can start using dma_alloc_noncoherent on ARM

* Introduce a new interface

Arnd

Russell King - ARM Linux

2:34 p.m.

On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:

...

Given that people still want to have an interface that does what I though this one did, I guess we have two options:

Kill off dma_cache_sync and replace it with calls to dma_sync_* so we can start using dma_alloc_noncoherent on ARM

I don't think this is an option as dma_sync_*() is part of the streaming DMA mapping API (dma_map_*) which participates in the idea of buffer ownership, which the noncoherent API doesn't appear to.

Arnd Bergmann

2:39 p.m.

On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...

On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:

...
Given that people still want to have an interface that does what I though this one did, I guess we have two options:

Kill off dma_cache_sync and replace it with calls to dma_sync_* so we can start using dma_alloc_noncoherent on ARM

I don't think this is an option as dma_sync_*() is part of the streaming DMA mapping API (dma_map_*) which participates in the idea of buffer ownership, which the noncoherent API doesn't appear to.

I thought the problem was in fact that the noncoherent API cannot be implemented on architectures like ARM specifically because there is no concept of buffer ownership. The obvious way to fix that would be to redefine the API. What am I missing?

Arnd

Russell King - ARM Linux

2:58 p.m.

On Thu, Apr 28, 2011 at 04:39:59PM +0200, Arnd Bergmann wrote:

...

On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...
On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:

...
Given that people still want to have an interface that does what I though this one did, I guess we have two options:

Kill off dma_cache_sync and replace it with calls to dma_sync_* so we can start using dma_alloc_noncoherent on ARM

I don't think this is an option as dma_sync_*() is part of the streaming DMA mapping API (dma_map_*) which participates in the idea of buffer ownership, which the noncoherent API doesn't appear to.

I thought the problem was in fact that the noncoherent API cannot be implemented on architectures like ARM specifically because there is no concept of buffer ownership. The obvious way to fix that would be to redefine the API. What am I missing?

You are partially correct. With the streaming interface, we're fairly strict with the buffer ownership stuff, as the most effective way to implement it across all our CPUs is to deal with the mapping, sync and unmapping in terms of buffers being passed from CPU control to DMA device control and back again.

With the noncoherent interface, there is less of a buffer ownership idea. For instance, to read from a noncoherent buffer, the following is required (in order, I'm not considering the effects of weakly ordered stuff):

/* dma happens, signalled complete */ dma_cache_invalidate(buffer, size); /* cpu can now see up to date data */ message = *buffer;

Unlike the streaming API, we don't need to hand the buffer back to the device before the CPU can repeat the above code sequence.

If we want to write to a noncoherent buffer, then we need:

*buffer = value; dma_cache_writeback(buffer, size); /* dma can only now see new value */

and again, the same thing applies.

There is an additional problem lurking in amongst this though - a buffer which is both read and written by the CPU has to be extremely careful of cache writebacks - this for instance would not be legal:

*buffer = value; ... /* dma from device */ dma_cache_invalidate(buffer, size); message = *buffer;

as it is not predictable whether we'll see 'value' or the DMA data - that depends on the relative ordering of the DMA writing to RAM vs the cache eviction of the CPU write.

So, there is a kind of buffer ownership here:

/* cpu owns */ dma_cache_writeback(buffer, size); /* dma owns */ dma_cache_invalidate(buffer, size); /* cpu owns */

but as shown above it doesn't need to be as strict as the streaming API.

Also note that there's a problem lurking here with DMA cache line size:

| int | dma_get_cache_alignment(void) | | Returns the processor cache alignment. This is the absolute minimum | alignment *and* width that you must observe when either mapping | memory or doing partial flushes. | | Notes: This API may return a number *larger* than the actual cache | line, but it will guarantee that one or more cache lines fit exactly | into the width returned by this call. It will also always be a power | of two for easy alignment.

$ grep -L dma_get_cache_alignment $(grep dma_alloc_noncoherent drivers/ -lr) drivers/base/dma-mapping.c drivers/scsi/sgiwd93.c drivers/scsi/53c700.c drivers/net/au1000_eth.c drivers/net/sgiseeq.c drivers/net/lasi_82596.c drivers/video/au1200fb.c

so we have a bunch of drivers which presumably don't take any notice of the DMA cache line size, which may be very important. 53c700 for instance aligns its buffers using L1_CACHE_ALIGN(), which may be smaller than what's actually required...

Jerome Glisse

7:37 p.m.

On Thu, Apr 28, 2011 at 10:34 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:

...

On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:

...
Given that people still want to have an interface that does what I though this one did, I guess we have two options:

Kill off dma_cache_sync and replace it with calls to dma_sync_*

so we can start using dma_alloc_noncoherent on ARM

I don't think this is an option as dma_sync_*() is part of the streaming DMA mapping API (dma_map_*) which participates in the idea of buffer ownership, which the noncoherent API doesn't appear to.

Sorry to jump in like that, but to me it seems that this whole discussion is going toward having the decision of cache attribute inside dma_* function and that a driver asking for uncached memory might get cached memory if IOMMU or others component allows to have cache coherency.

As Jesse pointed out already, for performance reasons it's lot better if you let the driver decide even if you have an iommu capable of handling coherency for you. My understanding is that each time coherency is asked for it trigger bus activities of some kind (i think snoop is the term used for pci) this traffic can slow down both the cpu and the device. For graphic driver we have a lot of write once and use (once or more) buffer and it makes a lot of sense to have those buffer allocated using uncached memory so we can tell the device (in case of drm driver) that there is no need to trigger snoop activities for coherency. So i believe the decision should ultimately be in the driver side.

Jesse also pointed out space exhaustion inside the iommu and i believe this should also be considered. This is why i believe the dma_* api is not well suited. In DRM/TTM we use pci_dma_mapping* and we also play with with page set_page*_uc|wc|wb.

So i believe a better API might look like :

- struct dma_alloc_unit { bool contiguous; uint dmamask; } struct dma_buffer { dma_unit } CONTIGUOUS tell that this dma unit needs contiguous allocation or not, if it needs contiguous allocation and there is an iommu then the allocator might allocate non contiguous pages/memory and latter properly program the iommu to make things look contiguous to the device. if contiguous==false then allocator might allocate one page at a time but should rather to allocate a bunch of contiguous page to allow optimization for minimizing tlb miss if the device allow such things (maybe adding a flag here might make sense) -dma_buffer dma_alloc_(uc|wc|wb)(dma_alloc_unit, size) : alloc memory according to constraint defined by dma_alloc_unit -dma_buffer_update(dma_buffer, offset, size) allow dmabounce&swiotlb to know what needs to be updated -dma_bus_map(dma_buffer) map the buffer on to the bus in case of dmabounce that would mean copy to the bounce buffer, for iommu that would mean bind it, and in case of no iommu well do nothings -dma_bus_unmap(dma_buffer) implementation might not necessarily unmap the buffer if there is plenty of room in the iommu

So usage would look like : mydma_buffer = dma_alloc_uc(N); cpuptr=dma_cpu_ptr(mydma_buffer) //write to the buffer // tell dma which data need to be updated depending on platform iommu,dmabounce cache flushing ... dma_buffer_update(mydma_buffer, offset, size) dma_bus_map(mydma_buffer) // let the device use the buffer ... // the buffer isn't use anymore by the device dma_bus_unmap(mydma_buffer)

It hides things like iommu or dmabounce from the device driver but still allow the device driver to ask for the most optimal way. A platform decide to not support dma_alloc_uc|wc (ie non coherent) if it has an iommu that can handle coherency or some others way to handle it like flushing. But if platform wants better performance it should try to provide non coherent allocation (through highmem or changing kernel mapping properties ...).

Maybe i am completely missing the point.

Cheers, Jerome

Benjamin Herrenschmidt

29 Apr 29 Apr

12:29 a.m.

On Thu, 2011-04-28 at 15:37 -0400, Jerome Glisse wrote:

...

Jesse also pointed out space exhaustion inside the iommu and i believe this should also be considered. This is why i believe the dma_* api is not well suited. In DRM/TTM we use pci_dma_mapping* and we also play with with page set_page*_uc|wc|wb.

Which are yet another set of completely x86-centric APIs that have not been thought in the context of other architectures and are probably mostly unimplementables on half of them :-)

Cheers, Ben.

Thomas Hellstrom

5:50 a.m.

On 04/29/2011 02:29 AM, Benjamin Herrenschmidt wrote:

...

On Thu, 2011-04-28 at 15:37 -0400, Jerome Glisse wrote:

...
Jesse also pointed out space exhaustion inside the iommu and i believe this should also be considered. This is why i believe the dma_* api is not well suited. In DRM/TTM we use pci_dma_mapping* and we also play with with page set_page*_uc|wc|wb.

Which are yet another set of completely x86-centric APIs that have not been thought in the context of other architectures and are probably mostly unimplementables on half of them :-)

Cheers, Ben.

I've been doing some thinking over the years on how we could extend that functionality to other architectures. The reason we need those is because some x86 processors (early AMDs and, I think VIA c3) dislike multiple mappings of the same pages with conflicting caching attributes.

What we really want to be able to do is to unmap pages from the linear kernel map, to avoid having to transition the linear kernel map every time we change other mappings.

The reason we need to do this in the first place is that AGP and modern GPUs has a fast mode where snooping is turned off.

However, we should be able to construct a completely generic api around these operations, and for architectures that don't support them we need to determine

a) Whether we want to support them anyway (IIRC the problem with PPC is that the linear kernel map has huge tlb entries that are very inefficient to break up?)

b) Whether they are needed at all on the particular architecture. The Intel x86 spec is, (according to AMD), supposed to forbid conflicting caching attributes, but the Intel graphics guys use them for GEM. PPC appears not to need it.

c) If neither of the above applies, we might be able to either use explicit cache flushes (which will require a TTM cache sync API), or require the device to use snooping mode. The architecture may also perhaps have a pool of write-combined pages that we can use. This should be indicated by defines in the api header.

/Thomas

...

Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

Benjamin Herrenschmidt

7:35 a.m.

...

I've been doing some thinking over the years on how we could extend that functionality to other architectures. The reason we need those is because some x86 processors (early AMDs and, I think VIA c3) dislike multiple mappings of the same pages with conflicting caching attributes.

What we really want to be able to do is to unmap pages from the linear kernel map, to avoid having to transition the linear kernel map every time we change other mappings.

The reason we need to do this in the first place is that AGP and modern GPUs has a fast mode where snooping is turned off.

Right. Unfortunately, unmapping pages from the linear mapping is precisely what I cannot give you on powerpc :-(

This is due to our tendency to map it using the largest page size available. That translates to things like:

- On hash based ppc64, I use 16M pages. I can't "break them up" due to the limitation of the processor of having a single page size per segment (and we use 1T segments nowadays). I could break the whole thing down to 4K but that would very seriously affect system performances.

- On embedded, I map it using 1G pages. I suppose I could break it up since it's SW loaded but here too, system performance would suffer. In addition, we rely on ppc32 embedded to have the first 768M of the linear mapping and on ppc64 embedded, the first 1G, mapped using bolted TLB entries, which we can really only do using very large entries (respectively 256M and 1G) that can't be broken up.

So you need to make sure whatever APIs you come up with will work on architectures where memory -has- to be cachable and coherent and you cannot play with the linear mapping. But that won't help with our non-coherent embedded systems :-(

Maybe with future chips we'll have more flexibility here but not at this point.

...

However, we should be able to construct a completely generic api around these operations, and for architectures that don't support them we need to determine

a) Whether we want to support them anyway (IIRC the problem with PPC is that the linear kernel map has huge tlb entries that are very inefficient to break up?)

Depends on the PPC variant / type of MMU. Inefficiency is part of the problem. The need to have things bolted is another part. 4xx/BookE for example needs to have lowmem bolted in the TLB. If it's broken up, you'll quickly use up the TLB with bolted entries.

We could relax that to a certain extent until only the kernel text/data/bss needs to be bolted, tho that would be at the expense of performance of the TLB miss handlers which would have issues walking the page tables. We'd also need to make sure we don't hand out to your API the memory that is within the bolted entries that cover the kernel.

IE. If the kernel is large (32M ?) then the smallest entry I can use on some CPUs will be 256M. So I'll need to have a way to allocate outside of the first 256M. The linux allocators today don't allow for that sort of restrictions.

...

b) Whether they are needed at all on the particular architecture. The Intel x86 spec is, (according to AMD), supposed to forbid conflicting caching attributes, but the Intel graphics guys use them for GEM. PPC appears not to need it.

We have problems with AGP and macs, we chose to mostly ignore them and things have been working so-so ... with the old DRM. With DRI2 being much more aggressive at mapping/unmapping things, things became a lot less stable and it could be in part related to that. IE. Aliases are similarily forbidden but we create them anyways.

...

c) If neither of the above applies, we might be able to either use explicit cache flushes (which will require a TTM cache sync API), or require the device to use snooping mode. The architecture may also perhaps have a pool of write-combined pages that we can use. This should be indicated by defines in the api header.

Right. We should still shoot HW designers who give up coherency for the sake of 3D benchmarks. It's insanely stupid.

Cheers, Ben.

...

/Thomas

...

Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

Thomas Hellstrom

10:55 a.m.

On 04/29/2011 09:35 AM, Benjamin Herrenschmidt wrote:

...

We have problems with AGP and macs, we chose to mostly ignore them and things have been working so-so ... with the old DRM. With DRI2 being much more aggressive at mapping/unmapping things, things became a lot less stable and it could be in part related to that. IE. Aliases are similarily forbidden but we create them anyways.

Do you have any idea how other OS's solve this AGP issue on Macs? Using a fixed pool of write-combined pages?

...

...
c) If neither of the above applies, we might be able to either use explicit cache flushes (which will require a TTM cache sync API), or require the device to use snooping mode. The architecture may also perhaps have a pool of write-combined pages that we can use. This should be indicated by defines in the api header.

Right. We should still shoot HW designers who give up coherency for the sake of 3D benchmarks. It's insanely stupid.

I agree. From a driver writer's perspective having the GPU always snooping the system pages would be a dream. On the GPUs that do support snooping that I have looked at, its internal MMU usually support both modes, but the snooping mode is way slower (we're talking 50-70% or so slower texturing operations), and often buggy causing crashes or scanout timing issues since system designers apparently don't really count on it being used. I've found it usable for device-to-system memory blits.

In addition memcpy to device is usually way faster if the destination is write-combined. Probably due to cache thrashing effects.

/Thomas

...

Cheers, Ben.

...
/Thomas
 
...
Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig
   

Benjamin Herrenschmidt

10:50 p.m.

On Fri, 2011-04-29 at 12:55 +0200, Thomas Hellstrom wrote:

...

On 04/29/2011 09:35 AM, Benjamin Herrenschmidt wrote:

...
We have problems with AGP and macs, we chose to mostly ignore them and things have been working so-so ... with the old DRM. With DRI2 being much more aggressive at mapping/unmapping things, things became a lot less stable and it could be in part related to that. IE. Aliases are similarily forbidden but we create them anyways.

Do you have any idea how other OS's solve this AGP issue on Macs? Using a fixed pool of write-combined pages?

Write-combine is a different business, it's a matter of not mapping with the G bit, but no, the way MacOS works I think is that they don't actually use large pages at all, and I don't even think they have a linear mapping of all memory. On the other hand they are slow :-)

...

...
...
c) If neither of the above applies, we might be able to either use explicit cache flushes (which will require a TTM cache sync API), or require the device to use snooping mode. The architecture may also perhaps have a pool of write-combined pages that we can use. This should be indicated by defines in the api header.

Right. We should still shoot HW designers who give up coherency for the sake of 3D benchmarks. It's insanely stupid.

I agree. From a driver writer's perspective having the GPU always snooping the system pages would be a dream. On the GPUs that do support snooping that I have looked at, its internal MMU usually support both modes, but the snooping mode is way slower (we're talking 50-70% or so slower texturing operations), and often buggy causing crashes or scanout timing issues since system designers apparently don't really count on it being used. I've found it usable for device-to-system memory blits.

In addition memcpy to device is usually way faster if the destination is write-combined. Probably due to cache thrashing effects.

Possibly. It's a matter of the HW folks actually spending some time to make it work properly. It can be done :-) It's just that they don't bother. Look at the perfs one can get out of fully coherent PCIe nowadays, largely enough for a simple scanout :-)

Cheers, Ben.

...

/Thomas

...
Cheers, Ben.

...
/Thomas
 
...
Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig
   
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Jesse Barnes

4:27 p.m.

On Fri, 29 Apr 2011 17:35:23 +1000 Benjamin Herrenschmidt benh@kernel.crashing.org wrote:

...

...
I've been doing some thinking over the years on how we could extend that functionality to other architectures. The reason we need those is because some x86 processors (early AMDs and, I think VIA c3) dislike multiple mappings of the same pages with conflicting caching attributes.

What we really want to be able to do is to unmap pages from the linear kernel map, to avoid having to transition the linear kernel map every time we change other mappings.

The reason we need to do this in the first place is that AGP and modern GPUs has a fast mode where snooping is turned off.

Right. Unfortunately, unmapping pages from the linear mapping is precisely what I cannot give you on powerpc :-(

This is due to our tendency to map it using the largest page size available. That translates to things like:

On hash based ppc64, I use 16M pages. I can't "break them up" due to

the limitation of the processor of having a single page size per segment (and we use 1T segments nowadays). I could break the whole thing down to 4K but that would very seriously affect system performances.

On embedded, I map it using 1G pages. I suppose I could break it up

since it's SW loaded but here too, system performance would suffer. In addition, we rely on ppc32 embedded to have the first 768M of the linear mapping and on ppc64 embedded, the first 1G, mapped using bolted TLB entries, which we can really only do using very large entries (respectively 256M and 1G) that can't be broken up. So you need to make sure whatever APIs you come up with will work on architectures where memory -has- to be cachable and coherent and you cannot play with the linear mapping. But that won't help with our non-coherent embedded systems :-(

You must be making it sound worse than it really is, otherwise how would an embedded platform like the above deal with a display engine that needed a large, contiguous chunk of uncached memory for the display buffer? If the CPU is actively speculating into it and overwriting blits etc it would never work... Or do you do such reservations up front at 1G granularity??

...

Right. We should still shoot HW designers who give up coherency for the sake of 3D benchmarks. It's insanely stupid.

Ah if it were that simple. :) There are big costs to implementing full coherency for all your devices, as you well know, so it's just not a question of benchmark optimization.

-- Jesse Barnes, Intel Open Source Technology Center

Benjamin Herrenschmidt

10:46 p.m.

On Fri, 2011-04-29 at 09:27 -0700, Jesse Barnes wrote:

...

You must be making it sound worse than it really is, otherwise how would an embedded platform like the above deal with a display engine that needed a large, contiguous chunk of uncached memory for the display buffer? If the CPU is actively speculating into it and overwriting blits etc it would never work... Or do you do such reservations up front at 1G granularity??

Such embedded platforms have not been used with GPUs so far and our only implementation of 64-bit BookE is fortunately also completely cache coherent :-)

The good thing on ppc is that so far there is no new design coming from us or FSL that isn't cache coherent. The bad thing is that people seem to still try to pump out things using old 44x which isn't and somewhat seem to also want to use GPUs on them :-)

The 44x is a case where I have a small (64 entries) SW loaded TLB and I bolt the first 768M of the linear mapping (lowmem) using 3x256M entries. What "saves" it is that it's also an ancient design with essentially a busted prefetch engine that will thus cope with aliases as long as we don't explicitely access the cached and non-cached aliases simultaneously.

The nasty cases I have never really dealt with properly are the Apple machines and their non coherent AGP. Those processors were really not designed with the idea that one would do non-coherent DMA, especially the 970 (G5) and our Linux code really don't like it.

Things tend to "work" with DRI 1 because we allocate the AGP memory once in one big chunk (it's pages but they are allocated together and thus tend to be contiguous) so the possible issues with prefetch are so rare, I think we end up being lucky. With DRI 2 dynamically mapping things in/out, we have a bigger problem and I don't know how to solve it other than forcing the DRM to allocate graphic objects in reserved areas of memory made of 16M pools that I unmap from the linear mapping.... (since I use 16M pages to map the linear mapping).

For ppc32 laptops it's even worse as I use 256MB BATs (block address translation, kind of special registers to create large static mappings) to map the linear mapping, which brings me back to the 44x case to some extent. I can't really do without at the moment, at the very least I require the kernel text / data / bss to be covered by BATs.

...

...
Right. We should still shoot HW designers who give up coherency for the sake of 3D benchmarks. It's insanely stupid.

Ah if it were that simple. :) There are big costs to implementing full coherency for all your devices, as you well know, so it's just not a question of benchmark optimization.

But it -is- that simple.

You do have to deal with coherency anyways for your PHB unless you start advocating that we should make everything else non coherent as well. So you have the logic. Just make your GPU operate on the same protocol.

It's really only a perf tradeoff I believe. And a bad one.

Cheers, Ben.

Jesse Barnes

30 Apr 30 Apr

2:45 a.m.

On Sat, 30 Apr 2011 08:46:54 +1000 Benjamin Herrenschmidt benh@kernel.crashing.org wrote:

...

...
Ah if it were that simple. :) There are big costs to implementing full coherency for all your devices, as you well know, so it's just not a question of benchmark optimization.

But it -is- that simple.

You do have to deal with coherency anyways for your PHB unless you start advocating that we should make everything else non coherent as well. So you have the logic. Just make your GPU operate on the same protocol.

It's really only a perf tradeoff I believe. And a bad one.

Ok so I was the one oversimplifying. :) Yes, it's definitely doable to make a cache coherent PHB, and is awfully nice from a perf and programming perspective.

But as you say, to make a high performance one for things like gfx, or even to handle things like atomic ops, adds a lot of expense (in the case of graphics, a whole lot unless you can integrate with the CPU, and even then display can be tough to deal with).

I don't see even good coherent implementations being good enough for high perf graphics in the near term (though at least on relatively high power designs like Sandy Bridge we're getting close) so we'll have to solve the uncached and simultaneous mapping issue both for today's hardware and the near future.

-- Jesse Barnes, Intel Open Source Technology Center

Russell King - ARM Linux

29 Apr 29 Apr

7:59 a.m.

On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:

...

However, we should be able to construct a completely generic api around these operations, and for architectures that don't support them we need to determine

a) Whether we want to support them anyway (IIRC the problem with PPC is that the linear kernel map has huge tlb entries that are very inefficient to break up?)

That same issue applies to ARM too - you'd need to stop the entire machine, rewrite all processes page tables, flush tlbs, and only then restart. Otherwise there's the possibility of ending up with conflicting types of TLB entries, and I'm not sure what the effect of having two matching TLB entries for the same address would be.

...

b) Whether they are needed at all on the particular architecture. The Intel x86 spec is, (according to AMD), supposed to forbid conflicting caching attributes, but the Intel graphics guys use them for GEM. PPC appears not to need it.

Some versions of the architecture manual say that having multiple mappings with differing attributes is unpredictable.

Jesse Barnes

4:32 p.m.

On Fri, 29 Apr 2011 08:59:58 +0100 Russell King - ARM Linux linux@arm.linux.org.uk wrote:

...

On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:

...
However, we should be able to construct a completely generic api around these operations, and for architectures that don't support them we need to determine

a) Whether we want to support them anyway (IIRC the problem with PPC is that the linear kernel map has huge tlb entries that are very inefficient to break up?)

That same issue applies to ARM too - you'd need to stop the entire machine, rewrite all processes page tables, flush tlbs, and only then restart. Otherwise there's the possibility of ending up with conflicting types of TLB entries, and I'm not sure what the effect of having two matching TLB entries for the same address would be.

Right, I don't think anyone wants to see this sort of thing happen with any frequency. So either a large, uncached region can be set up a boot time for allocations, or infrequent, large requests and conversions can be made on demand, with memory being freed back to the main, coherent pool under pressure.

...

...
b) Whether they are needed at all on the particular architecture. The Intel x86 spec is, (according to AMD), supposed to forbid conflicting caching attributes, but the Intel graphics guys use them for GEM. PPC appears not to need it.

Some versions of the architecture manual say that having multiple mappings with differing attributes is unpredictable.

Yes, there's a bit of abuse going on there. We've received a guarantee that if the CPU speculates a line into the cache, as long as it's not modified through the cacheable mapping the CPU won't write it back to memory; it'll discard the line as needed instead (iirc AMD CPUs will actually write back clean lines, so GEM wouldn't work the same way there).

But even with GEM, there is a large performance penalty for having to allocate a new buffer object the first time. Even though we don't have to change mappings by stopping the machine etc, we still have to flush out everything from the CPU relating to the object (since some lines may be dirty), and then flush the memory controller buffers before accessing it through the uncached mapping. So at least currently, we're all in the same boat when it comes to new object allocations: they will be expensive unless you already have some uncached mappings you can re-use.

-- Jesse Barnes, Intel Open Source Technology Center

Arnd Bergmann

6:29 p.m.

On Friday 29 April 2011 18:32:09 Jesse Barnes wrote:

...

On Fri, 29 Apr 2011 08:59:58 +0100 Russell King - ARM Linux linux@arm.linux.org.uk wrote:

...
On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:

...
However, we should be able to construct a completely generic api around these operations, and for architectures that don't support them we need to determine

a) Whether we want to support them anyway (IIRC the problem with PPC is that the linear kernel map has huge tlb entries that are very inefficient to break up?)

That same issue applies to ARM too - you'd need to stop the entire machine, rewrite all processes page tables, flush tlbs, and only then restart. Otherwise there's the possibility of ending up with conflicting types of TLB entries, and I'm not sure what the effect of having two matching TLB entries for the same address would be.

Right, I don't think anyone wants to see this sort of thing happen with any frequency. So either a large, uncached region can be set up a boot time for allocations, or infrequent, large requests and conversions can be made on demand, with memory being freed back to the main, coherent pool under pressure.

I'd like to first have an official confirmation from the CPU designers if there is actually a problem with mapping a single page both cacheable and noncacheable. Based on what Catalin said, it's probably allowed and the current spec is just being more paranoid than it needs to be. Also, KyongHo Cho said that it might only be relevant for pages that are mapped executable.

If that is the case, we can probably work around this by turning the entire linear mapping (except for the kernel binary) into nonexecutable mode, if we don't do that already. This is desirable for security purposes anyway.

Arnd

Russell King - ARM Linux

10:15 p.m.

On Fri, Apr 29, 2011 at 08:29:50PM +0200, Arnd Bergmann wrote:

...

On Friday 29 April 2011 18:32:09 Jesse Barnes wrote:

...
On Fri, 29 Apr 2011 08:59:58 +0100 Russell King - ARM Linux linux@arm.linux.org.uk wrote:

...
On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:

...
However, we should be able to construct a completely generic api around these operations, and for architectures that don't support them we need to determine

a) Whether we want to support them anyway (IIRC the problem with PPC is that the linear kernel map has huge tlb entries that are very inefficient to break up?)

That same issue applies to ARM too - you'd need to stop the entire machine, rewrite all processes page tables, flush tlbs, and only then restart. Otherwise there's the possibility of ending up with conflicting types of TLB entries, and I'm not sure what the effect of having two matching TLB entries for the same address would be.

Right, I don't think anyone wants to see this sort of thing happen with any frequency. So either a large, uncached region can be set up a boot time for allocations, or infrequent, large requests and conversions can be made on demand, with memory being freed back to the main, coherent pool under pressure.

I'd like to first have an official confirmation from the CPU designers if there is actually a problem with mapping a single page both cacheable and noncacheable.

Everytime this gets discussed, someone says that because they don't believe what I say. OMAP folk confirmed it last time around.

I'm getting tired of this. I'm going to give up with answering any further Linux questions until next week and I'll delete my entire mailbox this weekend as I really can't be bothered to catch up with all the crap that's happened over easter. I'm really getting pissed off at all the shite crap that's flying around at the moment that I'm really starting to not care one ounce about Linux, either on ARM or on this utterly shite and broken x86 hardware.

Let ARM rot in mainline. I really don't care anymore.

David Brown

2 May 2 May

4:42 a.m.

On Fri, Apr 29 2011, Russell King - ARM Linux wrote:

...

On Fri, Apr 29, 2011 at 08:29:50PM +0200, Arnd Bergmann wrote:

...

...
I'd like to first have an official confirmation from the CPU designers if there is actually a problem with mapping a single page both cacheable and noncacheable.

Everytime this gets discussed, someone says that because they don't believe what I say. OMAP folk confirmed it last time around.

I'll confirm this from the Qualcomm side as well. You cannot have multiple inconsistent mappings of the same page without having difficult to find problems.

The spec clarifications appear to give ways of dealing with it if it happens, and bounds on what can go wrong, but I wouldn't call it something we want to do normally. Corrupt data is arguably less of a problem than nasal demons, but still a problem.

David

...

I'm getting tired of this. I'm going to give up with answering any further Linux questions until next week and I'll delete my entire mailbox this weekend as I really can't be bothered to catch up with all the crap that's happened over easter. I'm really getting pissed off at all the shite crap that's flying around at the moment that I'm really starting to not care one ounce about Linux, either on ARM or on this utterly shite and broken x86 hardware.

Let ARM rot in mainline. I really don't care anymore.

Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

-- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

Arnd Bergmann

11:26 a.m.

On Monday 02 May 2011, David Brown wrote:

...

I'll confirm this from the Qualcomm side as well. You cannot have multiple inconsistent mappings of the same page without having difficult to find problems.

I believe Catalin was referring to the case where you have only one nonconsistent (cacheable) mapping plus multiple consistent (cacheable) mappings. I don't think anyone has suggested doing DMA to a page that has multiple nonconsistent mappings with virtually indexed caches.

...

The spec clarifications appear to give ways of dealing with it if it happens, and bounds on what can go wrong, but I wouldn't call it something we want to do normally. Corrupt data is arguably less of a problem than nasal demons, but still a problem.

Anything that has a theoretical chance of corrupting data is not an option, but I'd really like to see what the clarified spec says about this. Even if there is a way to legally leave a page for dma_alloc_coherent in the linear mapping, it might turn out to be harder to do than using highmem pages or unmapping supersections at run time as was suggested.

Arnd

Benjamin Herrenschmidt

29 Apr 29 Apr

10:37 p.m.

On Fri, 2011-04-29 at 20:29 +0200, Arnd Bergmann wrote:

...

If that is the case, we can probably work around this by turning the entire linear mapping (except for the kernel binary) into nonexecutable mode, if we don't do that already. This is desirable for security purposes anyway

You'd still have an "edge" problem if you use large pages for the linear mapping, you can't obviously make part of the kernel text NX and you'd have to make sure you 'exclude' from those GPU allocations whatever overlaps with your last executable large page.

In a way, it's a similar problem I have with bolted memory on BookE where I can't restrict GPU allocations to memory that isn't bolted :-)

Cheers, Ben.

Joerg Roedel

1:42 p.m.

On Thu, Apr 28, 2011 at 03:37:00PM -0400, Jerome Glisse wrote:

...

As Jesse pointed out already, for performance reasons it's lot better if you let the driver decide even if you have an iommu capable of handling coherency for you. My understanding is that each time coherency is asked for it trigger bus activities of some kind (i think snoop is the term used for pci) this traffic can slow down both the cpu and the device. For graphic driver we have a lot of write once and use (once or more) buffer and it makes a lot of sense to have those buffer allocated using uncached memory so we can tell the device (in case of drm driver) that there is no need to trigger snoop activities for coherency. So i believe the decision should ultimately be in the driver side.

Stupid question: Couldn't these write-once-read-often buffers just stay in the memory of the GPU instead of refetching them every time from main memory? Or is that necessary because of the limited space on some GPUs?

Regards,

Joerg

Jerome Glisse

2:19 p.m.

On Fri, Apr 29, 2011 at 9:42 AM, Joerg Roedel joro@8bytes.org wrote:

...

On Thu, Apr 28, 2011 at 03:37:00PM -0400, Jerome Glisse wrote:

...
As Jesse pointed out already, for performance reasons it's lot better if you let the driver decide even if you have an iommu capable of handling coherency for you. My understanding is that each time coherency is asked for it trigger bus activities of some kind (i think snoop is the term used for pci) this traffic can slow down both the cpu and the device. For graphic driver we have a lot of write once and use (once or more) buffer and it makes a lot of sense to have those buffer allocated using uncached memory so we can tell the device (in case of drm driver) that there is no need to trigger snoop activities for coherency. So i believe the decision should ultimately be in the driver side.

Stupid question: Couldn't these write-once-read-often buffers just stay in the memory of the GPU instead of refetching them every time from main memory? Or is that necessary because of the limited space on some GPUs?

Regards,

Joerg

We might be talking about several G of data, so using system is not uncommon. Also when uploading data to GPU vram is better to let the GPU do dma from system memory rather than having the CPU do memcpy.

Cheers, Jerome

Jordan Crouse

3:37 p.m.

On 04/29/2011 07:42 AM, Joerg Roedel wrote:

...

On Thu, Apr 28, 2011 at 03:37:00PM -0400, Jerome Glisse wrote:

...
As Jesse pointed out already, for performance reasons it's lot better if you let the driver decide even if you have an iommu capable of handling coherency for you. My understanding is that each time coherency is asked for it trigger bus activities of some kind (i think snoop is the term used for pci) this traffic can slow down both the cpu and the device. For graphic driver we have a lot of write once and use (once or more) buffer and it makes a lot of sense to have those buffer allocated using uncached memory so we can tell the device (in case of drm driver) that there is no need to trigger snoop activities for coherency. So i believe the decision should ultimately be in the driver side.

Stupid question: Couldn't these write-once-read-often buffers just stay in the memory of the GPU instead of refetching them every time from main memory? Or is that necessary because of the limited space on some GPUs?

Not all embedded GPUs have their own dedicated memory. On the MSM architecture the devices and the CPU share the same physical pool.

Jordan

FUJITA Tomonori

28 Apr 28 Apr

2:38 p.m.

I'm busy at a conference so I've not read the whole thread yet..

On Thu, 28 Apr 2011 16:29:52 +0200 Arnd Bergmann arnd@arndb.de wrote:

...

I was mistaken there: all drivers that use dma_alloc_noncoherent either use dma_cache_sync() or they do something that is more broken, but they don't do dma_sync_*.

As the DMA-API.txt says, dma_alloc_noncoherent should be used with dma_cache_sync(). You shouldn't use dma_sync* API with a memory returned by dma_alloc_noncoherent().

Benjamin Herrenschmidt

29 Apr 29 Apr

12:25 a.m.

...

However, dma_alloc_coherent() memory can't be used with the dma_sync_* API as its return address (unlike other architectures) is not in the kernel direct mapped memory range.

Well, on non-coherent architectures, dma_sync_* are cache flushes, I don't see the point of doing those on a non-cachable mapping anyways.

...

The only thing valid for dma_sync_* are buffers which have been passed to the dma_map_* APIs.

Right, at least that's our expectation on powerpc as well.

...

Instead, I think what you're referring to is dma_cache_sync(), which is the API to be used with dma_alloc_noncoherent(), which we don't implement.

Too may confusing APIs....

Ben.

...

As we have problems with some SMP implementations, and the noncoherent API doesn't have the idea of buffer ownership, it's rather hard to deal with the DMA cache implications with the existing API, especially with the issues of speculative prefetching. The current usage (looking at drivers/scsi/53c700.c) doesn't cater for speculative prefetch as the dma_cache_sync(,,,DMA_FROM_DEVICE) is done well in advance of the DMA actually happening.

So all in all, I think the noncoherent API is broken as currently designed - and until we have devices on ARM which use it, I don't see much point into trying to fix the current thing especially as we'd be unable to test. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Arnd Bergmann

11:21 a.m.

On Friday 29 April 2011, Benjamin Herrenschmidt wrote:

...

...
However, dma_alloc_coherent() memory can't be used with the dma_sync_* API as its return address (unlike other architectures) is not in the kernel direct mapped memory range.

Well, on non-coherent architectures, dma_sync_* are cache flushes, I don't see the point of doing those on a non-cachable mapping anyways.

The point was that you cannot do

#define dma_alloc_coherent dma_alloc_noncoherent

on ARM, as some other architectures do.

Arnd

Joerg Roedel

28 Apr 28 Apr

10:41 a.m.

On Wed, Apr 27, 2011 at 08:35:14AM +0100, Russell King - ARM Linux wrote:

...

On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

dma_map_ops design is broken - we can't have the entire DMA API indirected through that structure. Whether you have an IOMMU or not is completely independent of whether you have to do DMA cache handling. Moreover, with dmabounce, having the DMA cache handling in place doesn't make sense.

So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't work like that.

Nobody says that the complete feature-set of the dma_ops needs to be provided through the IOMMU-API. The different APIs are there to solve different problems:

The IOMMU-API provides low-level access to IOMMU hardware and to map io addresses to physical addresses (which can be chosen by the caller). The IOMMU-API does not care about address space layout or cache management.

The DMA-API cares about address management. Every dma_ops implementation using an IOMMU has an address allocator for io addresses implemented. the DMA-API also cares about cache-management.

So if we can abstract the different IOMMUs on all architectures in the IOMMU-API I see no reason why we can't have a common dma_ops implementation. The dma-buffer ownership management (cpu<->device) can be put into archtectural call-backs so that architectures that need it just implement them and everything should work.

Or I am too naive to believe that (which is possible because of my limited ARM knowledge). In this case please correct me :)

Regards,

Joerg

Russell King - ARM Linux

11:01 a.m.

On Thu, Apr 28, 2011 at 12:41:44PM +0200, Joerg Roedel wrote:

...

So if we can abstract the different IOMMUs on all architectures in the IOMMU-API I see no reason why we can't have a common dma_ops implementation. The dma-buffer ownership management (cpu<->device) can be put into archtectural call-backs so that architectures that need it just implement them and everything should work.

That is precisely what I'm arguing for. The DMA cache management is architecture specific and should stay in the architecture specific code. The IOMMU level stuff should bolt into that at the architecture specific level.

So, eg, for ARM:

dma_addr_t dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, enum dma_data_direction dir) { struct dma_map_ops *ops = get_dma_ops(dev); dma_addr_t addr;

BUG_ON(!valid_dma_direction(dir)); if (ops->flags & DMA_MANAGE_CACHE || !dev->dma_cache_coherent) __dma_page_cpu_to_dev(page, offset, size, dir); addr = ops->map_page(dev, page, offset, size, dir, NULL); debug_dma_map_page(dev, page, offset, size, dir, addr, false);

return addr; }

Things like swiotlb and dmabounce would not set DMA_MANAGE_CACHE in ops->flags, but real iommus and the standard no-iommu implementations would be required to set it to ensure that data is visible in memory for CPUs which have DMA incoherent caches.

Maybe renaming DMA_MANAGE_CACHE to DMA_DATA_DEVICE_VISIBLE or something like that would be more explicit as to its function.

dev->dma_cache_coherent serves to cover the case mentioned by the DRM folk.

Joerg Roedel

12:25 p.m.

On Thu, Apr 28, 2011 at 12:01:29PM +0100, Russell King - ARM Linux wrote:

...

dma_addr_t dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, enum dma_data_direction dir) { struct dma_map_ops *ops = get_dma_ops(dev); dma_addr_t addr;

BUG_ON(!valid_dma_direction(dir)); if (ops->flags & DMA_MANAGE_CACHE || !dev->dma_cache_coherent) __dma_page_cpu_to_dev(page, offset, size, dir); addr = ops->map_page(dev, page, offset, size, dir, NULL); debug_dma_map_page(dev, page, offset, size, dir, addr, false);

return addr; }

Things like swiotlb and dmabounce would not set DMA_MANAGE_CACHE in ops->flags, but real iommus and the standard no-iommu implementations would be required to set it to ensure that data is visible in memory for CPUs which have DMA incoherent caches.

Do we need flags for that? A flag is necessary if the cache-management differs between IOMMU implementations on the same platform. If cache-management is only specific to the platform (or architecture) then it does make more sense to just call the function without flag checking and every platform with coherent DMA just implements these as static inline noops.

Regards,

Joerg

Russell King - ARM Linux

12:42 p.m.

On Thu, Apr 28, 2011 at 02:25:09PM +0200, Joerg Roedel wrote:

...

On Thu, Apr 28, 2011 at 12:01:29PM +0100, Russell King - ARM Linux wrote:

...
dma_addr_t dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, enum dma_data_direction dir) { struct dma_map_ops *ops = get_dma_ops(dev); dma_addr_t addr;

BUG_ON(!valid_dma_direction(dir)); if (ops->flags & DMA_MANAGE_CACHE || !dev->dma_cache_coherent) __dma_page_cpu_to_dev(page, offset, size, dir); addr = ops->map_page(dev, page, offset, size, dir, NULL); debug_dma_map_page(dev, page, offset, size, dir, addr, false);

return addr; }

Things like swiotlb and dmabounce would not set DMA_MANAGE_CACHE in ops->flags, but real iommus and the standard no-iommu implementations would be required to set it to ensure that data is visible in memory for CPUs which have DMA incoherent caches.

Do we need flags for that? A flag is necessary if the cache-management differs between IOMMU implementations on the same platform. If cache-management is only specific to the platform (or architecture) then it does make more sense to just call the function without flag checking and every platform with coherent DMA just implements these as static inline noops.

Sigh. You're not seeing the point.

There is _no_ point doing the cache management _if_ we're using something like dmabounce or swiotlb, as we'll be using memcpy() at some point with the buffer. Moreover, dmabounce or swiotlb may have to do its own cache management _after_ that memcpy() to ensure that the page cache requirements are met.

Doing DMA cache management for dmabounce or swiotlb will result in unnecessary overhead - and as we can see from the MMC discussions, it has a _significant_ performance impact.

Think about it. If you're using dmabounce, but still do the cache management:

1. you flush the data out of the CPU cache back to memory. 2. you allocate new memory using dma_alloc_coherent() for the DMA buffer which is accessible to the device. 3. you memcpy() the data out of the buffer you just flushed into the DMA buffer - this re-fills the cache, evicting entries which may otherwise be hot due to the cache fill policy.

Step 1 is entirely unnecessary and is just a complete and utter waste of CPU resources.

Joerg Roedel

12:59 p.m.

On Thu, Apr 28, 2011 at 01:42:42PM +0100, Russell King - ARM Linux wrote:

...

Sigh. You're not seeing the point.

There is _no_ point doing the cache management _if_ we're using something like dmabounce or swiotlb, as we'll be using memcpy() at some point with the buffer. Moreover, dmabounce or swiotlb may have to do its own cache management _after_ that memcpy() to ensure that the page cache requirements are met.

Well, I was talking about a generic dma_ops implementation based on the iommu-api so that every system that has iommu hardware can use a common code-set. If you have to dma-bounce you don't have iommu hardware and thus you don't use this common implementation of dma_ops (but probably the swiotlb implementation which is already mostly generic).

...

Doing DMA cache management for dmabounce or swiotlb will result in unnecessary overhead - and as we can see from the MMC discussions, it has a _significant_ performance impact.

Yeah, I see that from your explanation below. But as I said, swiotlb backend is not a target use-case for a common iommu-api-bound dma_ops implementation.

...

Think about it. If you're using dmabounce, but still do the cache management:

you flush the data out of the CPU cache back to memory.

you allocate new memory using dma_alloc_coherent() for the DMA buffer which is accessible to the device.

you memcpy() the data out of the buffer you just flushed into the DMA buffer - this re-fills the cache, evicting entries which may otherwise be hot due to the cache fill policy.

Step 1 is entirely unnecessary and is just a complete and utter waste of CPU resources.

Thanks for the explanation.

Regards,

Joerg

Arnd Bergmann

1:02 p.m.

On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...

...
Do we need flags for that? A flag is necessary if the cache-management differs between IOMMU implementations on the same platform. If cache-management is only specific to the platform (or architecture) then it does make more sense to just call the function without flag checking and every platform with coherent DMA just implements these as static inline noops.

Sigh. You're not seeing the point.

There is no point doing the cache management if we're using something like dmabounce or swiotlb, as we'll be using memcpy() at some point with the buffer. Moreover, dmabounce or swiotlb may have to do its own cache management after that memcpy() to ensure that the page cache requirements are met.

I think the misunderstanding is that you are saying we need the flag in dma_map_ops because you prefer to keep the cache management outside of the individual dma_map_ops implementations.

What I guess Jörg is thinking of is to have the generic IOMMU version of dma_map_ops call into the architecture specific code to manage the caches on architectures that need it. That implementation would of course not require the flag in dma_map_ops because the architecture specific callback would use other ways (hardcoded for an architecture, or looking at the individual device) to determine if this is ever needed.

That is also what I had in mind earlier, but you argued against it on the base that putting the logic into the common code would lead to a higher risk of people accidentally breaking it when they only care about coherent architectures.

Arnd

Russell King - ARM Linux

1:19 p.m.

On Thu, Apr 28, 2011 at 03:02:16PM +0200, Arnd Bergmann wrote:

...

I think the misunderstanding is that you are saying we need the flag in dma_map_ops because you prefer to keep the cache management outside of the individual dma_map_ops implementations.

What I guess Jörg is thinking of is to have the generic IOMMU version of dma_map_ops call into the architecture specific code to manage the caches on architectures that need it. That implementation would of course not require the flag in dma_map_ops because the architecture specific callback would use other ways (hardcoded for an architecture, or looking at the individual device) to determine if this is ever needed.

That is also what I had in mind earlier, but you argued against it on the base that putting the logic into the common code would lead to a higher risk of people accidentally breaking it when they only care about coherent architectures.

You still need this same cache handling code even when you don't have an iommu. I don't see the point in having a dma_ops level of indirection followed by a separate iommu_ops level of indirection - it seems to me to be a waste of code and CPU time, and I don't see why its even necessary when there's a much simpler way to deal with it (as I illustrated).

Joerg Roedel

1:56 p.m.

On Thu, Apr 28, 2011 at 02:19:28PM +0100, Russell King - ARM Linux wrote:

...

On Thu, Apr 28, 2011 at 03:02:16PM +0200, Arnd Bergmann wrote:

...

You still need this same cache handling code even when you don't have an iommu.

You can reference the same code from different places.

...

I don't see the point in having a dma_ops level of indirection followed by a separate iommu_ops level of indirection - it seems to me to be a waste of code and CPU time, and I don't see why its even necessary when there's a much simpler way to deal with it (as I illustrated).

There is no waste of code, just the opposite. Most of the dma_ops implementations that use an IOMMU today have a lot of similiarities in their code. All this code (on x86, alpha, sparc, ia64, ...) can be unified to a generic solution that fits all (by abstracting the differences between iommus into the iommu-api). So the current situation is a much bigger code waste than having this unified. The ARM platforms supporting iommu hardware will benefit from this as well. It simply doesn't make sense to have one dma_ops implementation for each iommu hardware around.

Regards,

Joerg

Russell King - ARM Linux

2:30 p.m.

On Thu, Apr 28, 2011 at 03:56:21PM +0200, Joerg Roedel wrote:

...

There is no waste of code, just the opposite. Most of the dma_ops implementations that use an IOMMU today have a lot of similiarities in their code. All this code (on x86, alpha, sparc, ia64, ...) can be unified to a generic solution that fits all (by abstracting the differences between iommus into the iommu-api). So the current situation is a much bigger code waste than having this unified. The ARM platforms supporting iommu hardware will benefit from this as well. It simply doesn't make sense to have one dma_ops implementation for each iommu hardware around.

I'll defer until there's patches available then - I don't think there's much value continuing to discuss this until that time.

Catalin Marinas

27 Apr 27 Apr

9:52 a.m.

Arnd,

On 21 April 2011 20:29, Arnd Bergmann arnd@arndb.de wrote:

...

I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

Fix the arm version of dma_alloc_coherent. It's in use today and

is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

...

Implement dma_alloc_noncoherent on ARM. Marek pointed out

that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

As Russell pointed out, there are 4 main combinations with iommu and some coherency support (i.e. being able to snoop the CPU caches). But in an SoC you can have different devices with different iommu and coherency configurations. Some of them may even be able to see the L2 cache but not the L1 (in which case it would help if we can get an inner non-cacheable outer cacheable mapping).

Anyway, we end up with different DMA ops per device via dev_archdata.

-- Catalin

Arnd Bergmann

10:43 a.m.

On Wednesday 27 April 2011, Catalin Marinas wrote:

...

Arnd,

On 21 April 2011 20:29, Arnd Bergmann arnd@arndb.de wrote:

...
I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

Fix the arm version of dma_alloc_coherent. It's in use today and

is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

Thanks for that information, I believe a number of people in the previous discussions were relying on the information from the documentation. Are you sure that this is not only correct for the cores made by ARM ltd but also for the other implementations that may have relied on documentation?

As I mentioned before, there are other architectures, where having conflicting cache settings in TLB entries for the same pysical page immediately checkstops the CPU, and I guess that this was also allowed by the current version of the ARM ARM.

...

...

Implement dma_alloc_noncoherent on ARM. Marek pointed out

that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

As Russell pointed out, there are 4 main combinations with iommu and some coherency support (i.e. being able to snoop the CPU caches). But in an SoC you can have different devices with different iommu and coherency configurations. Some of them may even be able to see the L2 cache but not the L1 (in which case it would help if we can get an inner non-cacheable outer cacheable mapping).

Anyway, we end up with different DMA ops per device via dev_archdata.

Having different DMA ops per device was the solution that I was suggesting with dma_mapping_common.h, but Russell pointed out that it may not be the best option.

The alternative would be to have just one set of dma_mapping functions as we do today, but to extend the functions to also cover the iommu case, for instance (example, don't take literally):

static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, size_t size, enum dma_data_direction dir) { dma_addr_t ret;

#ifdef CONFIG_DMABOUNCE if (dev->archdata.dmabounce) return dmabounce_map_single(dev, cpu_addr, size, dir); #endif

#ifdef CONFIG_IOMMU if (dev->archdata.iommu) ret = iommu_map_single(dev, cpu_addr, size, dir); else #endif dma_addr = virt_to_dma(dev, ptr);

dma_sync_single_for_device(dev, dma_addr, size, dir); }

This would not even conflict with having a common implementation for iommu based dma_map_ops -- we would just call the iommu functions directly when needed rather than having an indirect function call.

Arnd

Catalin Marinas

11:08 a.m.

On Wed, 2011-04-27 at 11:43 +0100, Arnd Bergmann wrote:

...

On Wednesday 27 April 2011, Catalin Marinas wrote:

...
On 21 April 2011 20:29, Arnd Bergmann arnd@arndb.de wrote:

...
I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

Fix the arm version of dma_alloc_coherent. It's in use today and

is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

Thanks for that information, I believe a number of people in the previous discussions were relying on the information from the documentation. Are you sure that this is not only correct for the cores made by ARM ltd but also for the other implementations that may have relied on documentation?

It is a clarification in the ARM ARM so it covers all the cores made by architecture licensees, not just ARM Ltd. It basically makes the "unpredictable" part more predictable to allow certain types of aliases (e.g. Strongly Ordered vs Normal memory would still be disallowed).

All the current implementations are safe with Normal memory aliases (cacheable vs non-cacheable) but of course, there may be some performance benefits in not having any alias.

...

As I mentioned before, there are other architectures, where having conflicting cache settings in TLB entries for the same pysical page immediately checkstops the CPU, and I guess that this was also allowed by the current version of the ARM ARM.

The current version of the ARM ARM says "unpredictable". But this general definition of "unpredictable" does not allow it to deadlock (hardware) or have security implications. It is however allowed to corrupt data.

...

...
...

Implement dma_alloc_noncoherent on ARM. Marek pointed out

that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

As Russell pointed out, there are 4 main combinations with iommu and some coherency support (i.e. being able to snoop the CPU caches). But in an SoC you can have different devices with different iommu and coherency configurations. Some of them may even be able to see the L2 cache but not the L1 (in which case it would help if we can get an inner non-cacheable outer cacheable mapping).

Anyway, we end up with different DMA ops per device via dev_archdata.

Having different DMA ops per device was the solution that I was suggesting with dma_mapping_common.h, but Russell pointed out that it may not be the best option.

IMHO, that's the most flexible option. I can't say for sure whether we'll need such flexibility in the future.

...

The alternative would be to have just one set of dma_mapping functions as we do today, but to extend the functions to also cover the iommu case, for instance (example, don't take literally):

static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, size_t size, enum dma_data_direction dir) { dma_addr_t ret;

#ifdef CONFIG_DMABOUNCE if (dev->archdata.dmabounce) return dmabounce_map_single(dev, cpu_addr, size, dir); #endif

#ifdef CONFIG_IOMMU if (dev->archdata.iommu) ret = iommu_map_single(dev, cpu_addr, size, dir); else #endif dma_addr = virt_to_dma(dev, ptr);
    dma_sync_single_for_device(dev, dma_addr, size, dir);
}

This would not even conflict with having a common implementation for iommu based dma_map_ops -- we would just call the iommu functions directly when needed rather than having an indirect function call.

I don't particularly like having lots of #ifdef's (but we could probably have some macros checking archdata.* to make this cleaner).

We also need a way to specify a coherency level as we are getting platforms with devices connected to something like ACP (ARM Coherency Port).

-- Catalin

Valdis.Kletnieks＠vt.edu

28 Apr 28 Apr

12:15 a.m.

On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:

...

The current version of the ARM ARM says "unpredictable". But this general definition of "unpredictable" does not allow it to deadlock (hardware) or have security implications. It is however allowed to corrupt data.

Not allowed to have security implications, but is allowed to corrupt data.

*boggle* :)

(The problem being, of course, that if the attacker is able to predict/control what gets corrupted, it can easily end up leveraged into a security implication.)

Catalin Marinas

8:27 a.m.

On Thu, 2011-04-28 at 01:15 +0100, Valdis.Kletnieks@vt.edu wrote:

...

On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:

...
The current version of the ARM ARM says "unpredictable". But this general definition of "unpredictable" does not allow it to deadlock (hardware) or have security implications. It is however allowed to corrupt data.

Not allowed to have security implications, but is allowed to corrupt data.

By security I was referring to TrustZone extensions. IOW, unpredictable in normal (non-secure) world should not cause data corruption in the secure world.

-- Catalin

Arnd Bergmann

12:12 p.m.

On Thursday 28 April 2011, Catalin Marinas wrote:

...

On Thu, 2011-04-28 at 01:15 +0100, Valdis.Kletnieks@vt.edu wrote:

...
On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:

...
The current version of the ARM ARM says "unpredictable". But this general definition of "unpredictable" does not allow it to deadlock (hardware) or have security implications. It is however allowed to corrupt data.

Not allowed to have security implications, but is allowed to corrupt data.

By security I was referring to TrustZone extensions. IOW, unpredictable in normal (non-secure) world should not cause data corruption in the secure world.

That definition is rather useless for operating systems that don't use Trustzone then, right?

Arnd

Russell King - ARM Linux

12:36 p.m.

On Thu, Apr 28, 2011 at 02:12:40PM +0200, Arnd Bergmann wrote:

...

On Thursday 28 April 2011, Catalin Marinas wrote:

...
On Thu, 2011-04-28 at 01:15 +0100, Valdis.Kletnieks@vt.edu wrote:

...
On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:

...
The current version of the ARM ARM says "unpredictable". But this general definition of "unpredictable" does not allow it to deadlock (hardware) or have security implications. It is however allowed to corrupt data.

Not allowed to have security implications, but is allowed to corrupt data.

By security I was referring to TrustZone extensions. IOW, unpredictable in normal (non-secure) world should not cause data corruption in the secure world.

That definition is rather useless for operating systems that don't use Trustzone then, right?

I'm not sure what you're implying. By running on a device with Trustzone extensions, Linux is using them whether it knows it or not.

Linux on ARMs evaluation boards runs on the secure size of the Trustzone dividing line. Linux on OMAP SoCs runs on the insecure size of that, and has to make secure monitor calls to manipulate certain registers (eg, to enable workarounds for errata etc). As SMC calls are highly implementation specific, there is and can be no "trustzone" driver.

Arnd Bergmann

12:48 p.m.

On Thursday 28 April 2011, Russell King - ARM Linux wrote:

...

On Thu, Apr 28, 2011 at 02:12:40PM +0200, Arnd Bergmann wrote:

...
On Thursday 28 April 2011, Catalin Marinas wrote:

...
On Thu, 2011-04-28 at 01:15 +0100, Valdis.Kletnieks@vt.edu wrote:

...
On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:

...
The current version of the ARM ARM says "unpredictable". But this general definition of "unpredictable" does not allow it to deadlock (hardware) or have security implications. It is however allowed to corrupt data.

Not allowed to have security implications, but is allowed to corrupt data.

By security I was referring to TrustZone extensions. IOW, unpredictable in normal (non-secure) world should not cause data corruption in the secure world.

That definition is rather useless for operating systems that don't use Trustzone then, right?

I'm not sure what you're implying. By running on a device with Trustzone extensions, Linux is using them whether it knows it or not.

Linux on ARMs evaluation boards runs on the secure size of the Trustzone dividing line. Linux on OMAP SoCs runs on the insecure size of that, and has to make secure monitor calls to manipulate certain registers (eg, to enable workarounds for errata etc). As SMC calls are highly implementation specific, there is and can be no "trustzone" driver.

My point was that when Linux runs in the secure partition (ok, I didn't know we did that, but still), anything that corrupts Linux data has security implications. If Linux runs outside of Trustzone, you can also currupt Linux and the security is completely pointless because after Linux is gone, you have nothing left that drives your devices or runs user processes.

The only case where TrustZone would help is when you have an operating system running in the secure partition as some sort of microkernel (a.k.a. hypervisor) and have the "unpredictable" behavior isolated in nonessential parts of the system.

Arnd

Dave Martin

3 May 3 May

2:45 p.m.

On Thu, Apr 28, 2011 at 02:12:40PM +0200, Arnd Bergmann wrote:

...

On Thursday 28 April 2011, Catalin Marinas wrote:

...
On Thu, 2011-04-28 at 01:15 +0100, Valdis.Kletnieks@vt.edu wrote:

...
On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:

...
The current version of the ARM ARM says "unpredictable". But this general definition of "unpredictable" does not allow it to deadlock (hardware) or have security implications. It is however allowed to corrupt data.

Not allowed to have security implications, but is allowed to corrupt data.

By security I was referring to TrustZone extensions. IOW, unpredictable in normal (non-secure) world should not cause data corruption in the secure world.

That definition is rather useless for operating systems that don't use Trustzone then, right?

IIUC, the restriction on unpredictable behaviour is basically that the processor can't do anything which would result in or otherwise imply an escalation of privilege.

TrustZone is one kind of privilege, but there are plenty of other operations implying privilege (entering privileged mode from user mode, masking or intercepting interrupts or exceptions, bypassing or reconfiguring MMU permissions etc.) "Unpredictable" behaviours are not allowed to have any such consequences IIRC. Without that restriction you wouldn't really have any OS security at all.

In the kernel, we do have to be careful about avoiding unpredictable behaviours, since we're already running at maximum privilege (not including TZ) -- so the damage which unpredictable behaviours can wreak is much greater, by running invalid code, misconfiguring the MMU, allowing caches to get out of sync etc. But that's not fundamentally different from the general need to avoid kernel bugs -- the scope of _any_ kernel code to do damage is greater than for userspace code, whether it involves architecturally unpredictable behaviour, or just plain ordinary bugs or security holes in the C code.

---Dave

...

Arnd

linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Arnd Bergmann

29 Apr 29 Apr

3:41 p.m.

On Wednesday 27 April 2011, Catalin Marinas wrote:

...

...
...
It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

Thanks for that information, I believe a number of people in the previous discussions were relying on the information from the documentation. Are you sure that this is not only correct for the cores made by ARM ltd but also for the other implementations that may have relied on documentation?

It is a clarification in the ARM ARM so it covers all the cores made by architecture licensees, not just ARM Ltd. It basically makes the "unpredictable" part more predictable to allow certain types of aliases (e.g. Strongly Ordered vs Normal memory would still be disallowed).

All the current implementations are safe with Normal memory aliases (cacheable vs non-cacheable) but of course, there may be some performance benefits in not having any alias.

A lot of the discussions we are about to have in Budapest will be around solving the problem of having only valid combinations of mappings, so we really need to have a clear statement in specification form about what is actually valid.

Would it be possible to have an updated version of the relevant section of the ARM ARM by next week so we can use that as the base for our discussions?

Arnd

Catalin Marinas

4:42 p.m.

On Friday, 29 April 2011, Arnd Bergmann arnd@arndb.de wrote:

...

On Wednesday 27 April 2011, Catalin Marinas wrote:

...
...
...
It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

Thanks for that information, I believe a number of people in the previous discussions were relying on the information from the documentation. Are you sure that this is not only correct for the cores made by ARM ltd but also for the other implementations that may have relied on documentation?

It is a clarification in the ARM ARM so it covers all the cores made by architecture licensees, not just ARM Ltd. It basically makes the "unpredictable" part more predictable to allow certain types of aliases (e.g. Strongly Ordered vs Normal memory would still be disallowed).

All the current implementations are safe with Normal memory aliases (cacheable vs non-cacheable) but of course, there may be some performance benefits in not having any alias.

A lot of the discussions we are about to have in Budapest will be around solving the problem of having only valid combinations of mappings, so we really need to have a clear statement in specification form about what is actually valid.

Would it be possible to have an updated version of the relevant section of the ARM ARM by next week so we can use that as the base for our discussions?

I'll ask the architecture people here in ARM and get back to you (there is holiday until Tuesday next week in the UK).

-- Catalin

Laurent Pinchart

3 May 3 May

3:05 p.m.

On Wednesday 27 April 2011 12:43:16 Arnd Bergmann wrote:

...

On Wednesday 27 April 2011, Catalin Marinas wrote:

...
On 21 April 2011 20:29, Arnd Bergmann arnd@arndb.de wrote:

...
I think the recent discussions on linaro-mm-sig and the BoF last week at ELC have been quite productive, and at least my understanding of the missing pieces has improved quite a bit. This is a list of things that I think need to be done in the kernel. Please complain if any of these still seem controversial:

Fix the arm version of dma_alloc_coherent. It's in use today and

is broken on modern CPUs because it results in both cached and uncached mappings. Rebecca suggested different approaches how to get there.

It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

Thanks for that information, I believe a number of people in the previous discussions were relying on the information from the documentation. Are you sure that this is not only correct for the cores made by ARM ltd but also for the other implementations that may have relied on documentation?

As I mentioned before, there are other architectures, where having conflicting cache settings in TLB entries for the same pysical page immediately checkstops the CPU, and I guess that this was also allowed by the current version of the ARM ARM.

...
...

Implement dma_alloc_noncoherent on ARM. Marek pointed out

that this is needed, and it currently is not implemented, with an outdated comment explaining why it used to not be possible to do it.

As Russell pointed out, there are 4 main combinations with iommu and some coherency support (i.e. being able to snoop the CPU caches). But in an SoC you can have different devices with different iommu and coherency configurations. Some of them may even be able to see the L2 cache but not the L1 (in which case it would help if we can get an inner non-cacheable outer cacheable mapping).

Anyway, we end up with different DMA ops per device via dev_archdata.

Having different DMA ops per device was the solution that I was suggesting with dma_mapping_common.h, but Russell pointed out that it may not be the best option.

The alternative would be to have just one set of dma_mapping functions as we do today, but to extend the functions to also cover the iommu case, for instance (example, don't take literally):

static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, size_t size, enum dma_data_direction dir) { dma_addr_t ret;

#ifdef CONFIG_DMABOUNCE if (dev->archdata.dmabounce) return dmabounce_map_single(dev, cpu_addr, size, dir); #endif

#ifdef CONFIG_IOMMU if (dev->archdata.iommu) ret = iommu_map_single(dev, cpu_addr, size, dir); else #endif

I wish it was that simple.

The OMAP4 ISS (Imaging Subsystem) has no IOMMU, but it can use the OMAP4 DMM (Dynamic Memory Manager) which acts as a memory remapper. Basically (if my understanding is correct), the ISS is configured to read/write from/to physical addresses. If those physical addresses are in the DMM address range, the DMM translates the accesses to physical accesses, acting as an IOMMU.

The ISS can thus write to physically contiguous memory directly, or to scattered physical pages through the DMM. Whether an IOMMU (or, to be correct in this case, the IOMMU-like DMM) needs to handle the DMA is a per-buffer decision, not a per-device decision.

...

dma_addr = virt_to_dma(dev, ptr);
dma_sync_single_for_device(dev, dma_addr, size, dir); }

This would not even conflict with having a common implementation for iommu based dma_map_ops -- we would just call the iommu functions directly when needed rather than having an indirect function call.

-- Regards, Laurent Pinchart

Arnd Bergmann

3:31 p.m.

On Tuesday 03 May 2011, Laurent Pinchart wrote:

...

I wish it was that simple.

The OMAP4 ISS (Imaging Subsystem) has no IOMMU, but it can use the OMAP4 DMM (Dynamic Memory Manager) which acts as a memory remapper. Basically (if my understanding is correct), the ISS is configured to read/write from/to physical addresses. If those physical addresses are in the DMM address range, the DMM translates the accesses to physical accesses, acting as an IOMMU.

The ISS can thus write to physically contiguous memory directly, or to scattered physical pages through the DMM. Whether an IOMMU (or, to be correct in this case, the IOMMU-like DMM) needs to handle the DMA is a per-buffer decision, not a per-device decision.

This doesn't sound too unusual for IOMMU implementations. A lot of time you can access e.g. low memory using a direct mapping but you need the IOMMU code for highmem. I've also seen a machine where a linear mapping exists for all the memory in strict ordering, while you can use relaxed DMA ordering when you go through the IOMMU address range. If we manage to come up with a common dma-mapping API implementation for all IOMMUs, it certainly needs to handle that case as well.

Arnd

FUJITA Tomonori

27 Apr 27 Apr

2:06 p.m.

On Wed, 27 Apr 2011 10:52:25 +0100 Catalin Marinas catalin.marinas@arm.com wrote:

...

Anyway, we end up with different DMA ops per device via dev_archdata.

Several architectures already do. What's wrong with the approach for arm?

Catalin Marinas

2:29 p.m.

On Wed, 2011-04-27 at 15:06 +0100, FUJITA Tomonori wrote:

...

On Wed, 27 Apr 2011 10:52:25 +0100 Catalin Marinas catalin.marinas@arm.com wrote:

...
Anyway, we end up with different DMA ops per device via dev_archdata.

Several architectures already do. What's wrong with the approach for arm?

Nothing wrong IMHO but it depends on how you group the DMA ops as it may not be feasible to have all the combinations dmabounce/iommu/coherency combinations. I think the main combinations would be:

1. standard (no-iommu) + non-coherent 2. standard (no-iommu) + coherent 3. iommu + non-coherent 4. iommu + coherent 5. dmabounce + non-coherent 6. dmabounce + coherent

I think dmabounce and iommu can be exclusive (unless the iommu cannot access the whole RAM). If that's the case, we can have three type of DMA ops:

1. standard 2. iommu 3. dmabounce

with an additional flag via dev_archdata for cache coherency level (a device may be able to snoop the L1 or L2 cache etc.)

-- Catalin

FUJITA Tomonori

2:34 p.m.

On Wed, 27 Apr 2011 15:29:30 +0100 Catalin Marinas catalin.marinas@arm.com wrote:

...

On Wed, 2011-04-27 at 15:06 +0100, FUJITA Tomonori wrote:

...
On Wed, 27 Apr 2011 10:52:25 +0100 Catalin Marinas catalin.marinas@arm.com wrote:

...
Anyway, we end up with different DMA ops per device via dev_archdata.

Several architectures already do. What's wrong with the approach for arm?

Nothing wrong IMHO but it depends on how you group the DMA ops as it may not be feasible to have all the combinations dmabounce/iommu/coherency combinations. I think the main combinations would be:

standard (no-iommu) + non-coherent

standard (no-iommu) + coherent

iommu + non-coherent

iommu + coherent

dmabounce + non-coherent

dmabounce + coherent

I think dmabounce and iommu can be exclusive (unless the iommu cannot access the whole RAM). If that's the case, we can have three type of DMA ops:

standard

iommu

dmabounce

with an additional flag via dev_archdata for cache coherency level (a device may be able to snoop the L1 or L2 cache etc.)

Sounds nothing wrong to me too. I like to see arm people to switch from dmabounce to swiotlb though.

Russell King - ARM Linux

8:29 p.m.

On Wed, Apr 27, 2011 at 11:06:00PM +0900, FUJITA Tomonori wrote:

...

On Wed, 27 Apr 2011 10:52:25 +0100 Catalin Marinas catalin.marinas@arm.com wrote:

...
Anyway, we end up with different DMA ops per device via dev_archdata.

Several architectures already do. What's wrong with the approach for arm?

Please read the rest of the thread, where I've already explained the issue.

Benjamin Herrenschmidt

9:45 p.m.

On Wed, 2011-04-27 at 10:52 +0100, Catalin Marinas wrote:

...

It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

Don't you have a risk where speculative loads or prefetches might bring back some stuff into the cache via the cachable mapping ? Is that an issue ? As long as it's non-dirty and the cachable mapping isn't otherwise used, I suppose it might be a non-issue, tho I've seen in powerpc land cases of processors that can checkstop if a subsequent non cachable access "hits" the stuff that was loaded in the cache.

Cheers, Ben.

KyongHo Cho

28 Apr 28 Apr

7:24 a.m.

On Thu, Apr 28, 2011 at 6:45 AM, Benjamin Herrenschmidt benh@kernel.crashing.org wrote:

...

Don't you have a risk where speculative loads or prefetches might bring back some stuff into the cache via the cachable mapping ? Is that an issue ? As long as it's non-dirty and the cachable mapping isn't otherwise used, I suppose it might be a non-issue, tho I've seen in powerpc land cases of processors that can checkstop if a subsequent non cachable access "hits" the stuff that was loaded in the cache.

Cheers, Ben.

As far as I know, ARM processors does not have the capability to detect non-cacheable access hits the stuff in the cache.

IMHO, speculative prefetch becomes a problem when a coherent buffer (that is not-cacheable in ARM) is modified by a DMA transaction while old data is already loaded in the cache via another cacheable mapping onto the buffer even though it is never touched by CPU. We can avoid this problem if the kernel removes 'executable' property from the cacheable mapping. But it is not able to modify page table entries in the direct mapping area.

Regards, KyongHo

Catalin Marinas

8:31 a.m.

On Wed, 2011-04-27 at 22:45 +0100, Benjamin Herrenschmidt wrote:

...

On Wed, 2011-04-27 at 10:52 +0100, Catalin Marinas wrote:

...
It's not broken since we moved to using Normal non-cacheable memory for the coherent DMA buffers (as long as you flush the cacheable alias before using the buffer, as we already do). The ARM ARM currently says unpredictable for such situations but this is being clarified in future updates and the Normal non-cacheable vs cacheable aliases can be used (given correct cache maintenance before using the buffer).

Don't you have a risk where speculative loads or prefetches might bring back some stuff into the cache via the cachable mapping ? Is that an issue ? As long as it's non-dirty and the cachable mapping isn't otherwise used, I suppose it might be a non-issue, tho I've seen in powerpc land cases of processors that can checkstop if a subsequent non cachable access "hits" the stuff that was loaded in the cache.

At the CPU cache level, unexpected cache hit is considered a miss, IOW non-cacheable memory accesses ignore the cache lines that may have been speculatively loaded.

-- Catalin

Benjamin Herrenschmidt

27 Apr 27 Apr

9:31 p.m.

On Thu, 2011-04-21 at 21:29 +0200, Arnd Bergmann wrote:

...

Extend the dma_map_ops to have a way for mapping a buffer from dma_alloc_{non,}coherent into user space. We have not discussed that yet, but after thinking this for some time, I believe this would be the right approach to map buffers into user space from code that doesn't care about the underlying hardware.

Yes. There is a dma_mmap_coherent() call that's not part of the "Real" API but is implemented by some archs and used by Alsa (I added support for it on powerpc recently).

Maybe that should go into the dma ops.

The question remains, if we ever want to do more complex demand-paged operations, should we also expose a lower level set of functions to get struct page out of a dma_alloc_coherent() allocation and to get the pgprot for the user dma mapping ?

...

After all these are in place, building anything on top of dma_alloc_{non,}coherent should be much easier. The question of passing buffers between V4L and DRM is still completely unsolved as far as I can tell, but that discussion might become more focused if we can agree on the above points and assume that it will be done.

My gut feeling is that it should be done by having V4L use DRM buffers in the first place...

...

I expect that I will have to update the list above as people point out mistakes in my assumptions.

Cheers, Ben.

Russell King - ARM Linux

28 Apr 28 Apr

9:42 a.m.

On Thu, Apr 28, 2011 at 07:31:06AM +1000, Benjamin Herrenschmidt wrote:

...

The question remains, if we ever want to do more complex demand-paged operations, should we also expose a lower level set of functions to get struct page out of a dma_alloc_coherent() allocation and to get the pgprot for the user dma mapping ?

I don't think so - that places the requirement that dma_alloc_coherent() must be backed by memory with a set of struct page, which may not always be the case.

Think about dma_alloc_coherent() with dma_declare_coherent_memory() used with memory which is not part of system RAM.

Joerg Roedel

10:27 a.m.

On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

Yes, thats missing for a long time. It will also need some changes to the IOMMU-API but that should be doable. The best would be to extend the IOMMU-API so that it also supports GART-like IOMMUs. This way every dma_ops implementation on all the architectures providing such an IOMMU could be covered with the architecture independent dma_ops implementation. This would only leave the low-level hardware access in the IOMMU drivers. I think this also requires to change the current semantics of the existing IOMMU-API implementations. I will prepare a write-up of my ideas for discussion.

Regards,

Joerg

Arnd Bergmann

12:15 p.m.

On Thursday 28 April 2011, Joerg Roedel wrote:

...

On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

...

Implement an architecture independent version of dma_map_ops based on the iommu.h API. As Joerg mentioned, this has been missing for some time, and it would be better to do it once than for each IOMMU separately. This is probably a lot of work.

Yes, thats missing for a long time. It will also need some changes to the IOMMU-API but that should be doable. The best would be to extend the IOMMU-API so that it also supports GART-like IOMMUs. This way every dma_ops implementation on all the architectures providing such an IOMMU could be covered with the architecture independent dma_ops implementation. This would only leave the low-level hardware access in the IOMMU drivers. I think this also requires to change the current semantics of the existing IOMMU-API implementations. I will prepare a write-up of my ideas for discussion.

Ok, thanks!

Please include Marek in this, he said he has already started with an implementation. Any insight from you will certainly help.

Arnd

5289

days inactive

5301

days old

linaro-mm-sig@lists.linaro.org

97 comments

participants

tags (0)

participants (20)

Alan Cox
Alex Deucher
Anca Emanuel
Arnd Bergmann
Benjamin Herrenschmidt
Catalin Marinas
Dave Martin
David Brown
FUJITA Tomonori
Jerome Glisse
Jesse Barnes
Joerg Roedel
Jordan Crouse
KyongHo Cho
Laurent Pinchart
Marek Szyprowski
Russell King - ARM Linux
Thomas Hellstrom
Valdis.Kletnieks＠vt.edu
Zach Pfeffer