On Wednesday 21 May 2014 16:32:16 Catalin Marinas wrote:
On Wed, May 21, 2014 at 03:43:39PM +0100, Arnd Bergmann wrote:
On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
The point I was trying to make is a different one: For the streaming mapping, you can allow any mask as long as the swiotlb is able to create a bounce buffer.
dma_mask is a hardware parameter and is used by the streaming DMA API implementation (which could be swiotlb) to decide whether to bounce or not. The streaming DMA API can take any CPU address as input, independent of the dma_mask, and bounce accordingly. If it doesn't have a bounce buffer matching the dma_mask, dma_supported() would return false.
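A minimal sketch of that decision, modelled on the kernel's dma_capable() check (the wrapper name here is illustrative):

#include <linux/dma-mapping.h>

/* Bounce only if the buffer's bus address range does not fit
 * within the device's streaming dma_mask. */
static bool needs_bounce(struct device *dev, dma_addr_t addr, size_t size)
{
	return addr + size - 1 > *dev->dma_mask;
}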
This does not work for the coherent mapping, because that by definition cannot use bounce buffers.
Yes but most of the time these masks have the same value.
Let me start again with an example:
io_tlb_end = 0x1000000;     // 16 MB
dma_mask   = 0x10000000;    // 256 MB
ZONE_DMA   = 0x100000000;   // 4 GB
max_pfn    = 0x1000000000;  // 64 GB
The device driver has set dma_mask and coherent_dma_mask to 256 MB because of a limitation in the hardware. This succeeded because io_tlb_end is well within the DMA mask.
Since the coherent_dma_mask is smaller than max_pfn, the swiotlb version of dma_alloc_coherent then allocates the buffer using GFP_DMA. However, that likely returns an address above dma_mask/coherent_dma_mask.
swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA). If the returned address is not within coherent_dma_mask, it frees the allocated pages and tries to allocate from its bounce buffer (io_tlb).
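A condensed paraphrase of that logic (not the exact swiotlb code: virt_to_phys() stands in for the bus-address conversion, and alloc_from_io_tlb() is a hypothetical stand-in for swiotlb_tbl_map_single()):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical stand-in for carving a buffer out of the io_tlb area. */
void *alloc_from_io_tlb(struct device *dev, size_t size, dma_addr_t *handle);

void *sketch_alloc_coherent(struct device *dev, size_t size,
			    dma_addr_t *handle, gfp_t flags)
{
	int order = get_order(size);
	void *ret = (void *)__get_free_pages(flags | GFP_DMA, order);

	/* First attempt: a ZONE_DMA page, which may still sit above a
	 * mask that is smaller than the zone (256 MB vs 4 GB here). */
	if (ret && virt_to_phys(ret) + size - 1 > dev->coherent_dma_mask) {
		free_pages((unsigned long)ret, order);
		ret = NULL;
	}

	/* Fallback: the bounce buffer itself, already known to be within
	 * the mask because dma_supported() checked io_tlb_end. */
	if (!ret)
		ret = alloc_from_io_tlb(dev, size, handle);
	else
		*handle = virt_to_phys(ret);

	return ret;
}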
Ok, got it now. Thanks for your patience. I suppose this was done to maintain the API guarantee that dma_set_coherent_mask(dev, dma_get_mask(dev)) always succeeds.
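For reference, the pattern that guarantee supports, per the DMA API documentation (the 28-bit mask below approximates the 256 MB limit from the example):

#include <linux/dma-mapping.h>

static int example_probe(struct device *dev)
{
	/* hardware limited to 28 bits (256 MB) of addressing */
	if (dma_set_mask(dev, DMA_BIT_MASK(28)))
		return -EIO;	/* no DMA possible at all */

	/* Setting the same (or a smaller) coherent mask is then
	 * guaranteed to succeed. */
	dma_set_coherent_mask(dev, dma_get_mask(dev));
	return 0;
}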
My point was that to fix this case, you must not compare the coherent_dma_mask requested by the device against io_tlb_end but against ZONE_DMA.
See above, I think swiotlb does the right thing.
We have a problem, however, with CMA, as we assume that the returned address is within coherent_dma_mask (we need a contiguous_dma_supported function, similar to swiotlb_dma_supported). We also don't limit the CMA buffer to ZONE_DMA on arm64.
Right. I guess the answer to that is to teach CMA to honor GFP_DMA. This was probably never needed on ARM32 but is an obvious extension.
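A sketch of the check Catalin asks for, modelled on swiotlb_dma_supported(); the cma_area_end() helper for querying the device's CMA region is hypothetical:

#include <linux/device.h>
#include <linux/types.h>

/* Hypothetical: bus address of the first byte past the CMA area. */
phys_addr_t cma_area_end(struct device *dev);

/* The device is usable only if the whole CMA area is addressable
 * through its mask, mirroring the io_tlb_end check in swiotlb. */
static int contiguous_dma_supported(struct device *dev, u64 mask)
{
	return cma_area_end(dev) - 1 <= mask;
}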
As far as I understand it, you are not allowed to create a non-cacheable mapping for a page that also has a cacheable mapping, or did that change between ARMv7 and ARMv8?
No change between ARMv7 and ARMv8. In fact, ARMv7 clarified what "unpredictable" means when you mix different attributes. Basically, we can mix the same memory type (Normal Cacheable vs Normal NonCacheable), but if the cacheability differs, we need to do some cache maintenance before the first access via the non-cacheable mapping (clean potentially dirty lines that could otherwise be evicted at random). Any cache lines loaded speculatively via the cacheable mapping will not be hit when the memory is accessed via the non-cacheable one.
Ah, only for the first access? I had remembered this differently, thanks for the clarification.
Only the first time, and assuming that you no longer write to the cacheable alias and dirty it again (basically it's like streaming DMA: if you want to access the cacheable alias, you need cache maintenance).
Ok.
For CMA, we use the trick of changing the mapping attributes in place, but that doesn't work for normal pages, which are mapped using huge TLB entries.
That's true for arm32, and we could do the same for arm64, though I don't think it's worth it because we would have to break very large block mappings (e.g. 1GB ones). We have plenty of VA space, so we just create a new non-cacheable mapping (this latter part could be optimised, even generically, to use huge pages where possible).
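A sketch of that scheme, assuming __flush_dcache_area() (the arm64-internal clean+invalidate by VA helper) is available; the rest uses the generic vmap() interface:

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <asm/cacheflush.h>

/* Create a second, non-cacheable mapping for contiguous pages that
 * stay mapped cacheable (possibly via 1GB blocks) in the linear map. */
static void *map_noncacheable_alias(struct page *page, int count)
{
	void *kaddr = page_address(page);	/* cacheable linear alias */
	struct page **pages;
	void *vaddr;
	int i;

	/* Per the mismatched-attributes rules: clean the cacheable alias
	 * before the first non-cacheable access, so no dirty line can be
	 * evicted over data written through the new mapping. */
	__flush_dcache_area(kaddr, (size_t)count << PAGE_SHIFT);

	pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;
	for (i = 0; i < count; i++)
		pages[i] = page + i;

	/* New mapping in vmalloc space; the linear map is left untouched. */
	vaddr = vmap(pages, count, VM_MAP, pgprot_writecombine(PAGE_KERNEL));
	kfree(pages);
	return vaddr;
}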
I thought we specifically did this on arm32 to avoid the duplicate mapping because we had concluded that it would be broken even with the updated ARMv7 definition.
We did it because it was vaguely defined as "unpredictable" in the ARM ARM for a long time until the hw guys realised it's not easy to change Linux and decided to clarify what could actually happen (see A3.5.7 in the ARMv7 ARM, "Mismatched memory attributes").
But in the past we used to have coherent DMA buffers mapped as Strongly Ordered, and that is a different memory type from Normal NonCacheable. Even in this case, though, the ARM ARM now states that such a mapping is permitted, but the Strongly Ordered properties are not guaranteed (it could simply behave like Normal NonCacheable).
Right.
Of course, another approach would be to change the cacheability attributes of the io_tlb buffer (or the CMA one), but the current approach has the advantage that the io_tlb buffer can be used for both coherent and non-coherent devices.
That definitely shouldn't be done without performance testing. However, I wonder whether such measurements were done when the swiotlb was first designed: it's possible that it's cheaper to have the bounce buffer for noncoherent devices mapped noncacheable, so that we do a cache-bypassing copy into and out of the bounce buffer and save the extra flushes.
It doesn't do this AFAICT, but it's an interesting idea. Well, to be optimised when we can benchmark this on hardware.
One issue though is when some (or all) devices are coherent. In this case, even if you use a bounce buffer, it needs to be mapped coherently at the CPU level (e.g. the device has its own cache that can be snooped, so CPU accesses need to be cacheable).
Do you mean there are cases where an uncached buffer is not coherent with a device cache?
Yes. If, for example, you have a device (e.g. GPU, microcontroller) on a coherent interconnect, non-cacheable CPU accesses are not guaranteed to (and probably should not) hit in the other device's cache (it could be seen as a side-effect of allowing cacheable vs non-cacheable aliases).
Right, makes sense. So if we do this trick, it should only be done for devices that are not coherent to start with.
Arnd