On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
> On arm that doesn't work. The iommu api seems like a good fit, except the dma-api tends to get in the way a bit (drm/msm apparently has similar problems like tegra), and if you need contiguous memory dma_alloc_coherent is the only way to get at contiguous memory. There was a huge discussion years ago about that, and direct cma access was shot down because it would have exposed too much of the caching attribute mangling required (most arm platforms need wc-pages to not be in the kernel's linear map apparently).
I think you completely misunderstand ARM from what you've written above, and this worries me greatly about giving DRM the level of control that is being asked for.
Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache attributes are stored in the page tables. These caches are inherently non-aliasing when there are multiple mappings (which is a great step forward compared to the previous aliasing caches.)
As the cache attributes are stored in the page tables, this in theory allows different virtual mappings of the same physical memory to have different cache attributes. However, there's a problem, and that's called speculative prefetching.
Let's say you have one mapping which is cacheable, and another that is marked as write combining. If a cache line is speculatively prefetched through the cacheable mapping of this memory, and then you read the same physical location through the write combining mapping, it is possible that you could read cached data.
So, it is generally accepted that all mappings of any particular physical bit of memory should have the same cache attributes to avoid unpredictable behaviour.
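To make the hazard concrete, here is a deliberately simplified toy model, not a description of how any real ARM core behaves. The class ToyMemory, its methods, and the hits_cache flag are all hypothetical names invented for this sketch. It assumes a physically tagged cache, a write that bypasses the cache, and an architecturally unpredictable choice of whether a read through the write-combining mapping is satisfied from a cached line.

```python
class ToyMemory:
    """Toy model: DRAM plus a physically tagged cache. Illustration only."""

    def __init__(self):
        self.ram = {}    # physical address -> value held in DRAM
        self.cache = {}  # physical address -> cached copy of that value

    def speculative_prefetch(self, paddr):
        # May happen at any time while a cacheable mapping exists.
        self.cache[paddr] = self.ram.get(paddr, 0)

    def uncached_write(self, paddr, value):
        # A write that bypasses the cache (assumption: it goes straight to DRAM).
        self.ram[paddr] = value

    def read_write_combining(self, paddr, hits_cache):
        # hits_cache stands in for the unpredictable part: whether the
        # access through the write-combining mapping hits a cached line.
        if hits_cache and paddr in self.cache:
            return self.cache[paddr]
        return self.ram.get(paddr, 0)


def demo():
    mem = ToyMemory()
    paddr = 0x1000
    mem.ram[paddr] = 1
    mem.speculative_prefetch(paddr)   # line pulled in via the cacheable mapping
    mem.uncached_write(paddr, 2)      # DRAM now holds newer data
    stale = mem.read_write_combining(paddr, hits_cache=True)
    fresh = mem.read_write_combining(paddr, hits_cache=False)
    return stale, fresh


if __name__ == "__main__":
    print(demo())  # prints (1, 2)
```

The same read returns either the old or the new value depending on something software does not control, which is the unpredictable behaviour that mixed attributes invite.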
This presents a problem with what is generally called "lowmem" where the memory is mapped in kernel virtual space with cacheable attributes. It can also happen with highmem if the memory is kmapped.
This is why, on ARM, you can't use something like get_free_pages() to grab some pages from the system, pass them to the GPU, map them into userspace as write-combining, etc. It _might_ work for some CPUs, but ARM CPUs vary in how much prefetching they do, and what may work for one particular CPU is in no way guaranteed to work for another ARM CPU.
The official line from architecture folk is to assume that the caches infinitely speculate, are of infinite size, and can writeback *dirty* data at any moment.
The way to stop things like speculative prefetches to particular physical memory is to, quite "simply", not have any cacheable mappings of that physical memory anywhere in the system.
Now, cache flushes on ARM tend to be fairly expensive for GPU buffers. If you have, say, an 8MB buffer (for a 1080p frame) and you need to do a cache operation on that buffer, you'll be iterating over it 32 or maybe 64 bytes at a time "just in case" there's a cache line present. Add the potential need for _two_ flushes, one before the GPU operation and one after, as I detailed in my previous email, and this becomes _really_ expensive. At that point, you're probably way better off using write-combine memory where you don't need to spend CPU cycles performing cache flushing - potentially across all CPUs in the system if cache operations aren't broadcast.
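As a rough back-of-the-envelope check on those numbers, the sketch below counts per-line maintenance operations for the buffer described above. It assumes 8MB means 8 * 1024 * 1024 bytes and one operation per cache line. The helper name cache_line_ops is made up for this example, and the sketch says nothing about how long each operation takes on any particular CPU.

```python
def cache_line_ops(buffer_bytes, line_bytes):
    """Number of per-line cache operations needed to cover a buffer."""
    return -(-buffer_bytes // line_bytes)  # ceiling division


BUFFER_BYTES = 8 * 1024 * 1024  # assumption: "8MB" read as 8 MiB

if __name__ == "__main__":
    for line_bytes in (32, 64):
        one_flush = cache_line_ops(BUFFER_BYTES, line_bytes)
        # Two flushes: one before the GPU operation and one after.
        print(f"{line_bytes}-byte lines: {one_flush} ops per flush, "
              f"{2 * one_flush} ops for two flushes")
    # prints:
    # 32-byte lines: 262144 ops per flush, 524288 ops for two flushes
    # 64-byte lines: 131072 ops per flush, 262144 ops for two flushes
```

That is a few hundred thousand line operations per frame before counting any cross-CPU work if the operations are not broadcast.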
This isn't a simple matter of "just provide some APIs for cache operations" - there's much more that needs to be understood by all parties here, especially when we have GPU drivers that can be used with quite different CPUs.
It may well be that for some combinations of CPUs and workloads, it's better to use write-combine memory without cache flushing, but for other CPUs that tradeoff (for the same workload) could well be different.
Older ARMs get more interesting, because they have aliasing caches. That means the CPU cache aliases across different virtual space mappings in some way, which complicates (a) the mapping of memory and (b) handling the cache operations on it.
It's too late for me to go into that tonight, and I probably won't be reading mail for the next week and a half, sorry.