On 09/01/2019 11:05, Will Deacon wrote:
On Wed, Jan 09, 2019 at 11:41:00AM +0100, Ard Biesheuvel wrote:
(adding Will who was part of a similar discussion before)
On Tue, 8 Jan 2019 at 19:35, Carsten Haitzler <Carsten.Haitzler@arm.com> wrote:
On 08/01/2019 17:07, Grant Likely wrote:
FYI I have a Radeon RX550 with amdgpu on my ThunderX2. Yes, it's a server ARM (aarch64) system, but it works a charm. Two screens attached. I did have to do the following:
- patch kernel DRM code to force uncached mappings (the code apparently
assumes WC x86-style):
--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }
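For reference, the function being patched is drm_arch_can_wc_memory() in include/drm/drm_cache.h; after the hunk above it reads roughly as follows (paraphrased from a 4.x-era header, not a verbatim copy):

static inline bool drm_arch_can_wc_memory(void)
{
#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
	return false;
#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
	return false;
#else
	return false;	/* was: return true; stop claiming WC mappings of PCIe memory are safe */
#endif
}

With the default branch returning false, TTM falls back to uncached/device mappings instead of write-combined ones.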
OK, so this is rather interesting. First of all, this is the exact change we apply to the nouveau driver to work on SynQuacer, i.e., demote all Normal non-cacheable mappings of memory exposed by the PCIe controller via a BAR to Device mappings. On SynQuacer, we need this because of a known silicon bug in the integration of the PCIe IP.
However, the fact that even on TX2 you need Device mappings for RAM exposed via PCIe is rather troubling, and it has come up in the past as well. The problem is that the GPU driver stacks on Linux, including the VDPAU libraries and other userland pieces, all assume that memory exposed via PCIe has proper memory semantics, including the ability to perform unaligned accesses on it or to use DC ZVA instructions to clear it. As we all know, these driver stacks are rather complex, and adding awareness to each level in the stack of whether a certain piece of memory is real memory or PCI memory is going to be cumbersome.
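To make that failure mode concrete, here is a minimal userspace sketch, assuming the BAR is exposed through the usual sysfs resource file and ends up with a Device mapping on arm64 (the PCI address in the path is purely illustrative): glibc's memset()/memcpy() freely use unaligned stores and DC ZVA, which fault on Device memory, so code like this can die with SIGBUS on ARM while working fine on an uncached/WC mapping on x86.

/* hedged sketch: touching a Device-mapped PCI BAR with memset() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical device address; substitute a real GPU's BDF */
	int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * The GPU driver stack does the moral equivalent of this in many
	 * places; with a Device-nGnRE mapping on arm64 the unaligned/ZVA
	 * stores inside memset() can raise an alignment fault (SIGBUS).
	 */
	memset(bar, 0, 4096);

	munmap(bar, 4096);
	close(fd);
	return 0;
}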
When we discussed this in the past, an ARM h/w engineer pointed out that normal-nc is fundamentally incompatible with AMBA or AXI or whatever we use on ARM to integrate these components at the silicon level.
FWIW, I still don't understand exactly what the point being made was in that thread, but I do know that many of the assertions along the way were either vague or incorrect. Yes, it's possible to integrate different buses in a way that doesn't work, but I don't see anything "fundamental" about it.
If that means we can only use device mappings, it means we will need to make intrusive changes to a *lot* of code to ensure it doesn't use memcpy() or do other things that device mappings don't tolerate on ARM.
Even if we got it working, it would probably be horribly slow.
I got my AMD GPU working with the above, and it's not horribly slow. It's a universe better than nouveau with an NV GPU. I get silky-smooth 60fps across my two screens, etc. Nouveau could only do 30 (not really a performance limit; more likely buffer swap/sync/state bugs). I've run glmark on this and it's in the right ballpark for the same thing on x86, within 10 or 15% or so.
So, can we get the right people from the ARM side involved to clarify this once and for all?
Last time I looked at this code, the problem actually seemed to be that the DRM core ends up trying to remap the CPU pages in ttm_set_pages_uc(). This is a NOP for !x86, so I think we end up with the CPU using a cacheable mapping but the device using a non-cacheable mapping, which could explain the hang.
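The shape of those helpers, paraphrased rather than quoted verbatim from the TTM headers, is roughly this: only x86 actually rewrites the kernel's own linear mapping of the pages, while every other architecture silently succeeds, leaving a cacheable alias behind.

#if defined(CONFIG_X86)
static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
	return set_pages_uc(page, numpages);	/* remaps the linear map to UC */
}
#else
static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
	return 0;	/* NOP: the linear map stays Normal Cacheable on arm64 */
}
#endif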
At the time, implementing set_pages_uc() to remap the linear mapping wasn't feasible because it would preclude the use of block mappings, but now that we're using page mappings by default maybe you could give it a try.
Well, also, the code assumes a PCI mapping can do WC (x86-style) when, at least on ARM, it clearly can't guarantee that, and fixing that assumption is a very simple way to go from not working to working. I haven't found it to be slow either. :)