On 09/01/2019 11:54, Ard Biesheuvel wrote:
On Wed, 9 Jan 2019 at 12:05, Will Deacon <will.deacon@arm.com> wrote:
On Wed, Jan 09, 2019 at 11:41:00AM +0100, Ard Biesheuvel wrote:
(adding Will who was part of a similar discussion before)
On Tue, 8 Jan 2019 at 19:35, Carsten Haitzler <Carsten.Haitzler@arm.com> wrote:
On 08/01/2019 17:07, Grant Likely wrote:
FYI, I have a Radeon RX550 with amdgpu on my ThunderX2. Yes - it's a server ARM (aarch64) system, but it works like a charm, with two screens attached. I did have to do the following:
- patch the kernel DRM code to force uncached mappings (the code apparently
assumes x86-style WC semantics):
--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }
OK, so this is rather interesting. First of all, this is the exact change we apply to the nouveau driver to make it work on SynQuacer, i.e., demote all Normal non-cacheable mappings of memory exposed by the PCIe controller via a BAR to Device mappings. On SynQuacer, we need this because of a known silicon bug in the integration of the PCIe IP.
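(For the record, "demote" here just means selecting a different memory-type attribute in the mapping helpers. A sketch of the distinction, paraphrasing arch/arm64/include/asm/pgtable.h as it looked around that time; the macro bodies are from memory, so treat them as illustrative:)

/* arm64: the attribute index is the essential difference between a
 * write-combining mapping and a device mapping.
 */
#define pgprot_writecombine(prot) \
	__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
			PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)
#define pgprot_noncached(prot) \
	__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
			PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)

The drm_cache.h hack above steers TTM toward the pgprot_noncached() flavour for these buffers.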
However, the fact that even on TX2 you need Device mappings to map RAM exposed via PCIe is rather troubling, and it has come up in the past as well. The problem is that the GPU driver stack on Linux, including VDPAU libraries and other userland pieces, all assume that memory exposed via PCIe has proper memory semantics, including the ability to perform unaligned accesses on it or to use DC ZVA instructions to clear it. As we all know, these driver stacks are rather complex, and adding awareness to each level in the stack regarding whether a certain piece of memory is real memory or PCI memory is going to be cumbersome.
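To make the failure mode concrete, here is a hypothetical fragment (not taken from any driver) of the kind of code compilers and libc routines generate freely for ordinary memory; on AArch64, both operations take an alignment fault if the target turns out to be a Device mapping:

#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: fine on Normal (cached or NC) memory,
 * faults on Device memory. AArch64-only because of DC ZVA.
 */
static void scribble(char *bar)
{
	uint64_t v = 0xdeadbeef;

	/* The compiler may lower this to a single unaligned 64-bit
	 * store, which is an Alignment fault on Device memory.
	 */
	memcpy(bar + 1, &v, sizeof(v));

	/* DC ZVA, as used by optimized memset()/clear paths, zeroes a
	 * cache-line-sized block and is only valid on Normal memory.
	 */
	asm volatile("dc zva, %0" :: "r"(bar) : "memory");
}

This is exactly why fixing it at each level of the stack is so painful: nothing in the source looks wrong.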
When we discussed this in the past, an ARM h/w engineer pointed out that Normal-NC is fundamentally incompatible with AMBA or AXI or whatever we use on ARM to integrate these components at the silicon level.
FWIW, I still don't understand exactly what the point being made was in that thread, but I do know that many of the assertions along the way were either vague or incorrect. Yes, it's possible to integrate different buses in a way that doesn't work, but I don't see anything "fundamental" about it.
OK
If that means we can only use Device mappings, we will need to make intrusive changes to a *lot* of code to ensure it doesn't use memcpy() or do other things that Device mappings don't tolerate on ARM.
Even if we got it working, it would probably be horribly slow.
So, can we get the right people from the ARM side involved to clarify this once and for all?
Last time I looked at this code, the problem actually seemed to be that the DRM core ends up trying to remap the CPU pages in ttm_set_pages_uc(). This is a NOP for !x86, so I think we end up with the CPU using a cacheable mapping but the device using a non-cacheable mapping, which could explain the hang.
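The wrappers in question look roughly like this (paraphrased from include/drm/ttm/ttm_set_memory.h as it existed at the time, from memory):

#ifdef CONFIG_X86
static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
	/* Rewrites the kernel linear mapping of the pages to uncached. */
	return set_pages_uc(page, numpages);
}
#else
static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
	/* No-op: the linear mapping stays cacheable, so the CPU and the
	 * device can see the same pages with mismatched attributes.
	 */
	return 0;
}
#endif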
At the time, implementing set_pages_uc() to remap the linear mapping wasn't feasible because it would preclude the use of block mappings, but now that we're using page mappings by default maybe you could give it a try.
OK, so I jumped to the conclusion that the similarity of the hack is due to the similarity of the issue it addresses, but this may not be the case.
Carsten, could you please explain /why/ this first change is required? It appears to only affect GTT mappings, and not VRAM mappings; both are essentially PCIe memory, except that the former may be backed by system memory under the hood.
In any case, this is getting off-topic so I propose we trim the cc list a bit if we continue this discussion.
Well, the only two drivers to use drm_arch_can_wc_memory() are radeon and amdgpu, so we're limiting the effects to those. They use it to set up memory mappings and flags. If you look at the relevant code in the amdgpu driver:
	bo->flags = bp->flags;

#ifdef CONFIG_X86_32
	/* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
	 * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
	 */
	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
	/* Don't try to enable write-combining when it can't work, or things
	 * may be slow
	 * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
	 */
#ifndef CONFIG_COMPILE_TEST
#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
	 thanks to write-combining
#endif

	if (bo->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
		DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
			      "better performance thanks to write-combining\n");
	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#else
	/* For architectures that don't support WC memory,
	 * mask out the WC flag from the BO
	 */
	if (!drm_arch_can_wc_memory())
		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#endif
You can see that it's all about WC, and even on certain x86 setups it doesn't work, so it gets turned off there. Not so on ARM. My understanding is that ARM can't do "WC" in a guaranteed way like x86 can, so turning it off is the right thing to do anyway, and no one bothered to do it because no one really tests AMD devices on ARM (much), or doesn't bother getting patches back upstream. Yes - it only affects the GTT, but that is totally sufficient for things to go from totally broken (nothing even comes up) to totally working for me. I haven't looked, but I assume it's using that mapping for control registers or something, because my issues were with the command queue sequence numbers and so on, not large volumes of data like the textures themselves etc.
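For what it's worth, the flag takes effect via TTM when the CPU mapping is set up; trimmed to just the arm bits, ttm_io_prot() in drivers/gpu/drm/ttm/ttm_bo_util.c looked something like this at the time (paraphrased, so details may differ):

pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
{
	/* Cached mappings need no adjustment. */
	if (caching_flags & TTM_PL_FLAG_CACHED)
		return tmp;

#if defined(__arm__) || defined(__aarch64__)
	/* With AMDGPU_GEM_CREATE_CPU_GTT_USWC masked out, TTM never sets
	 * TTM_PL_FLAG_WC, so we fall through to a Device mapping here
	 * instead of Normal-NC.
	 */
	if (caching_flags & TTM_PL_FLAG_WC)
		tmp = pgprot_writecombine(tmp);
	else
		tmp = pgprot_noncached(tmp);
#endif
	return tmp;
}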