On Wed, 9 Jan 2019 at 20:07, Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
On Wed, 9 Jan 2019 at 16:30, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
On 09/01/2019 14:48, Ard Biesheuvel wrote:
On Wed, 9 Jan 2019 at 15:38, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
On 09/01/2019 14:36, Ard Biesheuvel wrote:
On Wed, 9 Jan 2019 at 15:34, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
On 09/01/2019 14:33, Bero Rosenkränzer wrote:
On Wed, 9 Jan 2019 at 14:55, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
My understanding is that ARM can't do "WC" in a guaranteed way like x86, so turning it off is the right thing to do anyway,
My understanding too.
FWIW I've added the fix to the OpenMandriva distro kernel https://github.com/OpenMandrivaSoftware/linux/commit/657041c5665c681d4519cf8... Let's see if any user starts screaming ;)
ttyl bero
Let's see. I have put a patch into the internal kernel patch review before sending it off to dri-devel. It's exactly your patch, just with a commit log explaining why.
So what exactly is it about x86-style WC that ARM cannot do?
From Pavel Shamis here at ARM:
"Short version.
x86 has well-defined behavior for WC memory: it combines multiple consecutive stores (which must be aligned to the cache line) into 64B cache-line writes over PCIe.
On Arm, WC corresponds to Normal NC. Arm uarchs do not combine up to cache-line size; some uarchs do 16B combining, but not a full cache line.
The first uarch that will do cache-line-size combining is Aries.
It is important to note that WC is an opportunistic optimization, and software/hardware should not assume that it always "combines" (true for both x86 and Arm)."
OK, so that only means that ARM WC mappings may behave more like x86 uncached mappings than x86 WC mappings. It does not explain why things break if we use them.
The problem with using uncached mappings here is that they break use cases that expect memory semantics, such as unaligned accesses or DC ZVA instructions. At least VDPAU on nouveau breaks due to this, and likely many other use cases as well.
For amdgpu, though, it works, and this is an AMD+Radeon-only code path. At least it works on the only ARM system I have an AMD GPU plugged into, and you need the same fix for SynQuacer. Getting a fix like this upstream will alleviate a reasonable amount of pain for end users, even if it's not perfect.
I do not plan on going any further with this patch because it's for my TX2, and that is my ONLY workstation at work, and it takes about 10 minutes per reboot cycle. I have many things to do, and getting my gfx card to a working state was the primary focus. Spending days rebooting to experiment with something I am not familiar with (the TTM mappings) is not something I have time for. Looking at the history of other bugs that affect WC/UC mappings in radeon/amdgpu shows that this is precisely the kind of fix that has been applied multiple times in the past for x86, and obviously for some MIPS and PPC systems too. There is a mountain of precedent that this is a quick and simple fix that has been implemented many times before, so from that point of view I think it's a decent fix in and of itself when it comes to time vs. reward.
I can confirm that this change fixes all the issues I observed on AMD Seattle with HD5450 and HD7450 cards, which use the radeon driver (not the amdgpu one).
So I will attempt to dig into this a bit further myself, and hopefully find something that carries over to amdgpu as well, so I may ask you to test something if I do.
It may not be perfect, but it is better than it was, and other MIPS/PPC and even 32-bit x86 systems already need this kind of fix. In the same way, it seems ARM needs it too, and no one to date has bothered to upstream it. I'd rather things improve for at least some set of people than not improve at all for an undefined amount of time. Note that "working" is an improvement over "fast but doesn't work" in my book. :) Don't get me wrong: looking for a better fix in the meantime, if one exists, is a positive thing. It's just not something I can get stuck into, as above.
I'd just like to see if we can fix this properly before we upstream a hack.
OK, so it seems that if drm_arch_can_wc_memory() returns true, the TTM code also prefers uncached WC mappings over cacheable mappings for system memory, resulting in mismatched attributes. I am not sure what the rationale is for this, but I'd be much happier having a fix that turns normal-nc mappings into cacheable ones than into device ones.
Could you please give the following patch a spin (and revert the other change)?
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -180,7 +180,9 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].flags = TTM_PL_FLAG_SYSTEM;
-		if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
+		if (!IS_ENABLED(CONFIG_ARM64) &&
+		    !IS_ENABLED(CONFIG_ARM) &&
+		    (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC))
 			places[c].flags |= TTM_PL_FLAG_WC |
 				TTM_PL_FLAG_UNCACHED;
 		else