On 09/01/2019 21:10, Ard Biesheuvel wrote:
On Wed, 9 Jan 2019 at 20:07, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
On Wed, 9 Jan 2019 at 16:30, Carsten Haitzler <Carsten.Haitzler@arm.com> wrote:
On 09/01/2019 14:48, Ard Biesheuvel wrote:
On Wed, 9 Jan 2019 at 15:38, Carsten Haitzler <Carsten.Haitzler@arm.com> wrote:
On 09/01/2019 14:36, Ard Biesheuvel wrote:

On Wed, 9 Jan 2019 at 15:34, Carsten Haitzler <Carsten.Haitzler@arm.com> wrote:

On 09/01/2019 14:33, Bero Rosenkränzer wrote:

On Wed, 9 Jan 2019 at 14:55, Carsten Haitzler <Carsten.Haitzler@arm.com> wrote:

My understanding is that ARM can't do "WC" in a guaranteed way like x86,
so turning it off is the right thing to do anyway,

My understanding too.

FWIW I've added the fix to the OpenMandriva distro kernel
https://github.com/OpenMandrivaSoftware/linux/commit/657041c5665c681d4519cf8297e0b8799b929f86
Let's see if any user starts screaming ;)

ttyl
bero

Let's see. I have put a patch into the internal kernel patch review before I send it off to dri-devel. It's exactly your patch, just with a commit log explaining why.

So what exactly is it about x86 style wc that ARM cannot do?

From Pavel Shamis here at ARM:

"Short version.

X86 has well-defined behavior for WC memory – it combines multiple consecutive stores (they have to be aligned to the cache line) into 64B cache-line writes over PCIe.

On Arm, WC corresponds to Normal NC. Arm uarchs do not combine up to the cache-line size. On some uarchs we do 16B combining, but not a full cache line.

The first uarch that will be doing cache line size combining is Aries.

It is important to note that WC is an opportunistic optimization, and software/hardware should not assume that it always “combines” (true for both x86 and Arm)"

OK, so that only means that ARM WC mappings may behave more like x86
uncached mappings than x86 WC mappings. It does not explain why things
break if we use them.

The problem with using uncached mappings here is that they break use cases that expect memory semantics, such as unaligned accesses or DC ZVA instructions. At least VDPAU on nouveau breaks due to this, and likely many other use cases as well.
For amdgpu, though, it works, and this is an amdgpu/radeon-only code path. At least it works on the only ARM system I have an AMD GPU plugged into. You need the same fix for SynQuacer. Getting a fix like this upstream will alleviate a reasonable amount of pain for end users, even if it's not perfect.

I do not plan on going any further with this patch because it's for my TX2, and that is my ONLY workstation at work, and it takes about 10 minutes per reboot cycle. I have many things to do, and getting my graphics card to a working state was the primary focus. Spending days rebooting to experiment with something I am not familiar with (the TTM mappings) is not something I have time for. Looking at the history of other bugs that affect WC/UC mappings in radeon/amdgpu shows that this is precisely the kind of fix that has been done multiple times in the past for x86, and obviously for some MIPS and PPC systems. There is a mountain of precedent that this is a quick and simple fix that has been implemented many times before, so from that point of view I think it's a decent fix in and of itself when it comes to time vs. reward.

I can confirm that this change fixes all the issues I observed on AMD Seattle with HD5450 and HD7450 cards, which use the radeon driver (not the amdgpu one).

So I will attempt to dig into this a bit further myself, and hopefully
find something that carries over to amdgpu as well, so I may ask you
to test something if I do.

It may not be perfect, but it is better than it was, and other MIPS/PPC and even 32-bit x86 systems already need this kind of fix. In the same way, it seems ARM needs it too, and no one to date has bothered upstreaming it. I'd rather things improve for at least some set of people than not improve at all for an undefined amount of time. Note that "working" is an improvement over "fast but doesn't work" in my book. :) Don't get me wrong: looking for a better fix in the meantime, if one exists, is a positive thing. It's just not something I can get stuck into, as above.

I'd just like to see if we can fix it properly before we upstream a hack.

OK, so it seems that if drm_arch_can_wc_memory() returns true, the TTM
code also prefers uncached WC mappings over cacheable mappings for
system memory, resulting in mismatched attributes. I am not sure what
the rationale is for this, but I'd be much happier having a fix that
turns normal-nc mappings into cacheable ones than into device ones.

Could you please give the following patch a spin (and revert the other change)?

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -180,7 +180,9 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
                places[c].fpfn = 0;
                places[c].lpfn = 0;
                places[c].flags = TTM_PL_FLAG_SYSTEM;
-               if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
+               if (!IS_ENABLED(CONFIG_ARM64) &&
+                   !IS_ENABLED(CONFIG_ARM) &&
+                   (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC))
                        places[c].flags |= TTM_PL_FLAG_WC |
                                TTM_PL_FLAG_UNCACHED;
                else

Took me an hour to get a reboot to work... (the usual fun problems). It seems to work. The GUI has been up and running for the past 10 minutes, about as good as it was before. So the amdgpu driver is verified, at least for me.
