Hello all,
(cc'ing everybody who I think might care - please ignore if you don't)
Chromium [as of 71] now defaults to software rendering when it detects the nouveau driver. This is bad for the ARM ecosystem, since nouveau is the only driver we have for NVIDIA hardware, and it has fewer issues than the AMD GPU driver and hardware combination on ARM systems.
https://lists.freedesktop.org/archives/nouveau/2019-January/031798.html
It can be addressed by setting the 'ignore-gpu-blacklist' option, but this requires an explicit action from the user, which is unfortunate and undesirable.
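For reference, assuming the standard Chromium command-line switch of the same name (the browser binary name varies by distro), the same override can be applied per launch with:

  chromium --ignore-gpu-blacklist

but that is exactly the kind of manual, per-user step we would like to avoid.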
Is there any way we can influence this?
Hi Ard,
I don't have the whole context, but my read of that email is:
1. The claim is that "some people had some issues with indeterminate hardware and indeterminate versions of mesa", however...
2. He "did run the WebGL CTS suite, but that resulted in some hangs from the max-texture-size-equivalent test, and some browser-level weirdness after some tests where later tests all fail".
I don't think those two statements are compatible. He can reproduce lots of failures on his own machine, probably just didn't have time to investigate all of them in detail.
Furthermore, he claims the failures are "due to what [he has] to assume is a browser bug" without any evidence to support it, and later on claims the driver is fine because "accelerated WebGL [...] in practice worked just fine (at least in my usage of it)".
To me, it smells like someone complaining that a broken piece of software is blacklisted and shouldn't have been, because everyone knows it's broken anyway, but it kinda works, so it's fine.
If that's a fair reading, I personally support blacklisting, and I second Chromium's suggestion to make the driver a first-class citizen as a way to remove it from the blacklist.
Who would do such work is a question that, to me, has no easy answer...
1. Linaro has no GPU working group and NVidia is not a member, so working on their drivers, even though they are open source and even if we had the expertise, would be giving them a free lunch.
2. NVidia doesn't care about OSS drivers (much) because they already have their own proprietary ones on the platforms they care about.
3. Arm can't work on OSS NVidia drivers as that would compete with Mali.
Someone could hire community developers to do that work, or at least to validate it on Arm and create a list of bugs that need to be fixed, with more details than just "works for me". Linaro could do the validation matrix but would have to do it for both Arm and x86, and then hope the nouveau community would pick up the tab and fix them all. We'd also have to provide access to hardware for them to test, etc.
An alternative (crappy) solution would be to IFDEF the inclusion in the blacklist *exclusively* for Arm, given that even we still don't care much about bugs in NVidia+Arm. But that's bound to lose some kudos for whoever proposes it, and it will be met with fierce refusal from the Chromium community.
The bottom line is: not many people care about nouveau on Arm, given that the only platform that actually uses it today is the SynQuacer.
I may be wrong; there may be a thriving community for NVidia on Arm out there. If there is, my humble opinion is that we should talk to them instead of putting pressure on Chromium to lift the ban. If not, there's no pressure to be put in the first place.
If, however, you propose we put pressure on nouveau specifically, for both Arm and x86, then I think it should come from the other side (x86) first, since Arm is not a first-class citizen on nouveau anyway. All in all, we're at the very bottom of the priority stack and there's no pressure we can put on anything, but Linaro could do the heavy lifting of building the validation matrix and help the nouveau community identify and validate their fixes.
My tuppence, --renato
On Tue, 8 Jan 2019 at 10:36, Renato Golin renato.golin@linaro.org wrote:
Hi Ard,
I don't have the whole context, but my read of that email is:
1. The claim is that "some people had some issues with indeterminate hardware and indeterminate versions of mesa", however...
2. He "did run the WebGL CTS suite, but that resulted in some hangs from the max-texture-size-equivalent test, and some browser-level weirdness after some tests where later tests all fail".
I don't think those two statements are compatible. He can reproduce lots of failures on his own machine, probably just didn't have time to investigate all of them in detail.
Furthermore, he claims the failures are "due to what [he has] to assume is a browser bug" without any evidence to support it, and later on claims the driver is fine because "accelerated WebGL [...] in practice worked just fine (at least in my usage of it)".
To me, it smells like someone complaining that a broken piece of software is blacklisted and shouldn't have been, because everyone knows it's broken anyway, but it kinda works, so it's fine.
I think one of the complaints is that there is a double standard here.
If that's a fair reading, I personally support blacklisting, and I second Chromium's suggestion to make the driver a first-class citizen as a way to remove it from the blacklist.
Who would do such work is a question that, to me, has no easy answer...
1. Linaro has no GPU working group and NVidia is not a member, so working on their drivers, even though they are open source and even if we had the expertise, would be giving them a free lunch.
2. NVidia doesn't care about OSS drivers (much) because they already have their own proprietary ones on the platforms they care about.
3. Arm can't work on OSS NVidia drivers as that would compete with Mali.
Someone could hire community developers to do that work, or at least to validate it on Arm and create a list of bugs that need to be fixed, with more details than just "works for me". Linaro could do the validation matrix but would have to do it for both Arm and x86, and then hope the nouveau community would pick up the tab and fix them all. We'd also have to provide access to hardware for them to test, etc.
An alternative (crappy) solution would be to IFDEF the inclusion in the blacklist *exclusively* for Arm, given that even we still don't care much about bugs in NVidia+Arm. But that's bound to lose some kudos for whoever proposes it, and it will be met with fierce refusal from the Chromium community.
The bottom line is: not many people care about nouveau on Arm, given that the only platform that actually uses it today is the SynQuacer.
Nouveau is also used on non-PCI NVidia ARM SoCs with integrated graphics.
I may be wrong; there may be a thriving community for NVidia on Arm out there. If there is, my humble opinion is that we should talk to them instead of putting pressure on Chromium to lift the ban. If not, there's no pressure to be put in the first place.
If, however, you propose we put pressure on nouveau specifically, for both Arm and x86, then I think it should come from the other side (x86) first, since Arm is not a first-class citizen on nouveau anyway. All in all, we're at the very bottom of the priority stack and there's no pressure we can put on anything, but Linaro could do the heavy lifting of building the validation matrix and help the nouveau community identify and validate their fixes.
Thanks Renato
On Tue, 8 Jan 2019 at 10:26, Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
On Tue, 8 Jan 2019 at 10:36, Renato Golin renato.golin@linaro.org wrote:
Hi Ard,
I don't have the whole context, but my read of that email is:
1. The claim is that "some people had some issues with indeterminate hardware and indeterminate versions of mesa", however...
2. He "did run the WebGL CTS suite, but that resulted in some hangs from the max-texture-size-equivalent test, and some browser-level weirdness after some tests where later tests all fail".
I don't think those two statements are compatible. He can reproduce lots of failures on his own machine, probably just didn't have time to investigate all of them in detail.
Furthermore, he claims the failures are "due to what [he has] to assume is a browser bug" without any evidence to support it, and later on claims the driver is fine because "accelerated WebGL [...] in practice worked just fine (at least in my usage of it)".
To me, it smells like someone complaining that a broken piece of software is blacklisted and shouldn't have been, because everyone knows it's broken anyway, but it kinda works, so it's fine.
I think one of the complaints is that there is a double standard here.
If that's a fair reading, I personally support blacklisting, and I second Chromium's suggestion to make the driver a first-class citizen as a way to remove it from the blacklist.
Who would do such work is a question that, to me, has no easy answer...
1. Linaro has no GPU working group and NVidia is not a member, so working on their drivers, even though they are open source and even if we had the expertise, would be giving them a free lunch.
2. NVidia doesn't care about OSS drivers (much) because they already have their own proprietary ones on the platforms they care about.
3. Arm can't work on OSS NVidia drivers as that would compete with Mali.
Someone could hire community developers to do that work, or at least to validate it on Arm and create a list of bugs that need to be fixed, with more details than just "works for me". Linaro could do the validation matrix but would have to do it for both Arm and x86, and then hope the nouveau community would pick up the tab and fix them all. We'd also have to provide access to hardware for them to test, etc.
An alternative (crappy) solution would be to IFDEF the inclusion in the blacklist *exclusively* for Arm, given that even we still don't care much about bugs in NVidia+Arm. But that's bound to lose some kudos for whoever proposes it, and it will be met with fierce refusal from the Chromium community.
The bottom line is: not many people care about nouveau on Arm, given that the only platform that actually uses it today is the SynQuacer.
Nouveau is also used on non-PCI NVidia ARM SoCs with integrated graphics.
And the TX2 Workstation!
Graeme
On Tue, 8 Jan 2019 at 10:27, Graeme Gregory graeme.gregory@linaro.org wrote:
I think one of the complaints is that there is a double standard here.
I get your point, but that's pretty normal in upstream communities, especially as support starts to drop.
Sometimes, it's the only way to get people to do the actual work to bring support back to first-class citizen.
I'm guilty of that, as I'm sure most of us in this list are, too.
Nouveau is also used on non-PCI NVidia ARM SoCs with integrated graphics.
And the TX2 Workstation!
Not an extensive list, considering NVidia's abysmal support for their Arm boards and the price tag on the TX2 box.
The Synquacer is probably the best bet if anyone starts working on fixing those bugs.
--renato
Renato Golin renato.golin@linaro.org writes:
Not an extensive list, considering NVidia's abysmal support for their Arm boards and the price tag on the TX2 box.
The Synquacer is probably the best bet if anyone starts working on fixing those bugs.
I think the SynQuacer hardware bugs get in the way.
I tried to get reliable graphics running for some time on my SynQuacer, both with the driver hacks and the EL2 bus-width workaround. In the end I could never get anything truly stable, so I just took out the card and ran it headless. Once I got the RAM in the right slots the machine has been solid, and as a bonus it uses a lot less juice without the card ;-)
-- Alex Bennée
On Tue, Jan 08, 2019 at 11:25:56AM +0100, Ard Biesheuvel wrote:
On Tue, 8 Jan 2019 at 10:36, Renato Golin renato.golin@linaro.org wrote:
Hi Ard,
I don't have the whole context, but my read of that email is:
1. The claim is that "some people had some issues with indeterminate hardware and indeterminate versions of mesa", however...
2. He "did run the WebGL CTS suite, but that resulted in some hangs from the max-texture-size-equivalent test, and some browser-level weirdness after some tests where later tests all fail".
I don't think those two statements are compatible. He can reproduce lots of failures on his own machine, probably just didn't have time to investigate all of them in detail.
Furthermore, he claims the failures are "due to what [he has] to assume is a browser bug" without any evidence to support it, and later on claims the driver is fine because "accelerated WebGL [...] in practice worked just fine (at least in my usage of it)".
To me, it smells like someone complaining that a broken piece of software is blacklisted and shouldn't have been, because everyone knows it's broken anyway, but it kinda works, so it's fine.
I think one of the complaints is that there is a double standard here.
I was curious enough about this to at least run the test suite recommended in the bug tracker.
My Intel/x86_64/Intel/F29 laptop and my Socionext/AArch64/nVidia/buster Developerbox work pretty much the same. Which is to say that the test suite causes the graphics stacks of both machines to lock up entirely (including rejecting any attempt to switch virtual terminals).
Regrettably that makes it difficult to access and share the test results prior to failure (I don't *think* the two machines fail at the same point although I did neglect to photograph the display so I can't be sure).
Daniel.
On Tue, Jan 8, 2019 at 8:37 AM Daniel Thompson daniel.thompson@linaro.org wrote:
On Tue, Jan 08, 2019 at 11:25:56AM +0100, Ard Biesheuvel wrote:
On Tue, 8 Jan 2019 at 10:36, Renato Golin renato.golin@linaro.org wrote:
Hi Ard,
I don't have the whole context, but my read of that email is:
1. The claim is that "some people had some issues with indeterminate hardware and indeterminate versions of mesa", however...
2. He "did run the WebGL CTS suite, but that resulted in some hangs from the max-texture-size-equivalent test, and some browser-level weirdness after some tests where later tests all fail".
I don't think those two statements are compatible. He can reproduce lots of failures on his own machine, probably just didn't have time to investigate all of them in detail.
Furthermore, he claims the failures are "due to what [he has] to assume is a browser bug" without any evidence to support it, and later on claims the driver is fine because "accelerated WebGL [...] in practice worked just fine (at least in my usage of it)".
To me, it smells like someone complaining that a broken piece of software is blacklisted and shouldn't have been, because everyone knows it's broken anyway, but it kinda works, so it's fine.
I think one of the complaints is that there is a double standard here.
I was curious enough about this to at least run the test suite recommended in the bug tracker.
My Intel/x86_64/Intel/F29 laptop and my Socionext/AArch64/nVidia/buster Developerbox work pretty much the same. Which is to say that the test suite causes the graphics stacks of both machines to lock up entirely (including rejecting any attempt to switch virtual terminals).
Regrettably that makes it difficult to access and share the test results prior to failure (I don't *think* the two machines fail at the same point although I did neglect to photograph the display so I can't be sure).
If you haven't already, this tends to be the sort of thing where having another machine to ssh from can be useful...
BR, -R
On Tue, Jan 08, 2019 at 09:05:23AM -0500, Rob Clark wrote:
On Tue, Jan 8, 2019 at 8:37 AM Daniel Thompson daniel.thompson@linaro.org wrote:
On Tue, Jan 08, 2019 at 11:25:56AM +0100, Ard Biesheuvel wrote:
On Tue, 8 Jan 2019 at 10:36, Renato Golin renato.golin@linaro.org wrote:
Hi Ard,
I don't have the whole context, but my read of that email is:
1. The claim is that "some people had some issues with indeterminate hardware and indeterminate versions of mesa", however...
2. He "did run the WebGL CTS suite, but that resulted in some hangs from the max-texture-size-equivalent test, and some browser-level weirdness after some tests where later tests all fail".
I don't think those two statements are compatible. He can reproduce lots of failures on his own machine, probably just didn't have time to investigate all of them in detail.
Furthermore, he claims the failures are "due to what [he has] to assume is a browser bug" without any evidence to support it, and later on claims the driver is fine because "accelerated WebGL [...] in practice worked just fine (at least in my usage of it)".
To me, it smells like someone complaining that a broken piece of software is blacklisted and shouldn't have been, because everyone knows it's broken anyway, but it kinda works, so it's fine.
I think one of the complaints is that there is a double standard here.
I was curious enough about this to at least run the test suite recommended in the bug tracker.
My Intel/x86_64/Intel/F29 laptop and my Socionext/AArch64/nVidia/buster Developerbox work pretty much the same. Which is to say that the test suite causes the graphics stacks of both machines to lock up entirely (including rejecting any attempt to switch virtual terminals).
Regrettably that makes it difficult to access and share the test results prior to failure (I don't *think* the two machines fail at the same point although I did neglect to photograph the display so I can't be sure).
If you haven't already, this tends to be the sort of thing where having another machine to ssh from can be useful...
Ah yes... I decided the results here risked being unfair to the laptop so I didn't put them in the e-mail.
On the Developerbox, networking survived to allow debugging, but on the laptop even the network was taken out in the crash... However, since the network on the laptop was hooked up via a Type-C combo port and the Developerbox uses the SoC's built-in Ethernet, it's not really apples to apples.
Daniel.
On Tue, Jan 08, 2019 at 01:37:07PM +0000, Daniel Thompson wrote:
On Tue, Jan 08, 2019 at 11:25:56AM +0100, Ard Biesheuvel wrote:
On Tue, 8 Jan 2019 at 10:36, Renato Golin renato.golin@linaro.org wrote:
Hi Ard,
I don't have the whole context, but my read of that email is:
1. The claim is that "some people had some issues with indeterminate hardware and indeterminate versions of mesa", however...
2. He "did run the WebGL CTS suite, but that resulted in some hangs from the max-texture-size-equivalent test, and some browser-level weirdness after some tests where later tests all fail".
I don't think those two statements are compatible. He can reproduce lots of failures on his own machine, probably just didn't have time to investigate all of them in detail.
Furthermore, he claims the failures are "due to what [he has] to assume is a browser bug" without any evidence to support it, and later on claims the driver is fine because "accelerated WebGL [...] in practice worked just fine (at least in my usage of it)".
To me, it smells like someone complaining that a broken piece of software is blacklisted and shouldn't have been, because everyone knows it's broken anyway, but it kinda works, so it's fine.
I think one of the complaints is that there is a double standard here.
I was curious enough about this to at least run the test suite recommended in the bug tracker.
My Intel/x86_64/Intel/F29 laptop and my Socionext/AArch64/nVidia/buster Developerbox work pretty much the same. Which is to say that the test suite causes the graphics stacks of both machines to lock up entirely (including rejecting any attempt to switch virtual terminals).
Regrettably that makes it difficult to access and share the test results prior to failure (I don't *think* the two machines fail at the same point although I did neglect to photograph the display so I can't be sure).
Presumably this is a similar issue to what we previously observed with nouveau on developerbox/macchiatobin, and such ssh access should still be possible?
/ Leif
Hi,
On Tue, 8 Jan 2019 at 09:52, Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
Chromium [as of 71] now defaults to software rendering when it detects the nouveau driver.
There seem to be some real issues with nouveau and the Chromium codebase. Qt WebEngine (which is based on Chromium) started blacklisting nouveau a long time ago because of threading issues: https://bugreports.qt.io/browse/QTBUG-41242 https://bugs.freedesktop.org/show_bug.cgi?id=91632
The Qt bug report seems to have a reproducer attached - but it looks like it's something hard to fix.
This is bad for the ARM ecosystem, since nouveau is the only driver we have for NVIDIA hardware, and it has fewer issues than the AMD GPU driver and hardware combination on ARM systems.
What's the problem with AMD GPU drivers on ARM? Just missing workarounds for things like the SynQuacer PCIE bug? On x86, AMD drivers tend to work better than NVIDIA, so maybe the simple (and acceptable, given NVIDIA refuses to join) fix would be improving AMD drivers and telling NVIDIA to go to hell...
It can be addressed by setting the 'ignore-gpu-blacklist' option, but this requires an explicit action from the user, which is unfortunate and undesirable.
Is there any way we can influence this?
We can probably get some Distros to patch it out -- but they will (and should) be reluctant while there's a good chance it fixes a real problem...
ttyl bero
Hi Bero,
On Tue, 8 Jan 2019 at 15:25, Bero Rosenkränzer Bernhard.Rosenkranzer@linaro.org wrote: ...
This is bad for the ARM ecosystem, since nouveau is the only driver we have for NVIDIA hardware, and it has fewer issues than the AMD GPU driver and hardware combination on ARM systems.
What's the problem with AMD GPU drivers on ARM? Just missing workarounds for things like the SynQuacer PCIE bug? On x86, AMD drivers tend to work better than NVIDIA, so maybe the simple (and acceptable, given NVIDIA refuses to join) fix would be improving AMD drivers and telling NVIDIA to go to hell...
AMD gfx cards produce lots of visual corruption under Linux on all of the arm64 boards I tried (including Seattle which has properly working PCIe), and I have never managed to get anyone interested in looking into it (although I didn't try /that/ hard tbh)
It can be addressed by setting the 'ignore-gpu-blacklist' option, but this requires an explicit action from the user, which is unfortunate and undesirable.
Is there any way we can influence this?
We can probably get some Distros to patch it out -- but they will (and should) be reluctant while there's a good chance it fixes a real problem...
I am not saying they should patch it out. I am just asking aloud whether we think we have a stake in this, and whether we can contribute to a solution in any way.
Hi,
On 01/08/2019 08:28 AM, Ard Biesheuvel wrote:
Hi Bero,
On Tue, 8 Jan 2019 at 15:25, Bero Rosenkränzer Bernhard.Rosenkranzer@linaro.org wrote: ...
This is bad for the ARM ecosystem, since nouveau is the only driver we have for NVIDIA hardware, and it has fewer issues than the AMD GPU driver and hardware combination on ARM systems.
What's the problem with AMD GPU drivers on ARM? Just missing workarounds for things like the SynQuacer PCIE bug? On x86, AMD drivers tend to work better than NVIDIA, so maybe the simple (and acceptable, given NVIDIA refuses to join) fix would be improving AMD drivers and telling NVIDIA to go to hell...
AMD gfx cards produce lots of visual corruption under Linux on all of the arm64 boards I tried (including Seattle which has properly working PCIe), and I have never managed to get anyone interested in looking into it (although I didn't try /that/ hard tbh)
Presumably amdgpu, and a more recent board? I had good luck with a hd5450 and the radeon driver although in its default configuration it expects to bootstrap the card using PCI PIO. Nouveau is hardly free from failures either, particularly if you happen to be running a 64k kernel.
Given that AMD has been a bit more open with their documentation, has an aarch64 UEFI/GOP driver on their site (https://www.amd.com/en/support/kb/release-notes/rn-aar), and doesn't seem to suffer as much when running !4k kernels, fixing it sounds well worth the effort.
On Tue, 8 Jan 2019 at 17:24, Jeremy Linton jeremy.linton@arm.com wrote:
Hi,
On 01/08/2019 08:28 AM, Ard Biesheuvel wrote:
Hi Bero,
On Tue, 8 Jan 2019 at 15:25, Bero Rosenkränzer Bernhard.Rosenkranzer@linaro.org wrote: ...
This is bad for the ARM ecosystem, since nouveau is the only driver we have for NVIDIA hardware, and it has fewer issues than the AMD GPU driver and hardware combination on ARM systems.
What's the problem with AMD GPU drivers on ARM? Just missing workarounds for things like the SynQuacer PCIE bug? On x86, AMD drivers tend to work better than NVIDIA, so maybe the simple (and acceptable, given NVIDIA refuses to join) fix would be improving AMD drivers and telling NVIDIA to go to hell...
AMD gfx cards produce lots of visual corruption under Linux on all of the arm64 boards I tried (including Seattle which has properly working PCIe), and I have never managed to get anyone interested in looking into it (although I didn't try /that/ hard tbh)
Presumably amdgpu, and a more recent board? I had good luck with a hd5450 and the radeon driver although in its default configuration it expects to bootstrap the card using PCI PIO.
HD5450 is what I used as well. Which h/w did you use? Also, is there anything wrong with using PCI PIO?
Nouveau is hardly free from failures either, particularly if you happen to be running a 64k kernel.
Oh absolutely. It's just that in the configurations I use (AMD Seattle and Socionext SynQuacer with GeForce 210 and 710), I haven't seen any stability issues in quite some time, and I use these systems as desktop machines all the time.
Given that AMD has been a bit more open with their documentation, has an aarch64 UEFI/GOP driver on their site (https://www.amd.com/en/support/kb/release-notes/rn-aar), and doesn't seem to suffer as much when running !4k kernels, fixing it sounds well worth the effort.
Between you and me (and everyone else on cc), that GOP driver is in pretty bad shape tbh. The AMD guys did show up at the plugfest, though, and were eager to get it fixed.
So yes, I would love to see this improve as well.
On 01/08/2019 10:33 AM, Ard Biesheuvel wrote:
On Tue, 8 Jan 2019 at 17:24, Jeremy Linton jeremy.linton@arm.com wrote:
Hi,
On 01/08/2019 08:28 AM, Ard Biesheuvel wrote:
Hi Bero,
On Tue, 8 Jan 2019 at 15:25, Bero Rosenkränzer Bernhard.Rosenkranzer@linaro.org wrote: ...
This is bad for the ARM ecosystem, since nouveau is the only driver we have for NVIDIA hardware, and it has fewer issues than the AMD GPU driver and hardware combination on ARM systems.
What's the problem with AMD GPU drivers on ARM? Just missing workarounds for things like the SynQuacer PCIE bug? On x86, AMD drivers tend to work better than NVIDIA, so maybe the simple (and acceptable, given NVIDIA refuses to join) fix would be improving AMD drivers and telling NVIDIA to go to hell...
AMD gfx cards produce lots of visual corruption under Linux on all of the arm64 boards I tried (including Seattle which has properly working PCIe), and I have never managed to get anyone interested in looking into it (although I didn't try /that/ hard tbh)
Presumably amdgpu, and a more recent board? I had good luck with a hd5450 and the radeon driver although in its default configuration it expects to bootstrap the card using PCI PIO.
HD5450 is what I used as well. Which h/w did you use? Also, is there anything wrong with using PCI PIO?
The 5450 spent probably close to two years in a Juno R2. The PIO requirement meant it didn't bootstrap without futzing in most other ARM machines I tried it in, because... well... few systems seem to make sure their firmware/ACPI tables/PIO/etc. are functional.
If someone were to take this up seriously, I would hope they would focus on some newer boards..
[Looping in Carsten who also has an AMD board in his TX2]
g.
On 08/01/2019 17:07, Grant Likely wrote:
FYI I have a Radeon RX550 with amdgpu on my thunder-x2. yes - it's a server ARM (aarch64) system, but it works a charm. 2 screens attached. I did have to do the following:
1. patch kernel DRM code to force uncached mappings (the code apparently assumes WC x86-style):
--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }
2. ensure I have a proper Xorg conf file for it to stop trying to do things with the dumb FB on board:
Section "ServerFlags" Option "AutoAddGPU" "false" EndSection
Section "Device" Identifier "amdgpu" Driver "amdgpu" BusID "137:0:0" Option "DRI" "2" Option "TearFree" "on" EndSection
I put that in /usr/share/X11/xorg.conf.d/10-amdgpu.conf
The kernel patch is just about the only really "advanced" thing I had to do. I currently don't really know how I should hack that up to be "consumable in general" (Should we do this all the time on ARM/aarch64 by default? Only on some SoCs/CPUs? Should we create kernel/module parameters to do the above?).
Hi,
The kernel patch is just about the only really "advanced" thing I had to do. I currently don't really know how I should hack that up to be "consumable in general" (Should we do this all the time on ARM/aarch64 by default? Only on some SoCs/CPUs? Should we create kernel/module parameters to do the above?).
Given that the function it touches is highly arch-specific already (it comes down to "false on PPC without Coherent Cache, false on Loongson3, true for everything else"), I'm pretty sure adding another #elif defined(CONFIG_ARM) or so would be acceptable. Not sure if there are any SoCs that should be exempted though...
drm_arch_can_wc_memory() seems to be used only in the amdgpu and radeon drivers - so SoCs that use Freedreno etc. or boxes with nouveau won't be affected (for now).
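To make that concrete, here is a minimal sketch of what such a change could look like, based on the drm_cache.h hunk quoted earlier in the thread; the exact Kconfig guards (and whether 32-bit ARM should be included at all) are assumptions rather than a tested patch:

static inline bool drm_arch_can_wc_memory(void)
{
#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
	return false;
#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
	return false;
#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
	/* Assumption: treat write-combined PCIe mappings as unreliable on
	 * all ARM systems, as discussed in this thread; individual SoCs
	 * that are known to be fine could be carved out later.
	 */
	return false;
#else
	return true;
#endif
}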
ttyl bero
On Tue, Jan 8, 2019 at 3:32 PM Bero Rosenkränzer Bernhard.Rosenkranzer@linaro.org wrote:
Hi,
The kernel patch is just about the only really "advanced" thing I had to do. I currently don't really know how I should hack that up to be "consumable in general" (Should we do this all the time on ARM/aarch64 by default? Only on some SoCs/CPUs? Should we create kernel/module parameters to do the above?).
Given the function it touches is highly arch specific already (it comes down to "false on PPC without Coherent Cache, false on Loongson3, true for everything else"), I'm pretty sure adding another #elif defined(CONFIG_ARM) or so would be acceptable. Not sure if there's any SOCs that should be exempted though...
drm_arch_can_wc_memory() seems to be used only in the amdgpu and radeon drivers - so SOCs that use Freedreno etc. or boxes with nouveau won't be affected (for now).
At least so far, adreno is not sitting behind a (real) PCI bus... I guess the issue on radeon/amdgpu was really about WC + PCI?
BR, -R
On 08/01/2019 20:50, Rob Clark wrote:
On Tue, Jan 8, 2019 at 3:32 PM Bero Rosenkränzer Bernhard.Rosenkranzer@linaro.org wrote:
Hi,
The kernel patch is just about the only really "advanced" thing I had to do. I currently don't really know how I should hack that up to be "consumable in general" (Should we do this all the time on ARM/aarch64 by default? Only on some SoCs/CPUs? Should we create kernel/module parameters to do the above?).
Given the function it touches is highly arch specific already (it comes down to "false on PPC without Coherent Cache, false on Loongson3, true for everything else"), I'm pretty sure adding another #elif defined(CONFIG_ARM) or so would be acceptable. Not sure if there's any SOCs that should be exempted though...
drm_arch_can_wc_memory() seems to be used only in the amdgpu and radeon drivers - so SOCs that use Freedreno etc. or boxes with nouveau won't be affected (for now).
At least so far, adreno is not sitting behind a (real) PCI bus... I guess the issue on radeon/amdgpu was really about WC + PCI?
It was for me - I just wasn't so sure that blanket returning false for arm (and aarch64) wasn't going to have some unintended victims. I guess, given the consensus here, it's worth adding those #ifdefs and submitting a patch upstream.
(adding Will who was part of a similar discussion before)
On Tue, 8 Jan 2019 at 19:35, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
On 08/01/2019 17:07, Grant Likely wrote:
FYI I have a Radeon RX550 with amdgpu on my thunder-x2. yes - it's a server ARM (aarch64) system, but it works a charm. 2 screens attached. I did have to do the following:
1. patch kernel DRM code to force uncached mappings (the code apparently assumes WC x86-style):

--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }
OK, so this is rather interesting. First of all, this is the exact change we apply to the nouveau driver to work on SynQuacer, i.e., demote all normal-non cacheable mappings of memory exposed by the PCIe controller via a BAR to device mappings. On SynQuacer, we need this because of a known silicon bug in the integration of the PCIe IP.
However, the fact that even on TX2, you need device mappings to map RAM exposed via PCIe is rather troubling, and it has come up in the past as well. The problem is that the GPU driver stack on Linux, including VDPAU libraries and other userland pieces all assume that memory exposed via PCIe has proper memory semantics, including the ability to perform unaligned accesses on it or use DC ZVA instructions to clear it. As we all know, these driver stacks are rather complex, and adding awareness to each level in the stack regarding whether a certain piece of memory is real memory or PCI memory is going to be cumbersome.
When we discussed this in the past, an ARM h/w engineer pointed out that normal-nc is fundamentally incompatible with AMBA or AXI or whatever we use on ARM to integrate these components at the silicon level. If that means we can only use device mappings, it means we will need to make intrusive changes to a *lot* of code to ensure it doesn't use memcpy() or do other things that device mappings don't tolerate on ARM.
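To make the distinction concrete, a minimal sketch in kernel terms (bar_phys and bar_size are hypothetical placeholders for a PCIe BAR region):

	/* Device-nGnRE: fine for MMIO, but memcpy(), unaligned accesses and
	 * DC ZVA against it can fault on arm64.
	 */
	void __iomem *dev_map = ioremap(bar_phys, bar_size);

	/* Normal non-cacheable: what the GPU/VDPAU stack implicitly assumes
	 * it can treat like ordinary memory.
	 */
	void __iomem *wc_map = ioremap_wc(bar_phys, bar_size);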
So, can we get the right people from the ARM side involved to clarify this once and for all?
2. ensure I have a proper Xorg conf file for it to stop trying to do things with the dumb FB on board:

Section "ServerFlags"
	Option "AutoAddGPU" "false"
EndSection

Section "Device"
	Identifier "amdgpu"
	Driver "amdgpu"
	BusID "137:0:0"
	Option "DRI" "2"
	Option "TearFree" "on"
EndSection
I put that in /usr/share/X11/xorg.conf.d/10-amdgpu.conf
This is needed to ignore the BMC display controller, I take it?
On Wed, Jan 09, 2019 at 11:41:00AM +0100, Ard Biesheuvel wrote:
(adding Will who was part of a similar discussion before)
On Tue, 8 Jan 2019 at 19:35, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
On 08/01/2019 17:07, Grant Likely wrote:
FYI I have a Radeon RX550 with amdgpu on my thunder-x2. yes - it's a server ARM (aarch64) system, but it works a charm. 2 screens attached. I did have to do the following:
1. patch kernel DRM code to force uncached mappings (the code apparently assumes WC x86-style):

--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }
OK, so this is rather interesting. First of all, this is the exact change we apply to the nouveau driver to work on SynQuacer, i.e., demote all normal-non cacheable mappings of memory exposed by the PCIe controller via a BAR to device mappings. On SynQuacer, we need this because of a known silicon bug in the integration of the PCIe IP.
However, the fact that even on TX2, you need device mappings to map RAM exposed via PCIe is rather troubling, and it has come up in the past as well. The problem is that the GPU driver stack on Linux, including VDPAU libraries and other userland pieces all assume that memory exposed via PCIe has proper memory semantics, including the ability to perform unaligned accesses on it or use DC ZVA instructions to clear it. As we all know, these driver stacks are rather complex, and adding awareness to each level in the stack regarding whether a certain piece of memory is real memory or PCI memory is going to be cumbersome.
When we discussed this in the past, an ARM h/w engineer pointed out that normal-nc is fundamentally incompatible with AMBA or AXI or whatever we use on ARM to integrate these components at the silicon level.
FWIW, I still don't understand exactly what the point being made was in that thread, but I do know that many of the assertions along the way were either vague or incorrect. Yes, it's possible to integrate different buses in a way that doesn't work, but I don't see anything "fundamental" about it.
If that means we can only use device mappings, it means we will need to make intrusive changes to a *lot* of code to ensure it doesn't use memcpy() or do other things that device mappings don't tolerate on ARM.
Even if we got it working, it would probably be horribly slow.
So, can we get the right people from the ARM side involved to clarify this once and for all?
Last time I looked at this code, the problem actually seemed to be that the DRM core ends up trying to remap the CPU pages in ttm_set_pages_uc(). This is a NOP for !x86, so I think we end up with the CPU using a cacheable mapping but the device using a non-cacheable mapping, which could explain the hang.
At the time, implementing set_pages_uc() to remap the linear mapping wasn't feasible because it would preclude the use of block mappings, but now that we're using page mappings by default maybe you could give it a try.
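For reference, a simplified sketch of the behaviour being described (an illustration, not the literal upstream code): the TTM page-attribute helper only does real work on x86, so elsewhere the CPU keeps its cacheable linear mapping of the pages:

#ifdef CONFIG_X86
static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
	return set_pages_uc(page, numpages);	/* remap CPU pages as uncached */
}
#else
static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
	return 0;	/* NOP: CPU mapping stays cacheable */
}
#endif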
Will
On Wed, 9 Jan 2019 at 12:05, Will Deacon will.deacon@arm.com wrote:
On Wed, Jan 09, 2019 at 11:41:00AM +0100, Ard Biesheuvel wrote:
(adding Will who was part of a similar discussion before)
On Tue, 8 Jan 2019 at 19:35, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
On 08/01/2019 17:07, Grant Likely wrote:
FYI I have a Radeon RX550 with amdgpu on my thunder-x2. yes - it's a server ARM (aarch64) system, but it works a charm. 2 screens attached. I did have to do the following:
1. patch kernel DRM code to force uncached mappings (the code apparently assumes WC x86-style):

--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }
OK, so this is rather interesting. First of all, this is the exact change we apply to the nouveau driver to work on SynQuacer, i.e., demote all normal-non cacheable mappings of memory exposed by the PCIe controller via a BAR to device mappings. On SynQuacer, we need this because of a known silicon bug in the integration of the PCIe IP.
However, the fact that even on TX2, you need device mappings to map RAM exposed via PCIe is rather troubling, and it has come up in the past as well. The problem is that the GPU driver stack on Linux, including VDPAU libraries and other userland pieces all assume that memory exposed via PCIe has proper memory semantics, including the ability to perform unaligned accesses on it or use DC ZVA instructions to clear it. As we all know, these driver stacks are rather complex, and adding awareness to each level in the stack regarding whether a certain piece of memory is real memory or PCI memory is going to be cumbersome.
When we discussed this in the past, an ARM h/w engineer pointed out that normal-nc is fundamentally incompatible with AMBA or AXI or whatever we use on ARM to integrate these components at the silicon level.
FWIW, I still don't understand exactly what the point being made was in that thread, but I do know that many of the assertions along the way were either vague or incorrect. Yes, it's possible to integrate different buses in a way that doesn't work, but I don't see anything "fundamental" about it.
OK
If that means we can only use device mappings, it means we will need to make intrusive changes to a *lot* of code to ensure it doesn't use memcpy() or do other things that device mappings don't tolerate on ARM.
Even if we got it working, it would probably be horribly slow.
So, can we get the right people from the ARM side involved to clarify this once and for all?
Last time I looked at this code, the problem actually seemed to be that the DRM core ends up trying to remap the CPU pages in ttm_set_pages_uc(). This is a NOP for !x86, so I think we end up with the CPU using a cacheable mapping but the device using a non-cacheable mapping, which could explain the hang.
At the time, implementing set_pages_uc() to remap the linear mapping wasn't feasible because it would preclude the use of block mappings, but now that we're using page mappings by default maybe you could give it a try.
OK, so I jumped to the conclusion that the similarity of the hack is due to the similarity of the issue it addresses, but this may not be the case.
Carsten, could you please explain /why/ this first change is required? It appears to only affect GTT mappings, and not VRAM mappings, both of which are essentially PCIe memory; only the former may be backed by system memory under the hood.
In any case, this is getting off-topic so I propose we trim the cc list a bit if we continue this discussion.
On 09/01/2019 11:54, Ard Biesheuvel wrote:
On Wed, 9 Jan 2019 at 12:05, Will Deacon will.deacon@arm.com wrote:
On Wed, Jan 09, 2019 at 11:41:00AM +0100, Ard Biesheuvel wrote:
(adding Will who was part of a similar discussion before)
On Tue, 8 Jan 2019 at 19:35, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
On 08/01/2019 17:07, Grant Likely wrote:
FYI I have a Radeon RX550 with amdgpu on my thunder-x2. yes - it's a server ARM (aarch64) system, but it works a charm. 2 screens attached. I did have to do the following:
1. patch kernel DRM code to force uncached mappings (the code apparently assumes WC x86-style):

--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }
OK, so this is rather interesting. First of all, this is the exact change we apply to the nouveau driver to work on SynQuacer, i.e., demote all normal-non cacheable mappings of memory exposed by the PCIe controller via a BAR to device mappings. On SynQuacer, we need this because of a known silicon bug in the integration of the PCIe IP.
However, the fact that even on TX2, you need device mappings to map RAM exposed via PCIe is rather troubling, and it has come up in the past as well. The problem is that the GPU driver stack on Linux, including VDPAU libraries and other userland pieces all assume that memory exposed via PCIe has proper memory semantics, including the ability to perform unaligned accesses on it or use DC ZVA instructions to clear it. As we all know, these driver stacks are rather complex, and adding awareness to each level in the stack regarding whether a certain piece of memory is real memory or PCI memory is going to be cumbersome.
When we discussed this in the past, an ARM h/w engineer pointed out that normal-nc is fundamentally incompatible with AMBA or AXI or whatever we use on ARM to integrate these components at the silicon level.
FWIW, I still don't understand exactly what the point being made was in that thread, but I do know that many of the assertions along the way were either vague or incorrect. Yes, it's possible to integrate different buses in a way that doesn't work, but I don't see anything "fundamental" about it.
OK
If that means we can only use device mappings, it means we will need to make intrusive changes to a *lot* of code to ensure it doesn't use memcpy() or do other things that device mappings don't tolerate on ARM.
Even if we got it working, it would probably be horribly slow.
So, can we get the right people from the ARM side involved to clarify this once and for all?
Last time I looked at this code, the problem actually seemed to be that the DRM core ends up trying to remap the CPU pages in ttm_set_pages_uc(). This is a NOP for !x86, so I think we end up with the CPU using a cacheable mapping but the device using a non-cacheable mapping, which could explain the hang.
At the time, implementing set_pages_uc() to remap the linear mapping wasn't feasible because it would preclude the use of block mappings, but now that we're using page mappings by default maybe you could give it a try.
OK, so I jumped to the conclusion that the similarity of the hack is due to the similarity of the issue it addresses, but this may not be the case.
Carsten, could you please explain /why/ this first change is required? It appears to only affect GTT mappings, and not VRAM mappings, both of which are essentially PCIe memory, only the former may be backed by system memory under the hood.
In any case, this is getting off-topic so I propose we trim the cc list a bit if we continue this discussion.
Well, the only 2 drivers to use drm_arch_can_wc_memory() are radeon and amdgpu, so we're limiting the effects to those. They use it to set up memory mappings and flags. If you look at the code in the amdgpu driver relevant to this:
	bo->flags = bp->flags;

#ifdef CONFIG_X86_32
	/* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
	 * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
	 */
	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
	/* Don't try to enable write-combining when it can't work, or things
	 * may be slow
	 * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
	 */
#ifndef CONFIG_COMPILE_TEST
#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
	 thanks to write-combining
#endif
	if (bo->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
		DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
			      "better performance thanks to write-combining\n");
	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#else
	/* For architectures that don't support WC memory,
	 * mask out the WC flag from the BO
	 */
	if (!drm_arch_can_wc_memory())
		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#endif
You can see that it's all about WC, and even on certain x86 setups it doesn't work, so it gets turned off. Not so on ARM. My understanding is that ARM can't do "WC" in a guaranteed way like x86 can, so turning it off is the right thing to do anyway, and no one has bothered to do it because no one really tests AMD devices on ARM (much), or doesn't bother getting patches back upstream. Yes, it only affects the GTT, but that is totally sufficient for things to go from totally broken (nothing even comes up) to totally working for me. I haven't looked, but I assume it's using that mapping for control registers or something, because my issues were with the command queue sequence numbers and so on, not large volumes of data like the textures themselves.
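To illustrate the sort of hang I was seeing (a hypothetical sketch, not the actual amdgpu fence code; seq_signalled and seq_ptr are made-up names): the CPU polls a sequence number that the GPU writes into a GTT-backed page, and if the CPU keeps a cacheable view of that page while the device writes through a different attribute (mismatched mappings), the poll can spin on a stale value indefinitely.

/* Illustrative only: classic wrap-safe sequence-number poll. */
static bool seq_signalled(const u32 *seq_ptr, u32 wanted)
{
        return (s32)(READ_ONCE(*seq_ptr) - wanted) >= 0;
}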
On Wed, 9 Jan 2019 at 14:55, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
My understanding is that ARM can't do "WC" in a guaranteed way like x86, so turning it off is the right thing to do anyway,
My understanding too.
FWIW I've added the fix to the OpenMandriva distro kernel https://github.com/OpenMandrivaSoftware/linux/commit/657041c5665c681d4519cf8... Let's see if any user starts screaming ;)
ttyl bero
On 09/01/2019 14:33, Bero Rosenkränzer wrote:
FWIW I've added the fix to the OpenMandriva distro kernel https://github.com/OpenMandrivaSoftware/linux/commit/657041c5665c681d4519cf8... Let's see if any user starts screaming ;)
Let's see. I have put a patch into the internal kernel patch review before I send it off to dri-devel. It's exactly your patch, just with a commit log explaining why.
On Wed, 9 Jan 2019 at 15:34, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
So what exactly is it about x86 style wc that ARM cannot do?
On 09/01/2019 14:36, Ard Biesheuvel wrote:
From Pavel Shamis here at ARM:
"Short version. X86 has well define behavior for WC memory – it combines multiples consecutive stores (has to be aligned to the cache line ) in 64B cache line writes over PCIe. On Arm WC corresponds to Normal NC. Arm uarch does not do combining to cache line size. On some uarch we do 16B combining but not cache line. The first uarch that will be doing cache line size combining is Aries.
It is important to note that WC is an opportunistic optimization and the software/hardware should not make an assumption that it always “combines” (true for x86 and arm)"
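For reference, that matches how the arm64 kernel itself defines these mapping types (paraphrased from arch/arm64/include/asm/pgtable.h, so take the exact spelling with a grain of salt): "write-combining" is Normal non-cacheable, while the plain "uncached"/ioremap() flavour is a Device type:

#define pgprot_writecombine(prot) \
        __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                        PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)
#define pgprot_noncached(prot) \
        __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                        PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)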
On Wed, 9 Jan 2019 at 15:38, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
OK, so that only means that ARM WC mappings may behave more like x86 uncached mappings than x86 WC mappings. It does not explain why things break if we use them.
The problem with using uncached mappings here is that it breaks use cases that expect memory semantics, such as unaligned accesses or DC ZVA instructions. At least VDPAU on nouveau breaks because of this, and likely many other use cases as well.
So *please*, do not send that patch to dri-devel. Let's instead fix the root cause, which may be related to the thing pointed out by Will, i.e., that ttm_set_pages_uc() is not implemented correctly.
On 09/01/2019 14:48, Ard Biesheuvel wrote:
For amdgpu, though, it works, and this is an AMD/Radeon-only code path. At least it works on the only ARM system I have an AMD GPU plugged into; you need the same fix for SynQuacer. Getting a fix upstream like this will alleviate a reasonable amount of pain for end users, even if it's not perfect.
I do not plan on going any further with this patch, because it's for my TX2, that is my ONLY workstation at work, and it takes something like 10 minutes per reboot cycle. I have many things to do, and getting my graphics card to a working state was the primary focus. Spending days just rebooting to try things with something I am not familiar with (the TTM mappings) is not something I have time for. Looking at the history of other bugs that affect WC/UC mappings in radeon/amdgpu shows that this is precisely the kind of fix that has been done multiple times in the past for x86 and obviously some MIPS and PPC systems. There's a mountain of precedent that this is a quick and simple fix that has been implemented many times before, so from that point of view I think it's a decent fix in and of itself when it comes to time vs. reward.
It may not be perfect, but it is better than it was, and other MIPS/PPC and even 32-bit x86 systems already need this kind of fix. In the same way, it seems ARM needs it too, and no one to date has bothered upstream. I'd rather things improve for at least some set of people than not improve at all for an undefined amount of time. Note that "working" is an improvement on "fast but doesn't work" in my book. :) Don't get me wrong: looking for a better fix in the meantime, if one exists, is a positive thing. It's just not something I can get stuck into, as above.
On Wed, 9 Jan 2019 at 16:30, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
I can confirm that this change fixes all the issues I observed on AMD Seattle with HD5450 and HD7450 cards, which use the radeon driver (not the amdgpu one).
So I will attempt to dig into this a bit further myself, and hopefully find something that carries over to amdgpu as well, so I may ask you to test something if I do.
I'd just like to see if we can fix this properly before we upstream a hack.
On Wed, 9 Jan 2019 at 20:07, Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
OK, so it seems that if drm_arch_can_wc_memory() returns true, the TTM code also prefers uncached WC mappings over cacheable mappings for system memory, resulting in mismatched attributes. I am not sure what the rationale is for this, but I'd be much happier having a fix that turns normal-nc mappings into cacheable ones than into device ones.
Could you please give the following patch a spin (and revert the other change)?
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -180,7 +180,9 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].flags = TTM_PL_FLAG_SYSTEM;
-		if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
+		if (!IS_ENABLED(CONFIG_ARM64) &&
+		    !IS_ENABLED(CONFIG_ARM) &&
+		    (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC))
 			places[c].flags |= TTM_PL_FLAG_WC |
 				TTM_PL_FLAG_UNCACHED;
 		else
On 09/01/2019 21:10, Ard Biesheuvel wrote:
The patch fails to apply against my mainline Linux tree. Which section does it modify?
if (domain & AMDGPU_GEM_DOMAIN_GTT) {
or
if (domain & AMDGPU_GEM_DOMAIN_CPU) {
It looks like it should be the latter. You'd need similar changes to radeon as well, right?
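For anyone following along, my reading of amdgpu_bo_placement_from_domain() (reproduced roughly from memory of the mainline source, so treat the details as approximate) is that the two candidate branches look like this, and the TTM_PL_FLAG_SYSTEM context line in the hunk points at the AMDGPU_GEM_DOMAIN_CPU one:

        if (domain & AMDGPU_GEM_DOMAIN_GTT) {
                places[c].fpfn = 0;
                places[c].lpfn = 0;
                places[c].flags = TTM_PL_FLAG_TT;
                if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
                        places[c].flags |= TTM_PL_FLAG_WC |
                                TTM_PL_FLAG_UNCACHED;
                else
                        places[c].flags |= TTM_PL_FLAG_CACHED;
                c++;
        }

        if (domain & AMDGPU_GEM_DOMAIN_CPU) {
                places[c].fpfn = 0;
                places[c].lpfn = 0;
                places[c].flags = TTM_PL_FLAG_SYSTEM;   /* the hunk's context line */
                if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
                        places[c].flags |= TTM_PL_FLAG_WC |
                                TTM_PL_FLAG_UNCACHED;
                else
                        places[c].flags |= TTM_PL_FLAG_CACHED;
                c++;
        }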
On Thu, 10 Jan 2019 at 17:14, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
It looks like it should be the latter. You'd need similar changes to radeon as well, right?
This is the radeon change that fixes things for me, carried over to amdgpu.
Instead, you could test the patch I posted, which amounts to the same thing (i.e., ignore the WC flag, since AFAICT it is only ever used to downgrade from cached mappings, rather than to upgrade from uncached ones).
On 10/01/2019 16:16, Ard Biesheuvel wrote:
That's precisely what I was doing (copy and paste to a temporary text file, then fix the spurious line wrapping), and the patch failed to apply, so I asked what it was meant to patch, to make sure I manually edit the right if () in the amdgpu driver.
I also know the radeon code has similar logic in it, which would probably need the equivalent fix if that driver is to be fixed too, since the shared function I patched no longer alters radeon's behavior if all I do is apply the patch you pasted and revert my patch as you indicated. I asked to make sure we're on the same page on this.
On 09/01/2019 21:10, Ard Biesheuvel wrote:
Could you please give the following patch a spin (and revert the other change)?
It took me an hour to get a reboot to work (the usual fun problems), but it seems to work. The GUI has been up and running for the past 10 minutes, about as good as it was before. So the amdgpu driver is verified, at least for me.
On 09/01/2019 19:07, Ard Biesheuvel wrote:
I can confirm that this change fixes all the issues I observed on AMD Seattle with HD5450 and HD7450 cards which use the Radeon driver (not the amdpgu one)
Hooray. Another happy user. :) I suspect Bero will report success too. At worst, this is the "tip of the iceberg" of the problem and the patch fixes it with a sledgehammer; at best it's exactly the right fix. :) But that's another topic. I see another mail with a patch, so I'll continue there.
I'd just like to see if we can fix properly before we upstream a hack.
If we find a significantly better fix in short order - sure. If this is going to drag out into weeks and weeks of back and forth, I think we should consider getting a fix out until something better can be found. Just keep in mind, for every day no fix is available someone somewhere is yelling at some system that doesn't work and they don't know why. They may not know C or how to even compile things... but they are unhappy. :)
On Thu, Jan 10, 2019 at 11:36:41AM +0000, Carsten Haitzler wrote:
This patch perpetuates the unfounded accusation that the Arm architecture is fundamentally incompatible with write-combining and PCI. If we don't bother to diagnose the reported failures correctly, removing hacks such as this once we are forced to understand the problem properly tends to be considerably more effort, in my experience, particularly if the same hack has been adopted by other drivers or subsystems.
So I don't think this patch is anything more than a short-term hack, which isn't something we should commit to maintaining upstream. I'm over the moon that it allows you to use your workstation effectively, but please let's try to root-cause this (as Ard is doing) before we rush something in that we're unable to reason about.
I know you're not a fan of rebooting, but I'd appreciate it if you could please help with testing (or throw me an AMD card for a few days so I can do it myself).
Thanks,
Will
On 10/01/2019 11:49, Will Deacon wrote:
I know you're not a fan of rebooting, but I'd appreciate it if you could please help with testing (or throw me an AMD card for a few days so I can do it myself).
I don't have any spare cards. :( Remember that this is my primary machine for everything, not a test machine.
I just had my TX2 (ThunderX2) lock up. It took 2 hours to have a working machine again because it refused to reboot (admittedly with lunch in between, hoping a cool-down might fix things, but without that it would still have taken an hour or so). Anything involving reboots is a risk on my TX2. Going through the BIOS means multiple minutes of waiting and hoping it finally boots, during which time I don't have a workstation to do email or anything else on. I should time the boot, but I think it's about 5-10 minutes (and I'm too scared to reboot just to time it).
It's not about being a fan of rebooting or not. It's about not sitting around, potentially for hours or even days, waiting on reboots and being unable to work on anything else. If I had a spare machine I could let boot on the side while getting on with stuff on my main one, I'd be fine. :) It's probably OK to reboot a bit and take the risk on my machine, just not frequently. :)
On Thu, 10 Jan 2019 at 15:29, Carsten Haitzler Carsten.Haitzler@arm.com wrote:
Going through the BIOS means multiple minutes of waiting for stuff hoping it finally boots, during which time I don't have a workstation to do email or anything else on. I should time the boot but I think it's about 5-10mins (and I'm too scared to reboot just to time it).
On our TX2 server, boot times used to be 3 minutes, but a firmware upgrade (7.0) made it 8 minutes. It turns out this is due to the DDR tests, and here's how to bring it back to 3 minutes:
https://bugs.linaro.org/show_bug.cgi?id=4132
Before UEFI menu there's a message to "press 's' for setup". Do it, then:
CAVM_CN99xx# env set dbg_speed_up_ddr_lvl 1
CAVM_CN99xx# env save
Hope it helps, --renato
On 10/01/2019 15:36, Renato Golin wrote:
It does seem to cut the "BIOS time" down to just under 4 minutes. Thanks! A full reboot cycle, though, from hitting "reboot" in my GUI to seeing my GUI logged in again, is still about 6 minutes. The kernel spends something like 30 seconds on boot just enumerating CPUs, for example; it seems unable to bring up CPUs in parallel...
On 09/01/2019 11:05, Will Deacon wrote:
On Wed, Jan 09, 2019 at 11:41:00AM +0100, Ard Biesheuvel wrote:
(adding Will who was part of a similar discussion before)
Even if we got it working, it would probably be horribly slow.
I got my AMD GPU working with the drm_cache.h change above, and it's not horribly slow. It's a universe better than nouveau with an NVIDIA GPU: I get silky-smooth 60fps across my two screens, whereas nouveau could only do 30 (not really a performance issue; likely buffer swap/sync/state bugs). I've run glmark on this and it's in the right ballpark compared to the same thing on x86, within 10 or 15% or so.
Also, not assuming a PCI mapping can do WC (x86-style), when at least on ARM that clearly can't be guaranteed, is a very simple fix that takes us from not working to working. I haven't found it to be slow either. :)