On 15.02.21 at 10:06, Simon Ser wrote:
On Monday, February 15th, 2021 at 9:58 AM, Christian König christian.koenig@amd.com wrote:
we are currently working on FreeSync and direct scan-out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Intel also needs uncached system memory if I'm not mistaken?
No idea, that's why I'm asking. Could be that this is also interesting for I+A systems.
Where are the buffers allocated? If GBM, then it needs to allocate memory that can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable modifier is picked.
If this is about communicating buffer constraints between different components of the stack, there were a few proposals about it. The most recent one is [1].
Well, the problem here is on a different level of the stack.
See, resolution, pitch etc. can easily be communicated in userspace without involvement of the kernel. The worst thing that can happen is that you draw garbage into your own application window.
But if you get the caching attributes in the page tables (CPU as well as IOMMU, device etc.) wrong, then ARM for example has a tendency to just spontaneously reboot.
X86 is fortunately a bit more graceful and you only end up with random data corruption, but that is only marginally better.
So to sum it up, that is not something we can leave in the hands of userspace.
I think that exporters in the DMA-buf framework should have the ability to tell importers whether system memory snooping is necessary or not.
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
Regards, Christian.
Simon
On Monday, 15.02.2021 at 10:34 +0100, Christian König wrote:
[SNIP]
I think that exporters in the DMA-buf framework should have the ability to tell importers whether system memory snooping is necessary or not.
There is already a coarse-grained way to do so: the dma_coherent property in struct device, which you can check at dma-buf attach time.
However, it may not be enough for the requirements of a GPU, where the engines could differ in their DMA coherency requirements. For that you would need to either have fake struct devices for the individual engines or come up with a more fine-grained way to communicate those requirements.
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
What exactly do you mean by "scanout requires non-coherent memory"? Does the scanout requestor always set the no-snoop PCI flag, so you get garbage if some writes to memory are still stuck in the caches, or is it some other requirement?
Regards, Lucas
On 15.02.21 at 12:53, Lucas Stach wrote:
On Monday, 15.02.2021 at 10:34 +0100, Christian König wrote:
[SNIP]
I think that exporters in the DMA-buf framework should have the ability to tell importers whether system memory snooping is necessary or not.
There is already a coarse-grained way to do so: the dma_coherent property in struct device, which you can check at dma-buf attach time.
However, it may not be enough for the requirements of a GPU, where the engines could differ in their DMA coherency requirements. For that you would need to either have fake struct devices for the individual engines or come up with a more fine-grained way to communicate those requirements.
Yeah, that won't work. We need this at a per-buffer level.
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
What exactly do you mean by "scanout requires non-coherent memory"? Does the scanout requestor always set the no-snoop PCI flag, so you get garbage if some writes to memory are still stuck in the caches, or is it some other requirement?
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something is in the cache and needs to be flushed first.
On the other hand, when we don't snoop the CPU caches we at least get garbage/stale data on the screen. That wouldn't be that bad, but the big problem is that we have also seen machine check exceptions when we don't snoop and the cache is dirty.
So this should better be coherent or you can crash the box. ARM seems to be really susceptible to this; x86 is fortunately much more graceful, and I'm not sure about other architectures.
Regards, Christian.
Regards, Lucas
On Monday, 15.02.2021 at 13:04 +0100, Christian König wrote:
On 15.02.21 at 12:53, Lucas Stach wrote:
On Monday, 15.02.2021 at 10:34 +0100, Christian König wrote:
[SNIP]
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something is in the cache and needs to be flushed first.
Okay, that confirms my theory on why this is needed. So things don't totally explode if you don't do it, but in order to guarantee access latency you need to take the no-snoop path, which means your device effectively becomes DMA-noncoherent.
On the other hand, when we don't snoop the CPU caches we at least get garbage/stale data on the screen. That wouldn't be that bad, but the big problem is that we have also seen machine check exceptions when we don't snoop and the cache is dirty.
If you attach to the dma-buf with a struct device which is non-coherent, it's the exporter's job to flush any dirty caches. Unfortunately the caching of dma-buf attachments in the DRM framework will get a bit in the way here, so a DRM-specific flush might be needed. :/ Maybe moving the whole buffer to an uncached sysmem location on first attach of a non-coherent importer would be enough?
So this should better be coherent or you can crash the box. ARM seems to be really susceptible to this; x86 is fortunately much more graceful, and I'm not sure about other architectures.
ARM really dislikes page table setups with different attributes pointing to the same physical page; however, you should be fine as long as all cached aliases are properly flushed from the cache before access via a different alias.
Regards, Lucas
On 15.02.21 at 13:16, Lucas Stach wrote:
[SNIP]
What exactly do you mean by "scanout requires non-coherent memory"? Does the scanout requestor always set the no-snoop PCI flag, so you get garbage if some writes to memory are still stuck in the caches, or is it some other requirement?
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something is in the cache and needs to be flushed first.
Okay, that confirms my theory on why this is needed. So things don't totally explode if you don't do it, but in order to guarantee access latency you need to take the no-snoop path, which means your device effectively becomes DMA-noncoherent.
Exactly. My big question at the moment is whether this is something AMD specific or whether we have the same issue on other devices as well.
On the other hand, when we don't snoop the CPU caches we at least get garbage/stale data on the screen. That wouldn't be that bad, but the big problem is that we have also seen machine check exceptions when we don't snoop and the cache is dirty.
If you attach to the dma-buf with a struct device which is non-coherent, it's the exporter's job to flush any dirty caches. Unfortunately the caching of dma-buf attachments in the DRM framework will get a bit in the way here, so a DRM-specific flush might be needed. :/ Maybe moving the whole buffer to an uncached sysmem location on first attach of a non-coherent importer would be enough?
Could work in theory, but the problem is that for this I would have to tear down all CPU mappings and attachments of other devices.
Apart from the problem that we don't have the infrastructure for that, we don't know at import time that a buffer might be used for scan-out. I would need to re-import it during fb creation or something like this.
Our current concept for AMD GPUs is rather that we try to use uncached memory as much as possible. So for the specific use case at hand, just checking if the exporter is AMDGPU and has the flag set should be enough for now.
So this should better be coherent or you can crash the box. ARM seems to be really susceptible to this; x86 is fortunately much more graceful, and I'm not sure about other architectures.
ARM really dislikes page table setups with different attributes pointing to the same physical page; however, you should be fine as long as all cached aliases are properly flushed from the cache before access via a different alias.
Yeah, can totally confirm that and had to learn it the hard way.
Regards, Christian.
Regards, Lucas
From: Christian König
Sent: 15 February 2021 12:05
...
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something is in the cache and needs to be flushed first.
Aren't you going to get the same problem if any other endpoints are doing memory reads? Possibly even ones that don't require a cache snoop and flush.
What about just the cpu doing a real memory transfer?
Or a combination of the two above happening just before your request.
If you don't have a big enough fifo you'll lose.
I did 'fix' a similar(ish) issue with video DMA latency on an embedded system based on the SA1100/SA1101 by significantly reducing the clock to the VGA panel whenever the CPU was doing 'slow io'. (Interleaving an uncached CPU DRAM write between the slow io cycles also fixed it.) But the video was the only DMA device, and that was an embedded system. Given that the application note about video latency didn't mention what was actually happening, I'm not sure how many people actually got it working!
David
Registered Address: Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK. Registration No: 1397386 (Wales)
On 15.02.21 at 15:41, David Laight wrote:
[SNIP]
Aren't you going to get the same problem if any other endpoints are doing memory reads?
The PCIe device in this case is part of the SoC, so we have a high-priority channel to memory.
Because of this the hardware designers assumed they had a guaranteed memory latency.
[SNIP]
If you don't have a big enough fifo you'll lose.
I did 'fix' a similar(ish) issue with video DMA latency on an embedded system based on the SA1100/SA1101 by significantly reducing the clock to the VGA panel whenever the CPU was doing 'slow io'. (Interleaving an uncached CPU DRAM write between the slow io cycles also fixed it.) But the video was the only DMA device, and that was an embedded system. Given that the application note about video latency didn't mention what was actually happening, I'm not sure how many people actually got it working!
Yeah, and I wouldn't be surprised if AMD solves this with deeper FIFOs or more prefetching in future designs.
But you gave me at least one example where somebody had similar problems.
Thanks for the feedback, Christian.
David