From: Liviu Dudau <Liviu.Dudau@arm.com>
The arm64 architecture correctly handles 64-bit DMA and can enable support for 64-bit EHCI host controllers.
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Ryan Harkin <ryan.harkin@linaro.org>
Signed-off-by: Mark Brown <broonie@linaro.org>
---
 drivers/usb/host/ehci-hcd.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 81cda09b47e3..e704d403beae 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -590,11 +590,17 @@ static int ehci_run (struct usb_hcd *hcd)
 	 */
 	hcc_params = ehci_readl(ehci, &ehci->caps->hcc_params);
 	if (HCC_64BIT_ADDR(hcc_params)) {
-		ehci_writel(ehci, 0, &ehci->regs->segment);
-#if 0
-// this is deeply broken on almost all architectures
+#if CONFIG_ARM64
+		ehci_writel(ehci, ehci->periodic_dma >> 32,
+				&ehci->regs->segment);
+		/*
+		 * this is deeply broken on almost all architectures
+		 * but arm64 can use it so enable it
+		 */
 		if (!dma_set_mask(hcd->self.controller, DMA_BIT_MASK(64)))
 			ehci_info(ehci, "enabled 64bit DMA\n");
+#else
+		ehci_writel(ehci, 0, &ehci->regs->segment);
 #endif
 	}
On Wed, 14 May 2014, Mark Brown wrote:
From: Liviu Dudau Liviu.Dudau@arm.com
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Signed-off-by: Liviu Dudau Liviu.Dudau@arm.com Signed-off-by: Ryan Harkin ryan.harkin@linaro.org Signed-off-by: Mark Brown broonie@linaro.org
Did you folks test this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 81cda09b47e3..e704d403beae 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -590,11 +590,17 @@ static int ehci_run (struct usb_hcd *hcd)
 	 */
 	hcc_params = ehci_readl(ehci, &ehci->caps->hcc_params);
 	if (HCC_64BIT_ADDR(hcc_params)) {
-		ehci_writel(ehci, 0, &ehci->regs->segment);
-#if 0
-// this is deeply broken on almost all architectures
+#if CONFIG_ARM64
+		ehci_writel(ehci, ehci->periodic_dma >> 32,
+				&ehci->regs->segment);
+		/*
+		 * this is deeply broken on almost all architectures
+		 * but arm64 can use it so enable it
+		 */
 		if (!dma_set_mask(hcd->self.controller, DMA_BIT_MASK(64)))
 			ehci_info(ehci, "enabled 64bit DMA\n");
+#else
+		ehci_writel(ehci, 0, &ehci->regs->segment);
It's silly to put this line in a separate #else section. The upper 32 bits of ehci->periodic_dma are bound to be 0 anyway, because it was allocated before the DMA mask was changed.
 #endif
 	}
Alan Stern
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
From: Liviu Dudau Liviu.Dudau@arm.com
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Signed-off-by: Liviu Dudau Liviu.Dudau@arm.com Signed-off-by: Ryan Harkin ryan.harkin@linaro.org Signed-off-by: Mark Brown broonie@linaro.org
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work. At the moment it is the only known USB host controller enabled for arm64. And 64-bit DMA works fine on arm64.
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 81cda09b47e3..e704d403beae 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -590,11 +590,17 @@ static int ehci_run (struct usb_hcd *hcd)
 	 */
 	hcc_params = ehci_readl(ehci, &ehci->caps->hcc_params);
 	if (HCC_64BIT_ADDR(hcc_params)) {
-		ehci_writel(ehci, 0, &ehci->regs->segment);
-#if 0
-// this is deeply broken on almost all architectures
+#if CONFIG_ARM64
+		ehci_writel(ehci, ehci->periodic_dma >> 32,
+				&ehci->regs->segment);
+		/*
+		 * this is deeply broken on almost all architectures
+		 * but arm64 can use it so enable it
+		 */
 		if (!dma_set_mask(hcd->self.controller, DMA_BIT_MASK(64)))
 			ehci_info(ehci, "enabled 64bit DMA\n");
+#else
+		ehci_writel(ehci, 0, &ehci->regs->segment);
It's silly to put this line in a separate #else section. The upper 32 bits of ehci->periodic_dma are bound to be 0 anyway, because it was allocated before the DMA mask was changed.
Well, I don't want to enable 64-bit DMA for *all* the platforms, so there needs to be an #else. While it is true that the ehci->periodic_dma variable has the top 32 bits zero, that cannot be guaranteed for the physical register holding that value, so I guess the write is not superfluous.
Best regards, Liviu
 #endif
 	}
Alan Stern
On Thu, May 15, 2014 at 04:17:44PM +0100, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work. At the moment it is the only known USB host controller enabled for arm64. And 64-bit DMA works fine on arm64.
We should probably conditionalise the device configuration on dma_set_mask() succeeding - any device that can't handle 64 bit DMA should be set up to constrain things appropriately.
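A minimal sketch of that idea (illustrative only, not a tested patch; HCC_64BIT_ADDR() and the register accesses are taken from the hunk above, and the explicit 32-bit fallback is an assumption):

	hcc_params = ehci_readl(ehci, &ehci->caps->hcc_params);
	if (HCC_64BIT_ADDR(hcc_params) &&
	    !dma_set_mask(hcd->self.controller, DMA_BIT_MASK(64))) {
		/* platform and controller both agreed to 64-bit DMA */
		ehci_writel(ehci, 0, &ehci->regs->segment);
		ehci_info(ehci, "enabled 64bit DMA\n");
	} else {
		/* otherwise constrain everything to 32-bit addresses */
		dma_set_mask(hcd->self.controller, DMA_BIT_MASK(32));
	}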
On Thu, 15 May 2014, Mark Brown wrote:
On Thu, May 15, 2014 at 04:17:44PM +0100, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work. At the moment it is the only known USB host controller enabled for arm64. And 64-bit DMA works fine on arm64.
We should probably conditionalise the device configuration on dma_set_mask() succeeding - any device that can't handle 64 bit DMA should be set up to constrain things appropriately.
Last I heard, there were EHCI controllers that claimed to handle 64-bit DMA but got it wrong. I don't know to what extent this may still be true.
Alan Stern
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
From: Liviu Dudau Liviu.Dudau@arm.com
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Signed-off-by: Liviu Dudau Liviu.Dudau@arm.com Signed-off-by: Ryan Harkin ryan.harkin@linaro.org Signed-off-by: Mark Brown broonie@linaro.org
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
At the moment it is the only known USB host controller enabled for arm64. And 64-bit DMA works fine on arm64.
What do you mean? What happens if you plug in a PCI card containing, say, a Sony EHCI host controller on an arm64 system?
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 81cda09b47e3..e704d403beae 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -590,11 +590,17 @@ static int ehci_run (struct usb_hcd *hcd)
 	 */
 	hcc_params = ehci_readl(ehci, &ehci->caps->hcc_params);
 	if (HCC_64BIT_ADDR(hcc_params)) {
-		ehci_writel(ehci, 0, &ehci->regs->segment);
-#if 0
-// this is deeply broken on almost all architectures
+#if CONFIG_ARM64
+		ehci_writel(ehci, ehci->periodic_dma >> 32,
+				&ehci->regs->segment);
+		/*
+		 * this is deeply broken on almost all architectures
+		 * but arm64 can use it so enable it
+		 */
 		if (!dma_set_mask(hcd->self.controller, DMA_BIT_MASK(64)))
 			ehci_info(ehci, "enabled 64bit DMA\n");
+#else
+		ehci_writel(ehci, 0, &ehci->regs->segment);
It's silly to put this line in a separate #else section. The upper 32 bits of ehci->periodic_dma are bound to be 0 anyway, because it was allocated before the DMA mask was changed.
Well, I don't want to enable 64-bit DMA for *all* the platforms, so there needs to be an #else.
No, there doesn't. Just leave the
ehci_writel(ehci, 0, &ehci->regs->segment);
line above the "#if CONFIG_ARM64", the way it is in the original code, and get rid of
ehci_writel(ehci, ehci->periodic_dma >> 32, &ehci->regs->segment);
While it is true that ehci->periodic_dma variable has the top 32 bits zero, that cannot be guaranteed to be true for the physical register holding that value, so I guess the write is not superfluous.
That's why you can write 0 to the register instead of writing ehci->periodic_dma >> 32.
Don't forget, the controller uses that same ehci->regs->segment register for ehci->qtd_pool, ehci->qh_pool, ehci->itd_pool, and ehci->sitd_pool as well as ehci->periodic. If those DMA pools were allocated in different regions of memory (that is, regions whose upper 32 bits were different), the controller wouldn't be able to access them. The only way to ensure they will all be allocated in the same 4-GB region is if they are allocated in the first 4 GB.
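Put differently, a single value in ehci->regs->segment can only be correct if every control structure shares the same upper 32 bits. A purely illustrative check (the *_dma variables standing for the pool base addresses are made up for this example):

	u32 seg = upper_32_bits(ehci->periodic_dma);

	/* hypothetical: all pools must live in the same 4-GB segment */
	WARN_ON(upper_32_bits(qtd_dma) != seg || upper_32_bits(qh_dma) != seg ||
		upper_32_bits(itd_dma) != seg || upper_32_bits(sitd_dma) != seg);
	ehci_writel(ehci, seg, &ehci->regs->segment);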
Alan Stern
On Thu, May 15, 2014 at 04:36:25PM +0100, Alan Stern wrote:
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
From: Liviu Dudau Liviu.Dudau@arm.com
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Signed-off-by: Liviu Dudau Liviu.Dudau@arm.com Signed-off-by: Ryan Harkin ryan.harkin@linaro.org Signed-off-by: Mark Brown broonie@linaro.org
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
It could if arm64 restricted DMA addresses to 32-bit, but it doesn't, and on my platform I end up with USB DMA buffers allocated above the 4GB boundary.
At the moment it is the only known USB host controller enabled for arm64. And 64-bit DMA works fine on arm64.
What do you mean? What happens if you plug in a PCI card containing, say, a Sony EHCI host controller on an arm64 system?
I don't have one, so I don't know for sure. I will try to find a PCI card that can do 32-bit and 64-bit DMA.
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 81cda09b47e3..e704d403beae 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -590,11 +590,17 @@ static int ehci_run (struct usb_hcd *hcd)
 	 */
 	hcc_params = ehci_readl(ehci, &ehci->caps->hcc_params);
 	if (HCC_64BIT_ADDR(hcc_params)) {
-		ehci_writel(ehci, 0, &ehci->regs->segment);
-#if 0
-// this is deeply broken on almost all architectures
+#if CONFIG_ARM64
+		ehci_writel(ehci, ehci->periodic_dma >> 32,
+				&ehci->regs->segment);
+		/*
+		 * this is deeply broken on almost all architectures
+		 * but arm64 can use it so enable it
+		 */
 		if (!dma_set_mask(hcd->self.controller, DMA_BIT_MASK(64)))
 			ehci_info(ehci, "enabled 64bit DMA\n");
+#else
+		ehci_writel(ehci, 0, &ehci->regs->segment);
It's silly to put this line in a separate #else section. The upper 32 bits of ehci->periodic_dma are bound to be 0 anyway, because it was allocated before the DMA mask was changed.
Well, I don't want to enable 64-bit DMA for *all* the platforms, so there needs to be an #else.
No, there doesn't. Just leave the
ehci_writel(ehci, 0, &ehci->regs->segment);
line above the "#if CONFIG_ARM64", the way it is in the original code, and get rid of
ehci_writel(ehci, ehci->periodic_dma >> 32, &ehci->regs->segment);
Actually I need this line because my periodic_dma addresses have their top 32 bits non-zero.
While it is true that ehci->periodic_dma variable has the top 32 bits zero, that cannot be guaranteed to be true for the physical register holding that value, so I guess the write is not superfluous.
That's why you can write 0 to the register instead of writing ehci->periodic_dma >> 32.
Don't forget, the controller uses that same ehci->regs->segment register for ehci->qtd_pool, ehci->qh_pool, ehci->itd_pool, and ehci->sitd_pool as well as ehci->periodic. If those DMA pools were allocated in different regions of memory (that is, regions whose upper 32 bits were different), the controller wouldn't be able to access them. The only way to insure they will all be allocated in the same 4-GB region is if they are allocated in the first 4 GB.
My platform creates all the USB buffers outside the 4GB zone. I need to go back to the code to understand whether that is by design or a misconfiguration.
Best regards, Liviu
Alan Stern
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 04:36:25PM +0100, Alan Stern wrote:
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
From: Liviu Dudau Liviu.Dudau@arm.com
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Signed-off-by: Liviu Dudau Liviu.Dudau@arm.com Signed-off-by: Ryan Harkin ryan.harkin@linaro.org Signed-off-by: Mark Brown broonie@linaro.org
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
At the moment it is the only known USB host controller enabled for arm64. And 64-bit DMA works fine on arm64.
What do you mean? What happens if you plug in a PCI card containing, say, a Sony EHCI host controller on an arm64 system?
I don't have one, so I don't know for sure. I will try to find a PCI card that can do 32-bit and 64-bit DMA.
What about a PCI card that can only do 32-bit DMA?
No, there doesn't. Just leave the
ehci_writel(ehci, 0, &ehci->regs->segment);
line above the "#if CONFIG_ARM64", the way it is in the original code, and get rid of
ehci_writel(ehci, ehci->periodic_dma >> 32, &ehci->regs->segment);
Actually I need this line because my period_dma addresses have top 32-bit values non-zero.
That seems like a bug.
Don't forget, the controller uses that same ehci->regs->segment register for ehci->qtd_pool, ehci->qh_pool, ehci->itd_pool, and ehci->sitd_pool as well as ehci->periodic. If those DMA pools were allocated in different regions of memory (that is, regions whose upper 32 bits were different), the controller wouldn't be able to access them. The only way to insure they will all be allocated in the same 4-GB region is if they are allocated in the first 4 GB.
My platform creates all the USB buffers outside the 4GB zone. Need to go back to the code to understand if that is due to design or misconfiguration.
If the platform doesn't guarantee that all those pools and buffers will lie in the same 4-GB region, it's a misconfiguration.
Alan Stern
On Thu, May 15, 2014 at 05:53:53PM +0100, Liviu Dudau wrote:
On Thu, May 15, 2014 at 04:36:25PM +0100, Alan Stern wrote:
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit. Which kernel version is this?
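For reference, the behaviour being described looks roughly like this from a driver's point of view (a minimal sketch, error handling trimmed, not tied to any particular kernel version):

	/* restrict coherent allocations for this device to 32-bit addresses */
	if (dma_set_coherent_mask(dev, DMA_BIT_MASK(32)))
		return -EIO;

	buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
	/* with a 32-bit coherent mask, dma_handle should fit in 32 bits */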
On Fri, May 16, 2014 at 01:55:01PM +0100, Catalin Marinas wrote:
On Thu, May 15, 2014 at 05:53:53PM +0100, Liviu Dudau wrote:
On Thu, May 15, 2014 at 04:36:25PM +0100, Alan Stern wrote:
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit. Which kernel version is this?
The initial patch was created before the removal of ZONE_DMA32 from arm64. I need to revisit the latest mainline kernel on my board to determine whether this patch is still needed.
Best regards, Liviu
-- Catalin
On Fri, 2014-05-16 at 13:55 +0100, Catalin Marinas wrote: [...]
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit.
Not if you have CONFIG_DMA_CMA. Unless I have misread the code, enabling CMA means memory comes from a common pool carved out at boot, with no way for drivers to specify their restrictions [1]. It's what I've spent most of the week trying to work around in a clean way, and I have finally given up.
[1] There is a partial, stalled attempt at doing device specific CMA allocation: http://lkml.org/lkml/2014/2/28/237
On Fri, May 16, 2014 at 06:08:45PM +0100, Jon Medhurst (Tixy) wrote:
On Fri, 2014-05-16 at 13:55 +0100, Catalin Marinas wrote: [...]
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit.
Not if you have CONFIG_DMA_CMA. Unless I have misread the code, enabling CMA means memory comes from a common pool carved out at boot with no way for drivers to specify it's restrictions [1]. It's what I've spent most of the week trying to work around in a clean way, and have finally given up.
The easiest "hack" would be to pass a limit to dma_contiguous_reserve() in arm64_memblock_init(), something like this:
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 51d5352e6ad5..558434c69612 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -162,7 +162,7 @@ void __init arm64_memblock_init(void)
 	}
 
 	early_init_fdt_scan_reserved_mem();
-	dma_contiguous_reserve(0);
+	dma_contiguous_reserve(dma_to_phys(NULL, DMA_BIT_MASK(32)) + 1);
 
 	memblock_allow_resize();
 	memblock_dump_all();
probably with a check for IS_ENABLED(CONFIG_ZONE_DMA) (we do this for swiotlb initialisation).
At some point, if we have some system topology description we could decide whether we need to limit the above based on the dma coherent masks.
On Fri, 2014-05-16 at 18:40 +0100, Catalin Marinas wrote:
On Fri, May 16, 2014 at 06:08:45PM +0100, Jon Medhurst (Tixy) wrote:
On Fri, 2014-05-16 at 13:55 +0100, Catalin Marinas wrote: [...]
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit.
Not if you have CONFIG_DMA_CMA. Unless I have misread the code, enabling CMA means memory comes from a common pool carved out at boot with no way for drivers to specify it's restrictions [1]. It's what I've spent most of the week trying to work around in a clean way, and have finally given up.
The easiest "hack" would be to pass a limit dma_contiguous_reserve() in arm64_memblock_init(), something like this:
That is by far the easiest, but I dismissed it because it affects all platforms built from that source tree; and if the hack were extended to include a kernel config option, that still may not work for a single kernel binary expected to run on multiple platforms. Basically, I've tried various ways to do it 'properly' and, after failing, am resorting to truly hideous hacks in the (out-of-tree driver) code - so at least other platforms won't be impacted.
On Mon, May 19, 2014 at 09:21:17AM +0100, Jon Medhurst (Tixy) wrote:
On Fri, 2014-05-16 at 18:40 +0100, Catalin Marinas wrote:
On Fri, May 16, 2014 at 06:08:45PM +0100, Jon Medhurst (Tixy) wrote:
On Fri, 2014-05-16 at 13:55 +0100, Catalin Marinas wrote: [...]
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit.
Not if you have CONFIG_DMA_CMA. Unless I have misread the code, enabling CMA means memory comes from a common pool carved out at boot with no way for drivers to specify it's restrictions [1]. It's what I've spent most of the week trying to work around in a clean way, and have finally given up.
The easiest "hack" would be to pass a limit dma_contiguous_reserve() in arm64_memblock_init(), something like this:
That is by far the easiest but I dismissed it because it affects all platforms built from that source tree; and if the hack were extended to include a kernel config option, that still may not work on a single kernel binary expected to work on multiple platforms. Basically, I've tried various was to do it 'properly' and after failing am resorting to truly hideous hacks to the (out-of-tree driver) code - so at least other platforms won't be impacted.
Can you set a specific reserved memory region in the DT to be used by CMA (via linux,cma-default), or is it just for the default size?
On arm64 we enabled CONFIG_DMA_CMA by default, but compared to swiotlb it doesn't honour GFP_DMA. The arm32 port sets the CMA limit to arm_dma_limit, which is 32-bit or a SoC-defined one. So I'm tempted to default to 32-bit as well if it can be overridden via DT.
On Thu, 2014-05-22 at 16:59 +0100, Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:21:17AM +0100, Jon Medhurst (Tixy) wrote:
On Fri, 2014-05-16 at 18:40 +0100, Catalin Marinas wrote:
On Fri, May 16, 2014 at 06:08:45PM +0100, Jon Medhurst (Tixy) wrote:
On Fri, 2014-05-16 at 13:55 +0100, Catalin Marinas wrote: [...]
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit.
Not if you have CONFIG_DMA_CMA. Unless I have misread the code, enabling CMA means memory comes from a common pool carved out at boot with no way for drivers to specify it's restrictions [1]. It's what I've spent most of the week trying to work around in a clean way, and have finally given up.
The easiest "hack" would be to pass a limit dma_contiguous_reserve() in arm64_memblock_init(), something like this:
That is by far the easiest but I dismissed it because it affects all platforms built from that source tree; and if the hack were extended to include a kernel config option, that still may not work on a single kernel binary expected to work on multiple platforms. Basically, I've tried various was to do it 'properly' and after failing am resorting to truly hideous hacks to the (out-of-tree driver) code - so at least other platforms won't be impacted.
Can you set a specific reserved memory region in the DT to be used by CMA (via linux,cma-default), or it's just for the default size?
The bindings and infrastructure got half merged in 3.15 but the patches to actually make this usable in drivers are stalled... http://lkml.org/lkml/2014/5/14/201
On arm64 we enabled CONFIG_DMA_CMA by default but compared to swiotlb it doesn't honour GFP_DMA. The arm32 port sets the CMA limit to arm_dma_limit which is 32-bit or a SoC-define one. So I'm tempted to default to 32-bit as well if it can be overridden via DT.
I believe the CMA pool could have been overridden if the stalled patches had been accepted. I believe the specific patch for that is http://lkml.org/lkml/2014/2/28/278
The bindings for all this CMA stuff are in Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
On Friday 16 May 2014 13:55:01 Catalin Marinas wrote:
On Thu, May 15, 2014 at 05:53:53PM +0100, Liviu Dudau wrote:
On Thu, May 15, 2014 at 04:36:25PM +0100, Alan Stern wrote:
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
arm64 architecture handles correctly 64bit DMAs and can enable support for 64bit EHCI host controllers.
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit. Which kernel version is this?
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver. Depending on the DT properties of the device and its parents, this needs to do one of three things:
a) translate the 64-bit virtual address into a 64-bit bus address
b) create an IOMMU entry for the 64-bit address and pass the 32-bit IOMMU
   address to the driver
c) use the swiotlb code to create a bounce buffer at a 32-bit DMA address
   and copy the data around
It's definitely wrong to just hardcode a DMA mask in the driver because that code doesn't know which of the three cases is being used. Moreover, you can't do it using an #ifdef CONFIG_ARM64, because it's completely independent of the architecture, and we need to do the exact same logic on ARM32 and any other architecture.
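From the driver's side the call is identical in all three cases; which of a), b) or c) actually happens is decided by the architecture's dma_map_ops. A minimal illustrative sketch:

	/* the driver neither knows nor cares how the bus address is produced */
	nents = dma_map_sg(dev, sglist, count, DMA_TO_DEVICE);
	if (!nents)
		return -ENOMEM;
	/* ... program the controller using sg_dma_address()/sg_dma_len() ... */
	dma_unmap_sg(dev, sglist, count, DMA_TO_DEVICE);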
Arnd
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
On Friday 16 May 2014 13:55:01 Catalin Marinas wrote:
On Thu, May 15, 2014 at 05:53:53PM +0100, Liviu Dudau wrote:
On Thu, May 15, 2014 at 04:36:25PM +0100, Alan Stern wrote:
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
On Wed, 14 May 2014, Mark Brown wrote:
> arm64 architecture handles correctly 64bit DMAs and can enable support
> for 64bit EHCI host controllers.
Did you folks tested this for all sorts of host controllers? I have no way to verify that it works, and last I heard, many (or even most) controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit. Which kernel version is this?
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver. Depending on the DT properties of the device and its parents, this needs to do one of three things:
a) translate the 64-bit virtual address into a 64-bit bus address b) create an IOMMU entry for the 64-bit address and pass the 32-bit IOMMU address to the driver c) use the swiotlb code to create a bounce buffer at a 32-bit DMA address and copy the data around
It's definitely wrong to just hardcode a DMA mask in the driver because that code doesn't know which of the three cases is being used. Moreover, you can't do it using an #ifdef CONFIG_ARM64, because it's completely independent of the architecture, and we need to do the exact same logic on ARM32 and any other architecture.
I agree.
The problem we currently have is describing the system topology so that the DMA mask can be passed down in a hierarchical way. I can see Santosh's patches introducing dma-ranges, but the coherent dma mask is still set to 32-bit. We can use the dma-ranges to infer a mask, but that's only specific to the device, and the driver doesn't know whether it goes through an iommu or not.
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
On Friday 16 May 2014 13:55:01 Catalin Marinas wrote:
On Thu, May 15, 2014 at 05:53:53PM +0100, Liviu Dudau wrote:
On Thu, May 15, 2014 at 04:36:25PM +0100, Alan Stern wrote:
On Thu, 15 May 2014, Liviu Dudau wrote:
On Thu, May 15, 2014 at 03:11:48PM +0100, Alan Stern wrote:
> On Wed, 14 May 2014, Mark Brown wrote:
> > arm64 architecture handles correctly 64bit DMAs and can enable support
> > for 64bit EHCI host controllers.
>
> Did you folks tested this for all sorts of host controllers? I have no
> way to verify that it works, and last I heard, many (or even most)
> controllers don't work right with 64-bit DMA.
I have tested it with a host controller that is capable of 64-bit DMA and without this change it doesn't work.
What do you mean it doesn't work? Can't the host controller use 32-bit DMA?
It could if arm64 would restrict the DMA addresses to 32-bit, but it doesn't and I end up on my platform with USB DMA buffers allocated >4GB address.
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit. Which kernel version is this?
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver. Depending on the DT properties of the device and its parents, this needs to do one of three things:
a) translate the 64-bit virtual address into a 64-bit bus address b) create an IOMMU entry for the 64-bit address and pass the 32-bit IOMMU address to the driver c) use the swiotlb code to create a bounce buffer at a 32-bit DMA address and copy the data around
It's definitely wrong to just hardcode a DMA mask in the driver because that code doesn't know which of the three cases is being used. Moreover, you can't do it using an #ifdef CONFIG_ARM64, because it's completely independent of the architecture, and we need to do the exact same logic on ARM32 and any other architecture.
I agree.
The problem we currently have is system topology description to pass the DMA mask and in a hierarchical way. I can see Santosh's patches introducing dma-ranges but the coherent dma mask still set as 32-bit. We can use the dma-ranges to infer a mask but that's only specific to the device and the driver doesn't know whether it goes through an iommu or not.
We definitely have to fix this very quickly, before people start building real arm64 systems and shipping them.
We should not merge any hacks that attempt to work around the problem, but try to come to a conclusion how to handle them properly. My hope was that we could just always set the dma mask to whatever the DT says it should be to keep the burden from device drivers, unless they want to restrict it further (e.g. when the specific peripheral hardware has a bug that prevents us from using high addresses, even though the SoC in theory supports it everywhere).
Rob Herring argued that we should always mimic PCI and call dma_set_mask() in drivers but default to a 32-bit mask otherwise, independent of whether the hardware can do more or less than that, IIRC.
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have them on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
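A very rough sketch of what such a hook could look like (the function name is hypothetical, and deciding when to refuse a mask is exactly the open question in this thread):

static int arm64_dma_set_mask(struct device *dev, u64 mask)
{
	/*
	 * swiotlb can bounce into ZONE_DMA, so wide masks are acceptable;
	 * only refuse masks that the bounce buffer itself cannot satisfy.
	 */
	if (!dma_supported(dev, mask))
		return -EIO;

	*dev->dma_mask = mask;
	return 0;
}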
Arnd
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver. Depending on the DT properties of the device and its parents, this needs to do one of three things:
a) translate the 64-bit virtual address into a 64-bit bus address b) create an IOMMU entry for the 64-bit address and pass the 32-bit IOMMU address to the driver c) use the swiotlb code to create a bounce buffer at a 32-bit DMA address and copy the data around
It's definitely wrong to just hardcode a DMA mask in the driver because that code doesn't know which of the three cases is being used. Moreover, you can't do it using an #ifdef CONFIG_ARM64, because it's completely independent of the architecture, and we need to do the exact same logic on ARM32 and any other architecture.
I agree.
The problem we currently have is system topology description to pass the DMA mask and in a hierarchical way. I can see Santosh's patches introducing dma-ranges but the coherent dma mask still set as 32-bit. We can use the dma-ranges to infer a mask but that's only specific to the device and the driver doesn't know whether it goes through an iommu or not.
We definitely have to fix this very quickly, before people start building real arm64 systems and shipping them.
We should not merge any hacks that attempt to work around the problem, but try to come to a conclusion how to handle them properly. My hope was that we could just always set the dma mask to whatever the DT says it should be to keep the burden from device drivers, unless they want to restrict it further (e.g. when the specific peripheral hardware has a bug that prevents us from using high addresses, even though the SoC in theory supports it everywhere).
I agree.
Rob Herring argued that we should always mimic PCI and call dma_set_mask() in drivers but default to a 32-bit mask otherwise, independent of whether the hardware can do more or less than that, IIRC.
Can we not default to something built up from dma-ranges? Or 32-bit if dma-ranges property is missing?
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that gets is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA, which currently sits within the 4G limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With IOMMU we can be more flexible around dma_set_mask(), can be done at run-time.
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver. Depending on the DT properties of the device and its parents, this needs to do one of three things:
a) translate the 64-bit virtual address into a 64-bit bus address b) create an IOMMU entry for the 64-bit address and pass the 32-bit IOMMU address to the driver c) use the swiotlb code to create a bounce buffer at a 32-bit DMA address and copy the data around
It's definitely wrong to just hardcode a DMA mask in the driver because that code doesn't know which of the three cases is being used. Moreover, you can't do it using an #ifdef CONFIG_ARM64, because it's completely independent of the architecture, and we need to do the exact same logic on ARM32 and any other architecture.
I agree.
The problem we currently have is system topology description to pass the DMA mask and in a hierarchical way. I can see Santosh's patches introducing dma-ranges but the coherent dma mask still set as 32-bit. We can use the dma-ranges to infer a mask but that's only specific to the device and the driver doesn't know whether it goes through an iommu or not.
We definitely have to fix this very quickly, before people start building real arm64 systems and shipping them.
We should not merge any hacks that attempt to work around the problem, but try to come to a conclusion how to handle them properly. My hope was that we could just always set the dma mask to whatever the DT says it should be to keep the burden from device drivers, unless they want to restrict it further (e.g. when the specific peripheral hardware has a bug that prevents us from using high addresses, even though the SoC in theory supports it everywhere).
I agree.
Rob Herring argued that we should always mimic PCI and call dma_set_mask() in drivers but default to a 32-bit mask otherwise, independent of whether the hardware can do more or less than that, IIRC.
Can we not default to something built up from dma-ranges? Or 32-bit if dma-ranges property is missing?
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that gets is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA which is currently in the 4G limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With IOMMU we can be more flexible around dma_set_mask(), can be done at run-time.
What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA. If it ever is, we have to fail dma_set_mask and hope the driver can fall back to PIO mode or it will have to fail its probe() function.
For dma_set_coherent_mask(), we also have to fail any call that tries to set a mask larger than what the device hardware can do. Unlike that, dma_set_mask() can succeed with any mask; we just have to enable swiotlb if the mask that the driver wants is larger than what the hardware can do.
Arnd
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver. Depending on the DT properties of the device and its parents, this needs to do one of three things:
a) translate the 64-bit virtual address into a 64-bit bus address b) create an IOMMU entry for the 64-bit address and pass the 32-bit IOMMU address to the driver c) use the swiotlb code to create a bounce buffer at a 32-bit DMA address and copy the data around
It's definitely wrong to just hardcode a DMA mask in the driver because that code doesn't know which of the three cases is being used. Moreover, you can't do it using an #ifdef CONFIG_ARM64, because it's completely independent of the architecture, and we need to do the exact same logic on ARM32 and any other architecture.
I agree.
The problem we currently have is system topology description to pass the DMA mask and in a hierarchical way. I can see Santosh's patches introducing dma-ranges but the coherent dma mask still set as 32-bit. We can use the dma-ranges to infer a mask but that's only specific to the device and the driver doesn't know whether it goes through an iommu or not.
We definitely have to fix this very quickly, before people start building real arm64 systems and shipping them.
We should not merge any hacks that attempt to work around the problem, but try to come to a conclusion how to handle them properly. My hope was that we could just always set the dma mask to whatever the DT says it should be to keep the burden from device drivers, unless they want to restrict it further (e.g. when the specific peripheral hardware has a bug that prevents us from using high addresses, even though the SoC in theory supports it everywhere).
I agree.
Rob Herring argued that we should always mimic PCI and call dma_set_mask() in drivers but default to a 32-bit mask otherwise, independent of whether the hardware can do more or less than that, IIRC.
Can we not default to something built up from dma-ranges? Or 32-bit if dma-ranges property is missing?
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges property was per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that gets is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA which is currently in the 4G limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With IOMMU we can be more flexible around dma_set_mask(), can be done at run-time.
What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA. If it ever is, we have to fail dma_set_mask and hope the driver can fall back to PIO mode or it will have to fail its probe() function.
dma_set_(coherent_)mask checks swiotlb_dma_supported(), which returns false if io_tlb_end goes beyond the device mask. So we just need to ensure that io_tlb is allocated within ZONE_DMA.
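For reference, that check is essentially the following (paraphrasing the lib/swiotlb.c of this era, so treat it as a sketch rather than the exact source):

int swiotlb_dma_supported(struct device *hwdev, u64 mask)
{
	/* the device must be able to reach the end of the bounce buffer */
	return phys_to_dma(hwdev, io_tlb_end - 1) <= mask;
}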
For dma_set_coherent_mask(), we also have to fail any call that tries to set a mask larger than what the device hardware can do. Unlike that, dma_set_mask() can succeed with any mask, we just have to enable swiotlb if the mask that the driver wants is larger than what the hardware can do.
Currently we can't satisfy any arbitrarily small dma mask even with swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA. Swiotlb allows for smaller masks but we need to reserve the io_tlb buffer early during boot and at smaller addresses. For example, swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if the coherent_dma_mask isn't matched, it frees the pages and falls back to the io_tlb buffer. However, I don't think it's worth going for masks smaller than 32-bit on arm64.
CMA is pretty similar to swiotlb with regards to pre-allocated buffers for coherent dma. We currently don't limit it for arm64 but I think we should just limit it to ZONE_DMA because we can't tell what masks the devices need. We could parse the DT for dma-ranges but we can still have explicit dma_set_coherent_mask() calls to make it smaller.
Yet another issue is what we actually mean by ZONE_DMA. If we have devices with different dma_pfn_offset (as per Santosh's patches), ZONE_DMA would mean different things for them since phys_to_dma() may no longer be the same for a single SoC.
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that gets is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA which is currently in the 4G limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With IOMMU we can be more flexible around dma_set_mask(), can be done at run-time.
What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA. If it ever is, we have to fail dma_set_mask and hope the driver can fall back to PIO mode or it will have to fail its probe() function.
dma_set_(coherent_)mask check swiotlb_dma_supported() which returns false if io_tlb_end goes beyond the device mask. So we just need to ensure that io_tlb is allocated within ZONE_DMA.
Makes sense for dma_set_mask. Why do you do the same thing for coherent_mask? Shouldn't that check against ZONE_DMA instead?
For dma_set_coherent_mask(), we also have to fail any call that tries to set a mask larger than what the device hardware can do. Unlike that, dma_set_mask() can succeed with any mask, we just have to enable swiotlb if the mask that the driver wants is larger than what the hardware can do.
Currently we can't satisfy any arbitrarily small dma mask even with swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA. Swiotlb allows for smaller masks but we need to reserve the io_tlb buffer early during boot and at smaller addresses. For example, swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if the coherent_dma_mask isn't matched, it frees the pages and falls back to the io_tlb buffer. However, I don't think it's worth going for masks smaller than 32-bit on arm64.
Is that safe for noncoherent systems? I'd expect the io_tlb buffer to be cached there, which means we can't use it for coherent allocations.
CMA is pretty similar to swiotlb with regards to pre-allocated buffers for coherent dma. We currently don't limit it for arm64 but I think we should just limit it to ZONE_DMA because we can't tell what masks the devices need. We could parse the DT for dma-ranges but we can still have explicit dma_set_coherent_mask() calls to make it smaller.
Yet another issue is what we actually mean by ZONE_DMA. If we have devices with different dma_pfn_offset (as per Santosh's patches), ZONE_DMA would mean different things for them since phys_to_dma() may no longer be the same for a single SoC.
I never figured out how that works.
Arnd
On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
You are right, I missed the fact that of_dma_get_range() checks the dma-ranges property in the node's parent.
So what we need is to set the default dma mask based on the size information in dma-ranges, something like this:
	mask = rounddown_pow_of_two(size) - 1;
	dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
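Spelled out a little more as a sketch (of_dma_get_range() is the helper from Santosh's series mentioned above; its exact signature and the error handling are assumed here):

	u64 dma_addr, paddr, size;

	if (!of_dma_get_range(dev->of_node, &dma_addr, &paddr, &size)) {
		u64 mask = rounddown_pow_of_two(size) - 1;

		dma_set_mask_and_coherent(dev, mask);
	}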
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that gets is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA which is currently in the 4G limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With IOMMU we can be more flexible around dma_set_mask(), can be done at run-time.
What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA. If it ever is, we have to fail dma_set_mask and hope the driver can fall back to PIO mode or it will have to fail its probe() function.
dma_set_(coherent_)mask check swiotlb_dma_supported() which returns false if io_tlb_end goes beyond the device mask. So we just need to ensure that io_tlb is allocated within ZONE_DMA.
Makes sense for dma_set_mask. Why do you do the same thing for coherent_mask? Shouldn't that check against ZONE_DMA instead?
It depends on the meaning of coherent_dma_mask. As I understand it, that's the dma mask used by dma_alloc_coherent() to return a memory buffer that the device is able to access. I don't see it as much different from the dma_mask used by the streaming API. I guess some hardware could have different masks here if they have cache coherency only for certain memory ranges (and the coherent_dma_mask would be smaller than dma_mask).
For dma_set_coherent_mask(), we also have to fail any call that tries to set a mask larger than what the device hardware can do. Unlike that, dma_set_mask() can succeed with any mask, we just have to enable swiotlb if the mask that the driver wants is larger than what the hardware can do.
Currently we can't satisfy any arbitrarily small dma mask even with swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA. Swiotlb allows for smaller masks but we need to reserve the io_tlb buffer early during boot and at smaller addresses. For example, swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if the coherent_dma_mask isn't matched, it frees the pages and falls back to the io_tlb buffer. However, I don't think it's worth going for masks smaller than 32-bit on arm64.
Is that safe for noncoherent systems? I'd expect the io_tlb buffer to be cached there, which means we can't use it for coherent allocations.
For non-coherent systems, the current arm64 approach is to take the address returned by swiotlb_alloc_coherent() and remap it as Normal NonCacheable (see the arm64 __dma_alloc_noncoherent()). The CPU should only access the dma buffer via the non-cacheable mapping.
Of course, another approach would be to change the cacheability attributes of the io_tlb buffer (or the CMA one), but the current approach has the advantage that the io_tlb buffer can be used for both coherent and non-coherent devices.
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
You are right, I missed the fact that of_dma_get_range() checks the dma-ranges property in the node's parent.
So what we need is setting the default dma mask based on the size information in dma-ranges to something like this:
	mask = rounddown_pow_of_two(size) - 1;
	dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
I don't think we should be calling dma_set_mask here, since that would just go and parse the same property again once we fix the implementation.
Since this is a low-level helper, we can probably just assign the dma mask directly.
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that gets is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA which is currently in the 4G limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With IOMMU we can be more flexible around dma_set_mask(), can be done at run-time.
What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA. If it ever is, we have to fail dma_set_mask and hope the driver can fall back to PIO mode or it will have to fail its probe() function.
dma_set_(coherent_)mask check swiotlb_dma_supported() which returns false if io_tlb_end goes beyond the device mask. So we just need to ensure that io_tlb is allocated within ZONE_DMA.
Makes sense for dma_set_mask. Why do you do the same thing for coherent_mask? Shouldn't that check against ZONE_DMA instead?
It depends on the meaning of coherent_dma_mask. As I understand it, that's the dma mask used for dma_alloc_coherent() to return a memory buffer that the device is able to access. I don't see it much different from the dma_mask used by the streaming API. I guess some hardware could have different masks here if they have cache coherency only for certain memory ranges (and the coherent_dma_mask would be smaller than dma_mask).
The point I was trying to make is a different one: For the streaming mapping, you can allow any mask as long as the swiotlb is able to create a bounce buffer. This does not work for the coherent mapping, because that by definition cannot use bounce buffers.
For dma_set_coherent_mask(), we also have to fail any call that tries to set a mask larger than what the device hardware can do. Unlike that, dma_set_mask() can succeed with any mask, we just have to enable swiotlb if the mask that the driver wants is larger than what the hardware can do.
Currently we can't satisfy any arbitrarily small dma mask even with swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA. Swiotlb allows for smaller masks but we need to reserve the io_tlb buffer early during boot and at smaller addresses. For example, swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if the coherent_dma_mask isn't matched, it frees the pages and falls back to the io_tlb buffer. However, I don't think it's worth going for masks smaller than 32-bit on arm64.
Is that safe for noncoherent systems? I'd expect the io_tlb buffer to be cached there, which means we can't use it for coherent allocations.
For non-coherent systems, the current arm64 approach is to take the address returned by swiotlb_alloc_coherent() and remap it as Normal NonCacheable (see the arm64 __dma_alloc_noncoherent()). The CPU should only access the dma buffer via the non-cacheable mapping.
Doesn't that require the allocated page to be mapped using small pages in the linear mapping? As far as I understand it, you are not allowed to create a noncacheable mapping for a page that also has a cachable mapping, or did that change between armv7 and armv8?
For CMA, we use the trick to change the mapping in place, but that doesn't work for normal pages, which are mapped using huge tlbs.
Of course, another approach would be to change the cacheability attributes of the io_tlb buffer (or the CMA one) but the current approach has the advantage that the io_tlb buffer can be used for both coherent and non-coherent devices.
That definitely shouldn't be done without performance testing. However, I wonder if such measurements were done in the first place for the swiotlb: it's possible that it's cheaper to have the bounce buffer for noncoherent devices mapped noncacheable, so we do a cache-bypassing copy into and out of the bounce buffer and save the extra flushes.
Arnd
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
You are right, I missed the fact that of_dma_get_range() checks the dma-ranges property in the node's parent.
So what we need is setting the default dma mask based on the size information in dma-ranges to something like this:
mask = rounddown_pow_of_two(size) - 1;
dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
I don't think we should be calling dma_set_mask here, since that would just go and parse the same property again once we fix the implementation.
Since this is a low-level helper, we can probably just assign the dma mask directly.
I was thinking of calling it in of_dma_configure() as that's where we already set the default dma_mask and coherent_dma_mask. Default to 32-bit if no dma-ranges are present.
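Something along these lines in of_dma_configure() (sketch only; of_dma_get_range() as in Santosh's series, and no capping applied yet):

	u64 dma_addr, paddr, size;
	u64 mask = DMA_BIT_MASK(32);	/* default when dma-ranges is absent */

	if (!of_dma_get_range(dev->of_node, &dma_addr, &paddr, &size))
		mask = rounddown_pow_of_two(size) - 1;

	dev->coherent_dma_mask = mask;
	if (!dev->dma_mask)
		dev->dma_mask = &dev->coherent_dma_mask;
	*dev->dma_mask = mask;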
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA, which currently sits within the 4GB limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With an IOMMU we can be more flexible around dma_set_mask(); it can be done at run-time.
What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA. If it ever is, we have to fail dma_set_mask and hope the driver can fall back to PIO mode or it will have to fail its probe() function.
dma_set_(coherent_)mask checks swiotlb_dma_supported(), which returns false if io_tlb_end goes beyond the device mask. So we just need to ensure that the io_tlb buffer is allocated within ZONE_DMA.
Makes sense for dma_set_mask. Why do you do the same thing for coherent_mask? Shouldn't that check against ZONE_DMA instead?
It depends on the meaning of coherent_dma_mask. As I understand it, that's the dma mask used for dma_alloc_coherent() to return a memory buffer that the device is able to access. I don't see it much different from the dma_mask used by the streaming API. I guess some hardware could have different masks here if they have cache coherency only for certain memory ranges (and the coherent_dma_mask would be smaller than dma_mask).
The point I was trying to make is a different one: For the streaming mapping, you can allow any mask as long as the swiotlb is able to create a bounce buffer.
dma_mask is a hardware parameter and is used by the streaming DMA API implementation (which could be swiotlb) to decide whether to bounce or not. The streaming DMA API can take any CPU address as input, independent of the dma_mask, and bounce accordingly. If it doesn't have a bounce buffer matching the dma_mask, dma_supported() would return false.
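(In the swiotlb map path the per-buffer decision is roughly:)

	/* sketch of the decision in swiotlb_map_page() */
	dma_addr_t dev_addr = phys_to_dma(dev, phys);

	if (dma_capable(dev, dev_addr, size) && !swiotlb_force)
		return dev_addr;		/* device can reach it, no bounce */

	/* otherwise copy through a slot in the io_tlb bounce buffer */
	return phys_to_dma(dev, map_single(dev, phys, size, dir));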
This does not work for the coherent mapping, because that by definition cannot use bounce buffers.
Yes but most of the time these masks have the same value.
For dma_set_coherent_mask(), we also have to fail any call that tries to set a mask larger than what the device hardware can do. Unlike that, dma_set_mask() can succeed with any mask, we just have to enable swiotlb if the mask that the driver wants is larger than what the hardware can do.
Currently we can't satisfy any arbitrarily small dma mask even with swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA. Swiotlb allows for smaller masks but we need to reserve the io_tlb buffer early during boot and at smaller addresses. For example, swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if the coherent_dma_mask isn't matched, it frees the pages and falls back to the io_tlb buffer. However, I don't think it's worth going for masks smaller than 32-bit on arm64.
Is that safe for noncoherent systems? I'd expect the io_tlb buffer to be cached there, which means we can't use it for coherent allocations.
For non-coherent systems, the current arm64 approach is to take the address returned by swiotlb_alloc_coherent() and remap it as Normal NonCacheable (see the arm64 __dma_alloc_noncoherent()). The CPU should only access the dma buffer via the non-cacheable mapping.
Doesn't that require the allocated page to be mapped using small pages in the linear mapping?
Not in the linear mapping but in the vmalloc space (for arm64).
As far as I understand it, you are not allowed to create a noncacheable mapping for a page that also has a cachable mapping, or did that change between armv7 and armv8?
No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what "unpredictable" means when you mix different attributes. Basically, we can mix the same memory type (Normal Cacheable vs Normal NonCacheable) but if the cacheability is different, we need to do some cache maintenance before accessing the non-cacheable mapping the first time (clear potentially dirty lines that could be evicted randomly). Any speculatively loaded cache lines via the cacheable mapping would not be hit when accessed via the non-cacheable one.
For CMA, we use the trick to change the mapping in place, but that doesn't work for normal pages, which are mapped using huge tlbs.
That's true for arm32 and we could do the same for arm64, though I don't think it's worth it because we could break very huge mappings (e.g. 1GB). We have plenty of VA space and we just create a new non-coherent mapping (this latter part could be optimised, even generically, to use huge pages where possible).
Of course, another approach would be to change the cacheability attributes of the io_tlb buffer (or the CMA one) but the current approach has the advantage that the io_tlb buffer can be used for both coherent and non-coherent devices.
That definitely shouldn't be done without performance testing. However, I wonder if such measurements were done in the first place for the swiotlb: it's possible that it's cheaper to have the bounce buffer for noncoherent devices mapped noncacheable, so we do a cache-bypassing copy into and out of the bounce buffer and save the extra flushes.
It doesn't do this AFAICT, but it's an interesting idea. Well, to be optimised when we can benchmark this on hardware.
One issue though is when some (or all) devices are coherent. In this case, even if you use a bounce buffer, it needs to be mapped coherently at the CPU level (e.g. the device has its own cache that can be snooped, so CPU accesses need to be cacheable).
On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
You are right, I missed the fact that of_dma_get_range() checks the dma-ranges property in the node's parent.
So what we need is setting the default dma mask based on the size information in dma-ranges to something like this:
mask = rounddown_pow_of_two(size) - 1;
dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
I don't think we should be calling dma_set_mask here, since that would just go and parse the same property again once we fix the implementation.
Since this is a low-level helper, we can probably just assign the dma mask directly.
I was thinking of calling it in of_dma_configure() as that's where we already set the default dma_mask and coherent_dma_mask. Default to 32-bit if no dma-ranges are present.
Right. Actually it should also be capped to 32-bit, to allow compatibility with drivers that don't call dma_set_mask and can't do 64-bit DMA. This is the normal behavior for PCI drivers. They need to set a 64-bit mask and check the result.
While we currently don't have a set of swiotlb DMA ops on ARM32, we do have it on ARM64, and I think we should be using them properly. It should really not be hard to implement a proper dma_set_mask() function for ARM64 that is able to set up the swiotlb based on the dma-ranges properties and always returns success but leaves the mask unchanged.
The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise we don't have any guarantees. Since we can't honour random masks anyway, we stick to ZONE_DMA, which currently sits within the 4GB limit. But the driver calls dma_set_mask() too late for any further swiotlb setup.
With an IOMMU we can be more flexible around dma_set_mask(); it can be done at run-time.
What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA. If it ever is, we have to fail dma_set_mask and hope the driver can fall back to PIO mode or it will have to fail its probe() function.
dma_set_(coherent_)mask checks swiotlb_dma_supported(), which returns false if io_tlb_end goes beyond the device mask. So we just need to ensure that the io_tlb buffer is allocated within ZONE_DMA.
Makes sense for dma_set_mask. Why do you do the same thing for coherent_mask? Shouldn't that check against ZONE_DMA instead?
It depends on the meaning of coherent_dma_mask. As I understand it, that's the dma mask used for dma_alloc_coherent() to return a memory buffer that the device is able to access. I don't see it much different from the dma_mask used by the streaming API. I guess some hardware could have different masks here if they have cache coherency only for certain memory ranges (and the coherent_dma_mask would be smaller than dma_mask).
The point I was trying to make is a different one: For the streaming mapping, you can allow any mask as long as the swiotlb is able to create a bounce buffer.
dma_mask is a hardware parameter and is used by the streaming DMA API implementation (which could be swiotlb) to decide whether to bounce or not. The streaming DMA API can take any CPU address as input, independent of the dma_mask, and bounce accordingly. If it doesn't have a bounce buffer matching the dma_mask, dma_supported() would return false.
This does not work for the coherent mapping, because that by definition cannot use bounce buffers.
Yes but most of the time these masks have the same value.
Let me start again with an example:
io_tlb_end = 0x1000000;    // 16 MB
dma_mask   = 0x10000000;   // 256 MB
ZONE_DMA   = 0x100000000;  // 4 GB
max_pfn    = 0x1000000000; // 64 GB
The device driver has set dma_mask and coherent_dma_mask to 256 MB because of a limitation in the hardware. This succeeded because io_tlb_end is well within the dma mask.
Since the coherent_dma_mask is smaller than max_pfn, the swiotlb version of dma_alloc_coherent then allocates the buffer using GFP_DMA. However, that likely returns an address above dma_mask/coherent_dma_mask.
My point was that to fix this case, you must not compare the coherent_dma_mask requested by the device against io_tlb_end but against ZONE_DMA.
For dma_set_coherent_mask(), we also have to fail any call that tries to set a mask larger than what the device hardware can do. Unlike that, dma_set_mask() can succeed with any mask, we just have to enable swiotlb if the mask that the driver wants is larger than what the hardware can do.
Currently we can't satisfy any arbitrarily small dma mask even with swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA. Swiotlb allows for smaller masks but we need to reserve the io_tlb buffer early during boot and at smaller addresses. For example, swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if the coherent_dma_mask isn't matched, it frees the pages and falls back to the io_tlb buffer. However, I don't think it's worth going for masks smaller than 32-bit on arm64.
Is that safe for noncoherent systems? I'd expect the io_tlb buffer to be cached there, which means we can't use it for coherent allocations.
For non-coherent systems, the current arm64 approach is to take the address returned by swiotlb_alloc_coherent() and remap it as Normal NonCacheable (see the arm64 __dma_alloc_noncoherent()). The CPU should only access the dma buffer via the non-cacheable mapping.
Doesn't that require the allocated page to be mapped using small pages in the linear mapping?
Not in the linear mapping but in the vmalloc space (for arm64).
Ok
As far as I understand it, you are not allowed to create a noncacheable mapping for a page that also has a cachable mapping, or did that change between armv7 and armv8?
No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what "unpredictable" means when you mix different attributes. Basically, we can mix the same memory type (Normal Cacheable vs Normal NonCacheable) but if the cacheability is different, we need to do some cache maintenance before accessing the non-cacheable mapping the first time (clear potentially dirty lines that could be evicted randomly). Any speculatively loaded cache lines via the cacheable mapping would not be hit when accessed via the non-cacheable one.
Ah, only for the first access? I had remembered this differently, thanks for the clarification.
For CMA, we use the trick to change the mapping in place, but that doesn't work for normal pages, which are mapped using huge tlbs.
That's true for arm32 and we could do the same for arm64, though I don't think it's worth it because we could break very huge mappings (e.g. 1GB). We have plenty of VA space and we just create a new non-coherent mapping (this latter part could be optimised, even generically, to use huge pages where possible).
I thought we specifically did this on arm32 to avoid the duplicate mapping because we had concluded that it would be broken even with the updated ARMv7 definition.
Of course, another approach would be to change the cacheability attributes of the io_tlb buffer (or the CMA one) but the current approach has the advantage that the io_tlb buffer can be used for both coherent and non-coherent devices.
That definitely shouldn't be done without performance testing. However, I wonder if such measurements were done in the first place for the swiotlb: it's possible that it's cheaper to have the bounce buffer for noncoherent devices mapped noncacheable, so we do a cache-bypassing copy into and out of the bounce buffer and save the extra flushes.
It doesn't do this AFAICT, but it's an interesting idea. Well, to be optimised when we can benchmark this on hardware.
One issue though is when some (or all) devices are coherent. In this case, even if you use a bounce buffer, it needs to be mapped coherently at the CPU level (e.g. the device has its own cache that can be snooped, so CPU accesses need to be cacheable).
Do you mean there are cases where an uncached buffer is not coherent with a device cache?
Arnd
On Wed, May 21, 2014 at 9:43 AM, Arnd Bergmann arnd@arndb.de wrote:
On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
You are right, I missed the fact that of_dma_get_range() checks the dma-ranges property in the node's parent.
So what we need is setting the default dma mask based on the size information in dma-ranges to something like this:
mask = rounddown_pow_of_two(size) - 1;
dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
I don't think we should be calling dma_set_mask here, since that would just go and parse the same property again once we fix the implementation.
Since this is a low-level helper, we can probably just assign the dma mask directly.
I was thinking of calling it in of_dma_configure() as that's where we already set the default dma_mask and coherent_dma_mask. Default to 32-bit if no dma-ranges are present.
Right. Actually it should also be capped to 32-bit, to allow compatibility with drivers that don't call dma_set_mask and can't do 64-bit DMA. This is the normal behavior for PCI drivers. They need to set a 64-bit mask and check the result.
What are you checking against to cause a failure and what do you do on failure? I'm guessing that PCI masks are compared to the mask of parent bridges and PCI devices just set the mask to 32-bit if 64-bit fails. That doesn't work if your mask needs to be somewhere between 32 and 64-bit due to some bus constraints. Perhaps that's not something we need to worry about until we see hardware with that condition.
Rob
On Wednesday 21 May 2014 10:26:01 Rob Herring wrote:
On Wed, May 21, 2014 at 9:43 AM, Arnd Bergmann arnd@arndb.de wrote:
On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
You are right, I missed the fact that of_dma_get_range() checks the dma-ranges property in the node's parent.
So what we need is setting the default dma mask based on the size information in dma-ranges to something like this:
mask = rounddown_pow_of_two(size) - 1;
dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
I don't think we should be calling dma_set_mask here, since that would just go and parse the same property again once we fix the implementation.
Since this is a low-level helper, we can probably just assign the dma mask directly.
I was thinking of calling it in of_dma_configure() as that's where we already set the default dma_mask and coherent_dma_mask. Default to 32-bit if no dma-ranges are present.
Right. Actually it should also be capped to 32-bit, to allow compatibility with drivers that don't call dma_set_mask and can't do 64-bit DMA. This is the normal behavior for PCI drivers. They need to set a 64-bit mask and check the result.
What are you checking against to cause a failure and what do you do on failure? I'm guessing that PCI masks are compared to the mask of parent bridges and PCI devices just set the mask to 32-bit if 64-bit fails. That doesn't work if your mask needs to be somewhere between 32 and 64-bit due to some bus constraints. Perhaps that's not something we need to worry about until we see hardware with that condition.
We should compare against the size returned by of_dma_get_range(). If the mask requested by the driver is larger than the mask of the bus it's attached on, dma_set_mask should fail.
We can always allow 64-bit masks if the actual bus capability is enough to cover all the installed RAM. That is a relatively common case.
I'm not entirely sure how to handle drivers trying to set a 32-bit mask when the bus has less than 32-bits, such as for the shmobile rcar EHCI. Maybe we should succeed but cap the mask instead.
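i.e. something like this in the arch dma_set_mask() (sketch; whether to fail or cap when the driver asks for more than the bus supports is the open question, of_dma_get_range() as in Santosh's series):

	u64 dma_addr, paddr, size, bus_mask;

	if (of_dma_get_range(dev->of_node, &dma_addr, &paddr, &size))
		bus_mask = DMA_BIT_MASK(32);	/* no dma-ranges: assume 32-bit */
	else
		bus_mask = rounddown_pow_of_two(size) - 1;

	if (mask & ~bus_mask)
		return -EIO;	/* or: mask &= bus_mask; if we decide to cap */

	*dev->dma_mask = mask;
	return 0;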
Arnd
On Wed, May 21, 2014 at 10:48 AM, Arnd Bergmann arnd@arndb.de wrote:
On Wednesday 21 May 2014 10:26:01 Rob Herring wrote:
On Wed, May 21, 2014 at 9:43 AM, Arnd Bergmann arnd@arndb.de wrote:
On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
We probably want to default to 32-bit for arm32 in the absence of dma-ranges. For arm64, I'd prefer if we could always mandate dma-ranges to be present for each bus, just like we mandate ranges to be present. I hope it's not too late for that.
dma_set_mask should definitely look at the dma-ranges properties, and the helper that Santosh just introduced should give us all the information we need. We just need to decide on the correct behavior.
Last time I looked at Santosh's patches I thought the dma-ranges is per device rather than per bus. We could make it per bus only and let the device call dma_set_mask() explicitly if it wants to restrict it further.
Can you check again? I've read the code again yesterday to check this, and I concluded it was correctly doing this per bus.
You are right, I missed the fact that of_dma_get_range() checks the dma-ranges property in the node's parent.
So what we need is setting the default dma mask based on the size information in dma-ranges to something like this:
mask = rounddown_pow_of_two(size) - 1;
dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
I don't think we should be calling dma_set_mask here, since that would just go and parse the same property again once we fix the implementation.
Since this is a low-level helper, we can probably just assign the dma mask directly.
I was thinking of calling it in of_dma_configure() as that's where we already set the default dma_mask and coherent_dma_mask. Default to 32-bit if no dma-ranges are present.
Right. Actually it should also be capped to 32-bit, to allow compatibility with drivers that don't call dma_set_mask and can't do 64-bit DMA. This is the normal behavior for PCI drivers. They need to set a 64-bit mask and check the result.
What are you checking against to cause a failure and what do you do on failure? I'm guessing that PCI masks are compared to the mask of parent bridges and PCI devices just set the mask to 32-bit if 64-bit fails. That doesn't work if your mask needs to be somewhere between 32 and 64-bit due to some bus constraints. Perhaps that's not something we need to worry about until we see hardware with that condition.
We should compare against the size returned by of_dma_get_range(). If the mask requested by the driver is larger than the mask of the bus it's attached on, dma_set_mask should fail.
We can always allow 64-bit masks if the actual bus capability is enough to cover all the installed RAM. That is a relatively common case.
Agreed. However, if we check dma-ranges, it may be large enough for "all of RAM", but less than a full 64-bit. There is also the edge case that you cannot set the size to 2^64, but only 2^64 - 1. That means dma_set_mask(2^64 - 1) will always fail.
I'm not entirely sure how to handle drivers trying to set a 32-bit mask when the bus has less than 32-bits, such as for the shmobile rcar EHCI. Maybe we should succeed but cap the mask instead.
Yes, adjusting the mask is what I have been suggesting for this example and my example here.
Rob
On Wed, May 21, 2014 at 05:15:06PM +0100, Rob Herring wrote:
On Wed, May 21, 2014 at 10:48 AM, Arnd Bergmann arnd@arndb.de wrote:
On Wednesday 21 May 2014 10:26:01 Rob Herring wrote:
What are you checking against to cause a failure and what do you do on failure? I'm guessing that PCI masks are compared to the mask of parent bridges and PCI devices just set the mask to 32-bit if 64-bit fails. That doesn't work if your mask needs to be somewhere between 32 and 64-bit due to some bus constraints. Perhaps that's not something we need to worry about until we see hardware with that condition.
We should compare against the size returned by of_dma_get_range(). If the mask requested by the driver is larger than the mask of the bus it's attached on, dma_set_mask should fail.
We can always allow 64-bit masks if the actual bus capability is enough to cover all the installed RAM. That is a relatively common case.
Agreed. However, if we check dma-ranges, it may be large enough for "all of RAM", but less than a full 64-bit. There is also the edge case that you cannot set the size to 2^64, but only 2^64 - 1. That means dma_set_mask(2^64 - 1) will always fail.
Size of 0 meaning all range?
On Wed, May 21, 2014 at 11:35 AM, Catalin Marinas catalin.marinas@arm.com wrote:
On Wed, May 21, 2014 at 05:15:06PM +0100, Rob Herring wrote:
On Wed, May 21, 2014 at 10:48 AM, Arnd Bergmann arnd@arndb.de wrote:
On Wednesday 21 May 2014 10:26:01 Rob Herring wrote:
What are you checking against to cause a failure and what do you do on failure? I'm guessing that PCI masks are compared to the mask of parent bridges and PCI devices just set the mask to 32-bit if 64-bit fails. That doesn't work if your mask needs to be somewhere between 32 and 64-bit due to some bus constraints. Perhaps that's not something we need to worry about until we see hardware with that condition.
We should compare against the size returned by of_dma_get_range(). If the mask requested by the driver is larger than the mask of the bus it's attached on, dma_set_mask should fail.
We can always allow 64-bit masks if the actual bus capability is enough to cover all the installed RAM. That is a relatively common case.
Agreed. However, if we check dma-ranges, it may be large enough for "all of RAM", but less than a full 64-bit. There is also the edge case that you cannot set the size to 2^64, but only 2^64 - 1. That means dma_set_mask(2^64 - 1) will always fail.
Size of 0 meaning all range?
Sure. Either 0 or (2^64 - 1) could have special meaning. That case is easy to solve. It's the first case you cannot solve only looking at dma-ranges. You would also have to look at the max RAM address. Then what do you do if dma-ranges gives you a mask greater than 2^32 and less than max RAM address?
Rob
On Wednesday 21 May 2014 12:01:29 Rob Herring wrote:
On Wed, May 21, 2014 at 11:35 AM, Catalin Marinas catalin.marinas@arm.com wrote:
On Wed, May 21, 2014 at 05:15:06PM +0100, Rob Herring wrote:
On Wed, May 21, 2014 at 10:48 AM, Arnd Bergmann arnd@arndb.de wrote:
On Wednesday 21 May 2014 10:26:01 Rob Herring wrote:
What are you checking against to cause a failure and what do you do on failure? I'm guessing that PCI masks are compared to the mask of parent bridges and PCI devices just set the mask to 32-bit if 64-bit fails. That doesn't work if your mask needs to be somewhere between 32 and 64-bit due to some bus constraints. Perhaps that's not something we need to worry about until we see hardware with that condition.
We should compare against the size returned by of_dma_get_range(). If the mask requested by the driver is larger than the mask of the bus it's attached on, dma_set_mask should fail.
We can always allow 64-bit masks if the actual bus capability is enough to cover all the installed RAM. That is a relatively common case.
Agreed. However, if we check dma-ranges, it may be large enough for "all of RAM", but less than a full 64-bit. There is also the edge case that you cannot set the size to 2^64, but only 2^64 - 1. That means dma_set_mask(2^64 - 1) will always fail.
Size of 0 meaning all range?
Sure. Either 0 or (2^64 - 1) could have special meaning. That case is easy to solve. It's the first case you cannot solve only looking at dma-ranges. You would also have to look at the max RAM address. Then what do you do if dma-ranges gives you a mask greater than 2^32 and less than max RAM address?
The DMA mask is always inclusive, i.e. the boundary minus one. There should be no ambiguity here.
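(For reference, the kernel macro already encodes that: DMA_BIT_MASK(n) expands to the largest address reachable with n bits, i.e. the boundary minus one.)

	#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

	/* e.g. DMA_BIT_MASK(32) == 0xffffffff, covering 0 .. 2^32 - 1 */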
Arnd
On Wed, May 21, 2014 at 03:43:39PM +0100, Arnd Bergmann wrote:
On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
So what we need is setting the default dma mask based on the size information in dma-ranges to something like this:
mask = rounddown_pow_of_two(size) - 1;
dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */
I don't think we should be calling dma_set_mask here, since that would just go and parse the same property again once we fix the implementation.
Since this is a low-level helper, we can probably just assign the dma mask directly.
I was thinking of calling it in of_dma_configure() as that's where we already set the default dma_mask and coherent_dma_mask. Default to 32-bit if no dma-ranges are present.
Right. Actually it should also be capped to 32-bit, to allow compatibility with drivers that don't call dma_set_mask and can't do 64-bit DMA. This is the normal behavior for PCI drivers. They need to set a 64-bit mask and check the result.
OK.
The point I was trying to make is a different one: For the streaming mapping, you can allow any mask as long as the swiotlb is able to create a bounce buffer.
dma_mask is a hardware parameter and is used by the streaming DMA API implementation (which could be swiotlb) to decide whether to bounce or not. The streaming DMA API can take any CPU address as input, independent of the dma_mask, and bounce accordingly. If it doesn't have a bounce buffer matching the dma_mask, dma_supported() would return false.
This does not work for the coherent mapping, because that by definition cannot use bounce buffers.
Yes but most of the time these masks have the same value.
Let me start again with an example:
io_tlb_end = 0x1000000;    // 16 MB
dma_mask   = 0x10000000;   // 256 MB
ZONE_DMA   = 0x100000000;  // 4 GB
max_pfn    = 0x1000000000; // 64 GB
The device driver has set dma_mask and coherent_dma_mask to 256 MB because of a limitation in the hardware. This succeeded because io_tlb_end is well within the dma mask.
Since the coherent_dma_mask is smaller than max_pfn, the swiotlb version of dma_alloc_coherent then allocates the buffer using GFP_DMA. However, that likely returns an address above dma_mask/coherent_dma_mask.
swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA). If the returned address is not within coherent_dma_mask, it frees the allocated pages and tries to allocate from its bounce buffer (io_tlb).
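(Condensed, roughly what swiotlb_alloc_coherent() does; error paths dropped, and on arm64 the flags include GFP_DMA:)

	ret = (void *)__get_free_pages(flags, order);
	if (ret) {
		dev_addr = swiotlb_virt_to_bus(hwdev, ret);
		if (dev_addr + size - 1 > dma_mask) {
			/* above coherent_dma_mask: give the pages back */
			free_pages((unsigned long)ret, order);
			ret = NULL;
		}
	}
	if (!ret) {
		/* fall back to a slot in the io_tlb bounce buffer */
		phys_addr_t paddr = map_single(hwdev, 0, size, DMA_FROM_DEVICE);
		ret = phys_to_virt(paddr);
	}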
My point was that to fix this case, you must not compare the coherent_dma_mask requested by the device against io_tlb_end but against ZONE_DMA.
See above, I think swiotlb does the right thing.
We have a problem, however, with CMA, as we assume that the returned address is within coherent_dma_mask (we need a contiguous_dma_supported function, similar to swiotlb_dma_supported). We also don't limit the CMA buffer to ZONE_DMA on arm64.
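Such a helper could be as simple as the following (hypothetical sketch; contiguous_dma_supported() doesn't exist yet and dev_cma_end() is a placeholder for "physical end of the CMA area used by the device"):

	static int contiguous_dma_supported(struct device *dev, u64 mask)
	{
		/* mirror swiotlb_dma_supported(): the device must be able to
		 * address the last byte of its CMA area;
		 * dev_cma_end() is a placeholder helper */
		return phys_to_dma(dev, dev_cma_end(dev) - 1) <= mask;
	}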
As far as I understand it, you are not allowed to create a noncacheable mapping for a page that also has a cachable mapping, or did that change between armv7 and armv8?
No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what "unpredictable" means when you mix different attributes. Basically, we can mix the same memory type (Normal Cacheable vs Normal NonCacheable) but if the cacheability is different, we need to do some cache maintenance before accessing the non-cacheable mapping the first time (clear potentially dirty lines that could be evicted randomly). Any speculatively loaded cache lines via the cacheable mapping would not be hit when accessed via the non-cacheable one.
Ah, only for the first access? I had remembered this differently, thanks for the clarification.
The first time, but assuming that you no longer write to the cacheable alias to dirty it again (basically it's like the streaming DMA, if you want to access the cacheable alias you need cache maintenance).
For CMA, we use the trick to change the mapping in place, but that doesn't work for normal pages, which are mapped using huge tlbs.
That's true for arm32 and we could do the same for arm64, though I don't think it's worth it because we could break very huge mappings (e.g. 1GB). We have plenty of VA space and we just create a new non-coherent mapping (this latter part could be optimised, even generically, to use huge pages where possible).
I thought we specifically did this on arm32 to avoid the duplicate mapping because we had concluded that it would be broken even with the updated ARMv7 definition.
We did it because it was vaguely defined as "unpredictable" in the ARM ARM for a long time until the hw guys realised it's not easy to change Linux and decided to clarify what could actually happen (see A3.5.7 in the ARMv7 ARM, "Mismatched memory attributes").
But in the past we used to have coherent DMA buffers mapped as Strongly Ordered and this would be a different memory type than Normal (NonCacheable). But even in this case, the ARM ARM now states that such a mapping is permitted but the Strongly Ordered properties are not guaranteed (it could simply behave like Normal NonCacheable).
Of course, another approach would be to change the cacheability attributes of the io_tlb buffer (or the CMA one) but the current approach has the advantage that the io_tlb buffer can be used for both coherent and non-coherent devices.
That definitely shouldn't be done without performance testing. However, I wonder if such measurements were done in the first place for the swiotlb: it's possible that it's cheaper to have the bounce buffer for noncoherent devices mapped noncacheable, so we do a cache-bypassing copy into and out of the bounce buffer and save the extra flushes.
It doesn't do this AFAICT, but it's an interesting idea. Well, to be optimised when we can benchmark this on hardware.
One issue though is when some (or all) devices are coherent. In this case, even if you use a bounce buffer, it needs to be mapped coherently at the CPU level (e.g. the device has its own cache that can be snooped, so CPU accesses need to be cacheable).
Do you mean there are cases where an uncached buffer is not coherent with a device cache?
Yes, if for example you have a device (e.g. GPU, micro-controller) on a coherent interconnect, non-cacheable CPU accesses are not guaranteed to (probably should not) hit into the other device's cache (it could be seen as a side-effect of allowing cacheable vs non-cacheable aliases).
On Wednesday 21 May 2014 16:32:16 Catalin Marinas wrote:
On Wed, May 21, 2014 at 03:43:39PM +0100, Arnd Bergmann wrote:
On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
The point I was trying to make is a different one: For the streaming mapping, you can allow any mask as long as the swiotlb is able to create a bounce buffer.
dma_mask is a hardware parameter and is used by the streaming DMA API implementation (which could be swiotlb) to decide whether to bounce or not. The streaming DMA API can take any CPU address as input, independent of the dma_mask, and bounce accordingly. If it doesn't have a bounce buffer matching the dma_mask, dma_supported() would return false.
This does not work for the coherent mapping, because that by definition cannot use bounce buffers.
Yes but most of the time these masks have the same value.
Let me start again with an example:
io_tlb_end = 0x1000000;    // 16 MB
dma_mask   = 0x10000000;   // 256 MB
ZONE_DMA   = 0x100000000;  // 4 GB
max_pfn    = 0x1000000000; // 64 GB
The device driver has set dma_mask and coherent_dma_mask to 256 MB because of a limitation in the hardware. This succeeded because io_tlb_end is well within the dma mask.
Since the coherent_dma_mask is smaller than max_pfn, the swiotlb version of dma_alloc_coherent then allocates the buffer using GFP_DMA. However, that likely returns an address above dma_mask/coherent_dma_mask.
swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA). If the returned address is not within coherent_dma_mask, it frees the allocated pages and tries to allocate from its bounce buffer (io_tlb).
Ok, got it now. Thanks for your patience. I suppose this was done to maintain the API guarantee that dma_set_coherent_mask(dev, dma_get_mask(dev)) always succeeds.
My point was that to fix this case, you must not compare the coherent_dma_mask requested by the device against io_tlb_end but against ZONE_DMA.
See above, I think swiotlb does the right thing.
We have a problem, however, with CMA, as we assume that the returned address is within coherent_dma_mask (we need a contiguous_dma_supported function, similar to swiotlb_dma_supported). We also don't limit the CMA buffer to ZONE_DMA on arm64.
Right. I guess the answer to that is to teach CMA to honor GFP_DMA. This was probably never needed on ARM32 but is an obvious extension.
As far as I understand it, you are not allowed to create a noncacheable mapping for a page that also has a cachable mapping, or did that change between armv7 and armv8?
No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what "unpredictable" means when you mix different attributes. Basically, we can mix the same memory type (Normal Cacheable vs Normal NonCacheable) but if the cacheability is different, we need to do some cache maintenance before accessing the non-cacheable mapping the first time (clear potentially dirty lines that could be evicted randomly). Any speculatively loaded cache lines via the cacheable mapping would not be hit when accessed via the non-cacheable one.
Ah, only for the first access? I had remembered this differently, thanks for the clarification.
The first time, but assuming that you no longer write to the cacheable alias to dirty it again (basically it's like the streaming DMA, if you want to access the cacheable alias you need cache maintenance).
Ok.
For CMA, we use the trick to change the mapping in place, but that doesn't work for normal pages, which are mapped using huge tlbs.
That's true for arm32 and we could do the same for arm64, though I don't think it's worth it because we could break very huge mappings (e.g. 1GB). We have plenty of VA space and we just create a new non-coherent mapping (this latter part could be optimised, even generically, to use huge pages where possible).
I thought we specifically did this on arm32 to avoid the duplicate mapping because we had concluded that it would be broken even with the updated ARMv7 definition.
We did it because it was vaguely defined as "unpredictable" in the ARM ARM for a long time until the hw guys realised it's not easy to change Linux and decided to clarify what could actually happen (see A3.5.7 in the ARMv7 ARM, "Mismatched memory attributes").
But in the past we used to have coherent DMA buffers mapped as Strongly Ordered and this would be a different memory type than Normal (NonCacheable). But even in this case, the ARM ARM now states that such a mapping is permitted but the Strongly Ordered properties are not guaranteed (it could simply behave like Normal NonCacheable).
Right.
Of course, another approach would be to change the cacheability attributes of the io_tlb buffer (or the CMA one) but the current approach has the advantage that the io_tlb buffer can be used for both coherent and non-coherent devices.
That definitely shouldn't be done without performance testing. However, I wonder if such measurements were done in the first place for the swiotlb: it's possible that it's cheaper to have the bounce buffer for noncoherent devices mapped noncacheable, so we do a cache-bypassing copy into and out of the bounce buffer and save the extra flushes.
It doesn't do this AFAICT, but it's an interesting idea. Well, to be optimised when we can benchmark this on hardware.
One issue though is when some (or all) devices are coherent. In this case, even if you use a bounce buffer, it needs to be mapped coherently at the CPU level (e.g. the device has its own cache that can be snooped, so CPU accesses need to be cacheable).
Do you mean there are cases where an uncached buffer is not coherent with a device cache?
Yes, if for example you have a device (e.g. GPU, micro-controller) on a coherent interconnect, non-cacheable CPU accesses are not guaranteed to (probably should not) hit into the other device's cache (it could be seen as a side-effect of allowing cacheable vs non-cacheable aliases).
Right, makes sense. So if we do this trick, it should only be done for devices that are not coherent to start with.
Arnd
On Mon, May 19, 2014 at 10:56 AM, Catalin Marinas catalin.marinas@arm.com wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver. Depending on the DT properties of the device and its parents, this needs to do one of three things:
a) translate the 64-bit virtual address into a 64-bit bus address
b) create an IOMMU entry for the 64-bit address and pass the 32-bit IOMMU address to the driver
c) use the swiotlb code to create a bounce buffer at a 32-bit DMA address and copy the data around
It's definitely wrong to just hardcode a DMA mask in the driver because that code doesn't know which of the three cases is being used. Moreover, you can't do it using an #ifdef CONFIG_ARM64, because it's completely independent of the architecture, and we need to do the exact same logic on ARM32 and any other architecture.
I agree.
The problem we currently have is describing the system topology so that the DMA mask can be passed in a hierarchical way. I can see Santosh's patches introducing dma-ranges, but the coherent dma mask is still set to 32-bit. We can use the dma-ranges to infer a mask, but that's only specific to the device, and the driver doesn't know whether it goes through an iommu or not.
We definitely have to fix this very quickly, before people start building real arm64 systems and shipping them.
We should not merge any hacks that attempt to work around the problem, but try to come to a conclusion how to handle them properly. My hope was that we could just always set the dma mask to whatever the DT says it should be to keep the burden from device drivers, unless they want to restrict it further (e.g. when the specific peripheral hardware has a bug that prevents us from using high addresses, even though the SoC in theory supports it everywhere).
I agree.
Rob Herring argued that we should always mimic PCI and call dma_set_mask() in drivers but default to a 32-bit mask otherwise, independent of whether the hardware can do more or less than that, IIRC.
Can we not default to something built up from dma-ranges? Or 32-bit if dma-ranges property is missing?
My reasoning was the information may not come from DT. For AHCI, the 32 vs. 64 bit capability is in a capability register and the DMA mask is set based on that. This information only exists within the driver.
I think some sort of capability merging is needed here to merge a bus mask with a device mask. The current api doesn't seem to support this as adjusting the set mask is expected to fail if the requested mask is not supported. Here is how I see it working. Devices can have a default mask (32-bit typically). The default is simply the common case. Really the mask should only be set for DMA capable devices, but many drivers fail to set the mask. Devices can then enlarge or shrink what they support based on knowing the IP block features or getting capabilities from the IP block such as the AHCI case. This mask then needs to be adjusted (only shrinking) if the parent bus or platform has restrictions which is the part that should come from dma-ranges.
Keep in mind that dma-ranges is defined as a bus property, not a property of a device. I don't have any issue with allowing it in a device, but I don't think it should be required and am not yet convinced it is needed.
Rob
On Monday 19 May 2014 16:42:18 Rob Herring wrote:
On Mon, May 19, 2014 at 10:56 AM, Catalin Marinas catalin.marinas@arm.com wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
We definitely have to fix this very quickly, before people start building real arm64 systems and shipping them.
We should not merge any hacks that attempt to work around the problem, but try to come to a conclusion how to handle them properly. My hope was that we could just always set the dma mask to whatever the DT says it should be to keep the burden from device drivers, unless they want to restrict it further (e.g. when the specific peripheral hardware has a bug that prevents us from using high addresses, even though the SoC in theory supports it everywhere).
I agree.
Rob Herring argued that we should always mimic PCI and call dma_set_mask() in drivers but default to a 32-bit mask otherwise, independent of whether the hardware can do more or less than that, IIRC.
Can we not default to something built up from dma-ranges? Or 32-bit if dma-ranges property is missing?
My reasoning was the information may not come from DT. For AHCI, the 32 vs. 64 bit capability is in a capability register and the DMA mask is set based on that. This information only exists within the driver.
I think some sort of capability merging is needed here to merge a bus mask with a device mask. The current api doesn't seem to support this as adjusting the set mask is expected to fail if the requested mask is not supported. Here is how I see it working. Devices can have a default mask (32-bit typically). The default is simply the common case. Really the mask should only be set for DMA capable devices, but many drivers fail to set the mask. Devices can then enlarge or shrink what they support based on knowing the IP block features or getting capabilities from the IP block such as the AHCI case. This mask then needs to be adjusted (only shrinking) if the parent bus or platform has restrictions which is the part that should come from dma-ranges.
Where do you think the shrinking should be done then? I'd like to see that as part of the initial bus probe that Santosh just implemented to set the offset. I think it would be reasonable to apply whatever the bus can do there, but limit it to 32-bit.
Keep in mind that dma-ranges is defined as a bus property, not a property of a device. I don't have any issue with allowing it in a device, but I don't think it should be required and am not yet convinced it is needed.
Right.
Arnd
On Tue, May 20, 2014 at 3:01 AM, Arnd Bergmann arnd@arndb.de wrote:
On Monday 19 May 2014 16:42:18 Rob Herring wrote:
On Mon, May 19, 2014 at 10:56 AM, Catalin Marinas catalin.marinas@arm.com wrote:
On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
On Mon, May 19, 2014 at 09:32:43AM +0100, Arnd Bergmann wrote:
We definitely have to fix this very quickly, before people start building real arm64 systems and shipping them.
We should not merge any hacks that attempt to work around the problem, but try to come to a conclusion how to handle them properly. My hope was that we could just always set the dma mask to whatever the DT says it should be to keep the burden from device drivers, unless they want to restrict it further (e.g. when the specific peripheral hardware has a bug that prevents us from using high addresses, even though the SoC in theory supports it everywhere).
I agree.
Rob Herring argued that we should always mimic PCI and call dma_set_mask() in drivers but default to a 32-bit mask otherwise, independent of whether the hardware can do more or less than that, IIRC.
Can we not default to something built up from dma-ranges? Or 32-bit if dma-ranges property is missing?
My reasoning was the information may not come from DT. For AHCI, the 32 vs. 64 bit capability is in a capability register and the DMA mask is set based on that. This information only exists within the driver.
I think some sort of capability merging is needed here to merge a bus mask with a device mask. The current api doesn't seem to support this as adjusting the set mask is expected to fail if the requested mask is not supported. Here is how I see it working. Devices can have a default mask (32-bit typically). The default is simply the common case. Really the mask should only be set for DMA capable devices, but many drivers fail to set the mask. Devices can then enlarge or shrink what they support based on knowing the IP block features or getting capabilities from the IP block such as the AHCI case. This mask then needs to be adjusted (only shrinking) if the parent bus or platform has restrictions which is the part that should come from dma-ranges.
Where do you think the shrinking should be done then? I'd like to see that as part of the initial bus probe that Santosh just implemented to set the offset. I think it would be reasonable to apply whatever the bus can do there, but limit it to 32-bit.
Before probe, we can only set a default unless we add a bus_mask or dma_mask from a parent device (i.e. a bus device). After driver probe would be too late as DMA allocations could occur during probe. So that basically leaves it to driver calls to dma_set_mask_and_coherent/dma_set_mask. We could do something like this for .set_dma_mask:
bus_mask = of_get_dma_ranges_mask();
if (bus_mask)
	mask &= bus_mask;

*dev->dma_mask = mask;
There is an issue in how dma_set_coherent_mask works though. By default it can only check if the mask is valid, but cannot adjust the mask. As long as drivers use dma_set_mask or dma_set_mask_and_coherent, the above should work.
This also adds an OF dependency into the DMA mapping code which I don't like too much. The alternative is storing the bus mask somewhere in struct device.
Rob
On Mon, 19 May 2014, Arnd Bergmann wrote:
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit. Which kernel version is this?
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver.
I disagree. That is, the question about dma_map_sg is not more important (for ehci-hcd) than the question about dma_alloc_coherent.
In this case it is particularly tricky. The driver calls dma_pool_create() several times in addition to calling dma_alloc_coherent(), and the hardware requires that all of those pools plus the coherent buffer have DMA addresses with the same upper 32 bits.
As far as I know, the only way to enforce that is by requiring all those items either to be allocated in or to be mapped to the first 4 GB of memory.
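In practice that means the mask has to be set before any of the pools or the coherent buffer are allocated, roughly like this (sketch; qh_size and frame_list_size are placeholders for the real sizes):

	struct dma_pool *qh_pool;
	void *periodic;
	dma_addr_t periodic_dma;

	/* keep every coherent allocation below 4 GB so the upper 32 bits
	 * are identical for all of them (EHCI segment register) */
	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)))
		return -ENODEV;

	/* both the QH pool and the periodic frame list now come from the
	 * first 4 GB */
	qh_pool = dma_pool_create("ehci_qh", dev, qh_size, 32, 0);
	periodic = dma_alloc_coherent(dev, frame_list_size,
				      &periodic_dma, GFP_KERNEL);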
Alan Stern
On Monday 19 May 2014 10:16:57 Alan Stern wrote:
On Mon, 19 May 2014, Arnd Bergmann wrote:
dma_alloc_coherent() on arm64 should return 32-bit addresses if the coherent_dma_mask is set to 32-bit. Which kernel version is this?
The more important question is what happens to high buffers allocated elsewhere that get passed into dma_map_sg by a device driver.
I disagree. That is, the question about dma_map_sg is not more important (for ehci-hcd) than the question about dma_alloc_coherent.
In this case it is particularly tricky. The driver calls dma_pool_create() several times in addition to calling dma_alloc_coherent(), and the hardware requires that all of those pools plus the coherent buffer have DMA addresses with the same upper 32 bits.
As far as I know, the only way to enforce that is by requiring all those items either to be allocated in or to be mapped to the first 4 GB of memory.
But that part is easy to enforce by using a 32-bit dma mask. For dma_map_*, the problem is that we can't possibly enforce the location of the buffer; you have to fall back to swiotlb bounce buffers for those.
In a lot of cases, tests don't even run into the problem because the block layer and the network layer both have their own way to deal with bounce buffers, at least on 32-bit systems, but you still run into the issue with less common drivers.
Arnd