Setting the "Double linefill enable" bit improves memcpy performance from ~750 MB/s to ~1150 MB/s when working with large buffers, and also helps anything else that needs good memory bandwidth (for example, software rendered graphics).
Additionally setting the "Double linefill on WRAP read disable" bit compensates for most of the resulting increase in random access latency.
Signed-off-by: Siarhei Siamashka siarhei.siamashka@gmail.com
---
 arch/arm/mach-exynos4/cpu.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/arm/mach-exynos4/cpu.c b/arch/arm/mach-exynos4/cpu.c
index ba503c3..1afd25f 100644
--- a/arch/arm/mach-exynos4/cpu.c
+++ b/arch/arm/mach-exynos4/cpu.c
@@ -238,7 +238,7 @@ static int __init exynos4_l2x0_cache_init(void)
 	__raw_writel(0x110, S5P_VA_L2CC + L2X0_DATA_LATENCY_CTRL);
 
 	/* L2X0 Prefetch Control */
-	__raw_writel(0x30000007, S5P_VA_L2CC + L2X0_PREFETCH_CTRL);
+	__raw_writel(0x78000007, S5P_VA_L2CC + L2X0_PREFETCH_CTRL);
 
 	/* L2X0 Power Control */
 	__raw_writel(L2X0_DYNAMIC_CLK_GATING_EN | L2X0_STNDBY_MODE_EN,
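For readers decoding the two register values in the patch, here is a minimal sketch. It assumes the L2C-310 Prefetch Control Register layout from the ARM TRM (bit 30 double linefill enable, bit 29 instruction prefetch enable, bit 28 data prefetch enable, bit 27 double linefill on WRAP read disable, bits [4:0] prefetch offset). The macro and function names are made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; bit positions are assumed from the L2C-310 TRM. */
#define PF_DOUBLE_LINEFILL_EN    (1u << 30)
#define PF_INSTR_PREFETCH_EN     (1u << 29)
#define PF_DATA_PREFETCH_EN      (1u << 28)
#define PF_DLF_ON_WRAP_READ_DIS  (1u << 27)
#define PF_OFFSET_MASK           0x1fu

/* Combine feature flags and a prefetch offset into one register value. */
static inline uint32_t pf_ctrl_value(uint32_t flags, uint32_t offset)
{
	assert(offset <= PF_OFFSET_MASK);
	return flags | offset;
}

/* Bits that differ between the old and the new register value. */
static inline uint32_t pf_ctrl_changed_bits(uint32_t old_val, uint32_t new_val)
{
	return old_val ^ new_val;
}
```

Under that layout, the old value 0x30000007 is instruction prefetch plus data prefetch with offset 7, and the patch changes only bits 30 and 27.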
Hi Siarhei,
Interesting feature, and it's not a Samsung SoC issue, so I've added the ARM mailing list. I checked it and see the read performance improve from 868 MiB/s to 981 MiB/s with lmbench. It would be helpful to test other SoCs, e.g., OMAP4, STE and so on.
BTW, why do you set bit 27? In my PL310 spec, it's a reserved bit and should be zero (SBZ).
Thank you, Kyungmin Park
On Wednesday 14 September 2011 11:38 AM, Kyungmin Park wrote:
> BTW, why do you set bit 27? In my PL310 spec, it's a reserved bit and should be zero (SBZ).
That's because not all PL310 versions support double linefill.
Regards,
Santosh
On Wed, Sep 14, 2011 at 9:08 AM, Kyungmin Park kmpark@infradead.org wrote:
> Hi Siarhei,
> Interesting feature, and it's not a Samsung SoC issue, so I've added the ARM mailing list. I checked it and see the read performance improve from 868 MiB/s to 981 MiB/s with lmbench.
Maybe lmbench does not try very hard to get the best out of the hardware? On my origenboard, I'm getting ~1.15GB/s performance for the standard LDM/STM based memcpy from libc-ports, which is ~2.3GB/s memory bandwidth if both reads and writes are accounted separately.
> It would be helpful to test other SoCs, e.g., OMAP4, STE and so on.
The current (?) state of the support for this feature in OMAP4 is explained here by Richard Woodruff: http://groups.google.com/group/pandaboard/msg/dfd2d2e1336d435b
> BTW, why do you set bit 27? In my PL310 spec, it's a reserved bit and should be zero (SBZ).
This PL310 thing seems to have been renamed to "CoreLink Level 2 Cache Controller L2C-310" in later revisions, and its Prefetch Control Register is described here: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/CHDHIECI.html
Sorry for the confusing subject.
Regarding bit 27 ('Double linefill on WRAP read disable'), it seems to reduce the impact of enabling double linefill on the random access latency as measured by my self-written simple memory benchmark program: http://github.com/downloads/ssvb/ssvb-membench/ssvb-membench-0.1.tar.gz
--
Best regards,
Siarhei Siamashka
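The memcpy figures quoted in this message can be reproduced in spirit with a few lines of C. This is only a rough sketch and not the ssvb-membench code: the function names are hypothetical, 1 MB is assumed to mean 10^6 bytes, and a plain libc memcpy() is timed with clock_gettime(). The second helper spells out the "reads and writes accounted separately" arithmetic: every copied byte is read once and written once.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy rate in MB/s (assuming 1 MB = 10^6 bytes). */
static inline double copy_rate_mbs(double bytes, double seconds)
{
	assert(seconds > 0.0);
	return bytes / seconds / 1e6;
}

/* memcpy reads each byte once and writes it once: traffic is 2x the copy rate. */
static inline double bus_traffic_mbs(double copy_rate)
{
	return 2.0 * copy_rate;
}

/* Time `reps` copies of a `size`-byte buffer; returns MB/s, or -1.0 on failure. */
static inline double memcpy_rate_mbs(size_t size, int reps)
{
	struct timespec t0, t1;
	volatile char sink = 0;
	char *src, *dst;
	double secs;
	int i;

	if (size == 0 || reps <= 0)
		return -1.0;
	src = malloc(size);
	dst = malloc(size);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}
	memset(src, 1, size);	/* touch the pages before timing */
	memset(dst, 0, size);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < reps; i++) {
		memcpy(dst, src, size);
		sink = dst[size - 1];	/* read the result so the copy is not dropped */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sink;
	secs = (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	free(src);
	free(dst);
	return secs > 0.0 ? copy_rate_mbs((double)size * reps, secs) : -1.0;
}
```

For large-buffer numbers the buffer size should be well above the L2 cache size, for example memcpy_rate_mbs(16u << 20, 32).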
On Wed, Sep 14, 2011 at 4:43 PM, Siarhei Siamashka siarhei.siamashka@gmail.com wrote:
> This PL310 thing seems to have been renamed to "CoreLink Level 2 Cache Controller L2C-310" in later revisions, and its Prefetch Control Register is described here: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/CHDHIECI.html
Thanks for the link. It has a description of bit 27, but is that description correct for the PL310 in exynos4? I mean, the PL310 TRM I received with the exynos4 chip has no description of bit 27; it's just a reserved bit. Can bit 27 be enabled on exynos4210, or can it only be used on exynos4212 or later?
Thank you, Kyungmin Park
On Wed, Sep 14, 2011 at 10:57 AM, Kyungmin Park kmpark@infradead.org wrote:
> Thanks for the link. It has a description of bit 27, but is that description correct for the PL310 in exynos4? I mean, the PL310 TRM I received with the exynos4 chip has no description of bit 27; it's just a reserved bit. Can bit 27 be enabled on exynos4210, or can it only be used on exynos4212 or later?
That's a good point. I think it is exynos4210 that is used in origenboard. And according to the value in the Cache ID Register (0x4100c4c5), it has the r3p0 revision of L2C-310, which means that the Prefetch Control Register is actually described at: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246d/CHDHIECI.html
And bit 27 is indeed reserved. However, flipping it seems to have some measurable impact on performance (unless I screwed up the benchmarks), so maybe it does something but is undocumented? In any case, I agree that it's better not to mess with this bit.
By the way, does anybody have the L2C-310 errata list? Is double linefill actually safe to use in r3p0?
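The Cache ID value mentioned above can be split into its fields as follows. This is a sketch that assumes the field layout from the L2C-310 TRM (implementer in bits [31:24], cache ID in [15:10], part number in [9:6], RTL release in [5:0]); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the Cache ID Register fields. */
struct l2c_cache_id {
	uint8_t implementer;	/* bits [31:24], 0x41 is ARM */
	uint8_t cache_id;	/* bits [15:10] */
	uint8_t part;		/* bits [9:6] */
	uint8_t rtl_release;	/* bits [5:0] */
};

/* Split a raw Cache ID Register value into its fields. */
static inline struct l2c_cache_id l2c_decode_cache_id(uint32_t reg)
{
	struct l2c_cache_id id;

	id.implementer = (uint8_t)((reg >> 24) & 0xffu);
	id.cache_id = (uint8_t)((reg >> 10) & 0x3fu);
	id.part = (uint8_t)((reg >> 6) & 0xfu);
	id.rtl_release = (uint8_t)(reg & 0x3fu);
	return id;
}
```

For 0x4100c4c5 this yields implementer 0x41, part number 0x3 and RTL release 0x5, the release that this message reads as r3p0.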
Siarhei Siamashka wrote:
> That's a good point. I think it is exynos4210 that is used in origenboard. And according to the value in the Cache ID Register (0x4100c4c5), it has the r3p0 revision of L2C-310, which means that the Prefetch Control Register is actually described at: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246d/CHDHIECI.html
> And bit 27 is indeed reserved. However, flipping it seems to have some measurable impact on performance (unless I screwed up the benchmarks), so maybe it does something but is undocumented? In any case, I agree that it's better not to mess with this bit.
Hi all,
Please add me in Cc for Samsung stuff...
> By the way, does anybody have the L2C-310 errata list? Is double linefill actually safe to use in r3p0?
No, it is _not_ safe on EXYNOS4210.
Because of an L2C-310 erratum, the current EXYNOS4210 cannot enable the double linefill feature and, as Siarhei said, the L2C-310 version in the Cache ID register needs to be checked before enabling it. As a note, it's possible to enable it on the EXYNOS4212 SoC and, contrary to Siarhei's patch, enabling WRAP read is better on it. Actually my colleague, Boojin Kim, is testing it, so we can submit it soon.
Thanks.
Best regards,
Kgene.
--
Kukjin Kim kgene.kim@samsung.com, Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.
On Wed, Sep 14, 2011 at 2:23 PM, Kukjin Kim kgene.kim@samsung.com wrote:
> Siarhei Siamashka wrote:
>> By the way, does anybody have the L2C-310 errata list? Is double linefill actually safe to use in r3p0?
> No, it is _not_ safe on EXYNOS4210.
> Because of an L2C-310 erratum, the current EXYNOS4210 cannot enable the double linefill feature
Thanks for this information. It's a pity, because double linefill could provide a really serious memory performance boost. Looks like we have to wait for EXYNOS4212 and/or OMAP4460 to really see how Cortex-A9 is actually supposed to perform on memory intensive tasks.
However I really appreciate that with EXYNOS4210 you are not shoving some hardcoded configuration down our throats and not restricting access to the relevant Cortex-A9 and L2C-310 configuration registers. So it is still possible to temporarily enable double linefill and use origenboard for benchmarking purposes to estimate how EXYNOS4212 is going to perform when it becomes available.
> and, as Siarhei said, the L2C-310 version in the Cache ID register needs to be checked before enabling it.
If EXYNOS4212 has a bugfree double linefill support, then enabling it based on checking L2C-310 revision looks like a good idea.
> As a note, it's possible to enable it on the EXYNOS4212 SoC and, contrary to Siarhei's patch, enabling WRAP read is better on it. Actually my colleague, Boojin Kim, is testing it, so we can submit it soon.
If you have some benchmark results with all these options, they would be very interesting for me.
As for the general memory performance tuning, there are more things to try (carefully watching for possible errata):
- SCU Speculative linefills enable bit in SCU Control Register as described in http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407f/BABEBFBH.html (this seems to be a good tweak and it really reduces L2 access latency a bit in my tests)
- Exclusive cache configuration (should increase effective L1/L2 cache size, but seems to make L2 cache access latency worse in my tests)
- Tune L2C-310 Prefetch offset (without double linefill, the value 6 or even 5 seems to be a bit better than 7)
- 'Alloc in one way', 'Write full line of zeros mode' and maybe something else
Thank you for your replies and the interest in this subject.
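Two of the ideas discussed in this thread, enabling double linefill only on a sufficiently new L2C-310 and changing the prefetch offset, could be combined roughly as below. This is a sketch under assumptions, not a tested kernel change: the names are hypothetical, the bit positions are taken from the L2C-310 TRM, RTL release codes are assumed to increase with newer revisions, and `min_safe_rtl` is a placeholder that would have to come from the errata list, since the thread does not say which release is fixed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; bit positions are assumed from the L2C-310 TRM. */
#define PF_DOUBLE_LINEFILL_EN  (1u << 30)
#define PF_OFFSET_MASK         0x1fu
#define CACHE_ID_RTL_MASK      0x3fu

/*
 * Return a new Prefetch Control value with the given prefetch offset.
 * Double linefill is enabled only if the RTL release read from the
 * Cache ID Register is at least min_safe_rtl (placeholder threshold),
 * and is cleared otherwise.
 */
static inline uint32_t pf_ctrl_tune(uint32_t pf_ctrl, uint32_t cache_id,
				    uint32_t min_safe_rtl, uint32_t offset)
{
	assert(offset <= PF_OFFSET_MASK);
	pf_ctrl = (pf_ctrl & ~PF_OFFSET_MASK) | offset;
	if ((cache_id & CACHE_ID_RTL_MASK) >= min_safe_rtl)
		pf_ctrl |= PF_DOUBLE_LINEFILL_EN;
	else
		pf_ctrl &= ~PF_DOUBLE_LINEFILL_EN;
	return pf_ctrl;
}
```

With the origenboard's Cache ID 0x4100c4c5 and an arbitrary threshold of 0x6, pf_ctrl_tune(0x30000007, 0x4100c4c5, 0x6, 6) leaves double linefill off and only lowers the offset to 6.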