After llvm commit 75127bce6de78b83b70b898a04473f213451f13e
Author: Qiongsi Wu <qwu@ibm.com>
[AIX][ZOS] Excluding merge-objc-interface.m from Tests
the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%):
- 433.milc:[.] mult_su3_mat_vec slowed down by 16% from 1615 to 1871 perf samples
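For reference, per-function sample counts like the above come from Linux perf profiles of the benchmark run. A minimal sketch of how to gather comparable numbers yourself (binary and input names are illustrative, not the CI's exact invocation):
<cut>
# Record a profile of the benchmark run, then print per-symbol sample
# counts (-n) to compare against the numbers quoted above.
perf record -o perf.data ./milc < su3imp.in
perf report -n --stdio -i perf.data | grep mult_su3_mat_vec
</cut>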
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
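A sketch of one way to use these tarballs (paths and file names are illustrative; the pre-processed file containing mult_su3_mat_vec depends on the tarball layout):
<cut>
# Compile the same pre-processed source with both toolchains and diff
# the generated assembly for the regressed function.
first_bad/bin/clang -O3 --target=aarch64-linux-gnu -S -o first_bad.s mult_su3_mat_vec.i
last_good/bin/clang -O3 --target=aarch64-linux-gnu -S -o last_good.s mult_su3_mat_vec.i
diff -u last_good.s first_bad.s
</cut>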
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-75127bce6de78b83b70b898a04473f213451f13e
cd investigate-llvm-75127bce6de78b83b70b898a04473f213451f13e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 75127bce6de78b83b70b898a04473f213451f13e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d01ae990e1fd6561ed86dc8004a7147dd09fb13c
../artifacts/test.sh
cd ..
</cut>
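The two checkouts above can also be driven automatically. A minimal sketch, assuming artifacts/test.sh exits non-zero for bad builds (which is how the CI's own bisection uses it):
<cut>
# From inside the llvm checkout: let git bisect run the test script
# between the last_good and first_bad commits.
cd llvm
git bisect start 75127bce6de78b83b70b898a04473f213451f13e d01ae990e1fd6561ed86dc8004a7147dd09fb13c
git bisect run ../artifacts/test.sh
git bisect reset
cd ..
</cut>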
Full commit (up to 1000 lines):
<cut>
commit 75127bce6de78b83b70b898a04473f213451f13e
Author: Qiongsi Wu <qwu@ibm.com>
Date: Fri Oct 8 13:58:32 2021 +0000
[AIX][ZOS] Excluding merge-objc-interface.m from Tests
Objective C is not supported on AIX or ZOS. This patch excludes the newly added `clang/test/Modules/merge-objc-interface.m` (added by https://reviews.llvm.org/D110280) from AIX and ZOS testing.
Many existing tests are already disabled by https://reviews.llvm.org/D109060.
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D111406
---
clang/test/Modules/merge-objc-interface.m | 1 +
1 file changed, 1 insertion(+)
diff --git a/clang/test/Modules/merge-objc-interface.m b/clang/test/Modules/merge-objc-interface.m
index fba06294a26a..f62f541c1a29 100644
--- a/clang/test/Modules/merge-objc-interface.m
+++ b/clang/test/Modules/merge-objc-interface.m
@@ -1,3 +1,4 @@
+// UNSUPPORTED: -zos, -aix
// RUN: rm -rf %t
// RUN: split-file %s %t
// RUN: %clang_cc1 -emit-llvm -o %t/test.bc -F%t/Frameworks %t/test.m \
</cut>
After llvm commit 483db1c706864d0940206228dfe64bdcd17faa4e
Author: Muhammad Omair Javaid <omair.javaid@linaro.org>
[LLDB] Remove xfail decorator TestInferiorAssert.py AArch64/Linux
the following benchmarks slowed down by more than 2%:
- 433.milc slowed down by 4% from 13309 to 13838 perf samples
- 433.milc:[.] mult_su3_mat_vec slowed down by 17% from 2058 to 2409 perf samples
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-483db1c706864d0940206228dfe64bdcd17faa4e
cd investigate-llvm-483db1c706864d0940206228dfe64bdcd17faa4e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 483db1c706864d0940206228dfe64bdcd17faa4e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d11ec6f67e45c630ab87bfb6010dcc93e89542fc
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 483db1c706864d0940206228dfe64bdcd17faa4e
Author: Muhammad Omair Javaid <omair.javaid@linaro.org>
Date: Mon Oct 11 14:34:41 2021 +0500
[LLDB] Remove xfail decorator TestInferiorAssert.py AArch64/Linux
TestInferiorAssert.py test_inferior_asserting_disassemble passes after
upgrading LLDB AArch64/Linux buildbot to Ubuntu Focal.
---
lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py b/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
index c533a1e29a12..5ac4eeb0514a 100644
--- a/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
+++ b/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
@@ -45,9 +45,7 @@ class AssertingInferiorTestCase(TestBase):
bugnumber="llvm.org/pr21793: need to implement support for detecting assertion / abort on Windows")
@expectedFailureAll(
oslist=["linux"],
- archs=[
- "aarch64",
- "arm"],
+ archs=["arm"],
triple=no_match(".*-android"),
bugnumber="llvm.org/pr25338")
@expectedFailureAll(bugnumber="llvm.org/pr26592", triple='^mips')
</cut>
After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh@redhat.com>
Avoid invalid loop transformations in jump threading registry.
the following benchmarks slowed down by more than 2%:
- 471.omnetpp slowed down by 8% from 6348 to 6828 perf samples
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from tip of trunk
- Target: arm-linux-gnueabihf
- Compiler flags: -O3 -marm
- Hardware: NVidia TK1 4x Cortex-A15
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Reproduce builds:
<cut>
mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh@redhat.com>
Date: Thu Sep 23 10:59:24 2021 +0200
Avoid invalid loop transformations in jump threading registry.
My upcoming improvements to the forward jump threader make it thread
more aggressively. In investigating some "regressions", I noticed
that it has always allowed threading through empty latches and across
loop boundaries. As we have discussed recently, this should be avoided
until after loop optimizations have run their course.
Note that this wasn't much of a problem before because DOM/VRP
couldn't find these opportunities, but with a smarter solver, we trip
over them more easily.
Because the forward threader doesn't have an independent localized cost
model like the new threader (profitable_path_p), it is difficult to
catch these things at discovery. However, we can catch them at
registration time, with the added benefit that all the threaders
(forward and backward) can share the handcuffs.
This patch is an adaptation of what we do in the backward threader, but
it is not meant to catch everything we do there, as some of the
restrictions there are due to limitations of the different block
copiers (for example, the generic copier does not re-use existing
threading paths).
We could ideally remove the now redundant bits in profitable_path_p, but
I would prefer not to for two reasons. First, the backward threader uses
profitable_path_p as it discovers paths to avoid discovering paths in
unprofitable directions. Second, I would like to merge all the forward
cost restrictions into the profitability class in the backward threader,
not the other way around. Alas, that reshuffling will have to wait for
the next release.
As usual, there are quite a few tests that needed adjustments. It seems
we were quite happily threading improper scenarios. With most of them,
as can be seen in pr77445-2.c, we're merely shifting the threading to
after loop optimizations.
Tested on x86-64 Linux.
gcc/ChangeLog:
* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
New.
(jt_path_registry::register_jump_thread): Call
cancel_invalid_paths.
* tree-ssa-threadupdate.h (class jt_path_registry): Add
cancel_invalid_paths.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/20030714-2.c: Adjust.
* gcc.dg/tree-ssa/pr66752-3.c: Adjust.
* gcc.dg/tree-ssa/pr77445-2.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
* gcc.dg/vect/bb-slp-16.c: Adjust.
---
gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c | 7 ++-
gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c | 19 ++++---
gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 4 +-
gcc/testsuite/gcc.dg/vect/bb-slp-16.c | 7 ---
gcc/tree-ssa-threadupdate.c | 67 ++++++++++++++++++-----
gcc/tree-ssa-threadupdate.h | 1 +
8 files changed, 78 insertions(+), 35 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
index eb663f2ff5b..9585ff11307 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
@@ -32,7 +32,8 @@ get_alias_set (t)
}
}
-/* There should be exactly three IF conditionals if we thread jumps
- properly. */
-/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */
+/* There should be exactly 4 IF conditionals if we thread jumps
+ properly. There used to be 3, but one thread was crossing
+ loops. */
+/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index e1464e21170..922a331b217 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */
extern int status, pt;
extern int count;
@@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a)
pt--;
}
-/* There are 4 jump threading opportunities, all of which will be
- realized, which will eliminate testing of FLAG, completely. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */
+/* There are 2 jump threading opportunities (which don't cross loops),
+ all of which will be realized, which will eliminate testing of
+ FLAG, completely. */
+/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */
-/* There should be no assignments or references to FLAG, verify they're
- eliminated as early as possible. */
-/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */
+/* We used to remove references to FLAG by DCE2, but this was
+ depending on early threaders threading through loop boundaries
+ (which we shouldn't do). However, the late threading passes, which
+ run after loop optimizations , can successfully eliminate the
+ references to FLAG. Verify that ther are no references by the late
+ threading passes. */
+/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index f9fc212f49e..01a0f1f197d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) {
aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
to change decisions in switch expansion which in turn can expose new
jump threading opportunities. Skip the later tests on aarch64. */
-/* { dg-final { scan-tree-dump "Jumps threaded: 1\[1-9\]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
index 60d4f76f076..2d78d045516 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
@@ -21,5 +21,7 @@
condition.
All the cases are picked up by VRP1 as jump threads. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
+
+/* There used to be 6 jump threads found by thread1, but they all
+ depended on threading through distinct loops in ethread. */
/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index e3d4b311c03..16abcde5053 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,8 @@
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */
/* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index 664e93e9b60..e68a9b62535 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -1,8 +1,5 @@
/* { dg-require-effective-target vect_int } */
-/* See note below as to why we disable threading. */
-/* { dg-additional-options "-fdisable-tree-thread1" } */
-
#include <stdarg.h>
#include "tree-vect.h"
@@ -30,10 +27,6 @@ main1 (int dummy)
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
- /* In some architectures like ppc64, jump threading may thread
- the iteration where i==0 such that we no longer optimize the
- BB. Another alternative to disable jump threading would be
- to wrap the read from `i' into a function returning i. */
if (arr[i] = i)
a = i;
else
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index baac11280fa..2b9b8f81274 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool may_peel_loop_headers)
return retval;
}
+bool
+jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
+{
+ gcc_checking_assert (!path.is_empty ());
+ edge taken_edge = path[path.length () - 1]->e;
+ loop_p loop = taken_edge->src->loop_father;
+ bool seen_latch = false;
+ bool path_crosses_loops = false;
+
+ for (unsigned int i = 0; i < path.length (); i++)
+ {
+ edge e = path[i]->e;
+
+ if (e == NULL)
+ {
+ // NULL outgoing edges on a path can happen for jumping to a
+ // constant address.
+ cancel_thread (&path, "Found NULL edge in jump threading path");
+ return true;
+ }
+
+ if (loop->latch == e->src || loop->latch == e->dest)
+ seen_latch = true;
+
+ // The first entry represents the block with an outgoing edge
+ // that we will redirect to the jump threading path. Thus we
+ // don't care about that block's loop father.
+ if ((i > 0 && e->src->loop_father != loop)
+ || e->dest->loop_father != loop)
+ path_crosses_loops = true;
+
+ if (flag_checking && !m_backedge_threads)
+ gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
+ }
+
+ if (cfun->curr_properties & PROP_loop_opts_done)
+ return false;
+
+ if (seen_latch && empty_block_p (loop->latch))
+ {
+ cancel_thread (&path, "Threading through latch before loop opts "
+ "would create non-empty latch");
+ return true;
+ }
+ if (path_crosses_loops)
+ {
+ cancel_thread (&path, "Path crosses loops");
+ return true;
+ }
+ return false;
+}
+
/* Register a jump threading opportunity. We queue up all the jump
threading opportunities discovered by a pass and update the CFG
and SSA form all at once.
@@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread (vec<jump_thread_edge *> *path)
return false;
}
- /* First make sure there are no NULL outgoing edges on the jump threading
- path. That can happen for jumping to a constant address. */
- for (unsigned int i = 0; i < path->length (); i++)
- {
- if ((*path)[i]->e == NULL)
- {
- cancel_thread (path, "Found NULL edge in jump threading path");
- return false;
- }
-
- if (flag_checking && !m_backedge_threads)
- gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0);
- }
+ if (cancel_invalid_paths (*path))
+ return false;
if (dump_file && (dump_flags & TDF_DETAILS))
dump_jump_thread_path (dump_file, *path, true);
diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
index 8b48a671212..d68795c9f27 100644
--- a/gcc/tree-ssa-threadupdate.h
+++ b/gcc/tree-ssa-threadupdate.h
@@ -75,6 +75,7 @@ protected:
unsigned long m_num_threaded_edges;
private:
virtual bool update_cfg (bool peel_loop_headers) = 0;
+ bool cancel_invalid_paths (vec<jump_thread_edge *> &path);
jump_thread_path_allocator m_allocator;
// True if threading through back edges is allowed. This is only
// allowed in the generic copier in the backward threader.
</cut>
[TCWG CI] Regression caused by gcc: tree-optimization/102570 - teach VN about internal functions:
commit 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
Author: Richard Biener <rguenther@suse.de>
tree-optimization/102570 - teach VN about internal functions
Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
18360
# First few build errors in logs:
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 make[2]: *** [scripts/Makefile.build:288: arch/arm64/hyperv/hv_core.o] Error 1
# 00:01:22 crypto/asymmetric_keys/asymmetric_type.c:481:15: error: ‘restrict_method’ is used uninitialized [-Werror=uninitialized]
# 00:01:22 make[2]: *** [scripts/Makefile.build:288: crypto/asymmetric_keys/asymmetric_type.o] Error 1
# 00:01:22 ./include/trace/perf.h:38:25: error: ‘entry’ is used uninitialized [-Werror=uninitialized]
# 00:01:22 ./include/trace/perf.h:44:13: error: ‘__entry_size’ is used uninitialized [-Werror=uninitialized]
# 00:01:23 security/keys/encrypted-keys/encrypted.c:660:19: error: ‘mkey’ is used uninitialized [-Werror=uninitialized]
# 00:01:23 security/keys/encrypted-keys/encrypted.c:905:19: error: ‘epayload’ is used uninitialized [-Werror=uninitialized]
from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
21404
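The exact kernel code paths are in the logs linked below; as a generic, hypothetical reduction (test.c is made up, not kernel code), the failing warning class can be reproduced like this:
<cut>
# A minimal pattern that trips -Werror=uninitialized, the warning class
# behind the build errors above.
cat > test.c <<'EOF'
struct res { long a0; };
long f (void)
{
  struct res r;  /* never initialized */
  return r.a0;   /* warning: 'r.a0' is used uninitialized */
}
EOF
aarch64-linux-gnu-gcc -O2 -Werror=uninitialized -c test.c
</cut>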
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_kernel/gnu-master-aarch64-next-allmodconfig
First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Reproduce builds:
<cut>
mkdir investigate-gcc-55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
cd investigate-gcc-55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 22d34a2a50651d01669b6fbcdb9677c18d2197c5
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
Author: Richard Biener <rguenther@suse.de>
Date: Mon Oct 4 10:57:45 2021 +0200
tree-optimization/102570 - teach VN about internal functions
We're now using internal functions for a lot of stuff but there's
still missing VN support out of laziness. The following instantiates
support and adds testcases for FRE and PRE (hoisting).
2021-10-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/102570
* tree-ssa-sccvn.h (vn_reference_op_struct): Document
we are using clique for the internal function code.
* tree-ssa-sccvn.c (vn_reference_op_eq): Compare the
internal function code.
(print_vn_reference_ops): Print the internal function code.
(vn_reference_op_compute_hash): Hash it.
(copy_reference_ops_from_call): Record it.
(visit_stmt): Remove the restriction around internal function
calls.
(fully_constant_vn_reference_p): Use fold_const_call and handle
internal functions.
(vn_reference_eq): Compare call return types.
* tree-ssa-pre.c (create_expression_by_pieces): Handle
generating calls to internal functions.
(compute_avail): Remove the restriction around internal function
calls.
* gcc.dg/tree-ssa/ssa-fre-96.c: New testcase.
* gcc.dg/tree-ssa/ssa-pre-33.c: Likewise.
---
gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c | 14 +++++
gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c | 15 +++++
gcc/tree-ssa-pre.c | 27 +++++----
gcc/tree-ssa-sccvn.c | 91 ++++++++++++++++++------------
gcc/tree-ssa-sccvn.h | 3 +-
5 files changed, 103 insertions(+), 47 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c
new file mode 100644
index 00000000000..fd1d5713b5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-fre1" } */
+
+_Bool f1(unsigned x, unsigned y, unsigned *res)
+{
+ _Bool t = __builtin_add_overflow(x, y, res);
+ unsigned res1;
+ _Bool t1 = __builtin_add_overflow(x, y, &res1);
+ *res -= res1;
+ return t==t1;
+}
+
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 1 "fre1" } } */
+/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c
new file mode 100644
index 00000000000..3b3bd629bc2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+_Bool f1(unsigned x, unsigned y, unsigned *res, int flag, _Bool *t)
+{
+ if (flag)
+ *t = __builtin_add_overflow(x, y, res);
+ unsigned res1;
+ _Bool t1 = __builtin_add_overflow(x, y, &res1);
+ *res -= res1;
+ return *t==t1;
+}
+
+/* We should hoist the .ADD_OVERFLOW to before the check. */
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 1 "pre" } } */
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 08755847f66..1cc1aae694f 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -2855,9 +2855,13 @@ create_expression_by_pieces (basic_block block, pre_expr expr,
unsigned int operand = 1;
vn_reference_op_t currop = &ref->operands[0];
tree sc = NULL_TREE;
- tree fn = find_or_generate_expression (block, currop->op0, stmts);
- if (!fn)
- return NULL_TREE;
+ tree fn = NULL_TREE;
+ if (currop->op0)
+ {
+ fn = find_or_generate_expression (block, currop->op0, stmts);
+ if (!fn)
+ return NULL_TREE;
+ }
if (currop->op1)
{
sc = find_or_generate_expression (block, currop->op1, stmts);
@@ -2873,12 +2877,19 @@ create_expression_by_pieces (basic_block block, pre_expr expr,
return NULL_TREE;
args.quick_push (arg);
}
- gcall *call = gimple_build_call_vec (fn, args);
+ gcall *call;
+ if (currop->op0)
+ {
+ call = gimple_build_call_vec (fn, args);
+ gimple_call_set_fntype (call, currop->type);
+ }
+ else
+ call = gimple_build_call_internal_vec ((internal_fn)currop->clique,
+ args);
gimple_set_location (call, expr->loc);
- gimple_call_set_fntype (call, currop->type);
if (sc)
gimple_call_set_chain (call, sc);
- tree forcedname = make_ssa_name (TREE_TYPE (currop->type));
+ tree forcedname = make_ssa_name (ref->type);
gimple_call_set_lhs (call, forcedname);
/* There's no CCP pass after PRE which would re-compute alignment
information so make sure we re-materialize this here. */
@@ -4004,10 +4015,6 @@ compute_avail (function *fun)
vn_reference_s ref1;
pre_expr result = NULL;
- /* We can value number only calls to real functions. */
- if (gimple_call_internal_p (stmt))
- continue;
-
vn_reference_lookup_call (as_a <gcall *> (stmt), &ref, &ref1);
/* There is no point to PRE a call without a value. */
if (!ref || !ref->result)
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 416a5252144..0d942218279 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -70,6 +70,7 @@ along with GCC; see the file COPYING3. If not see
#include "tree-scalar-evolution.h"
#include "tree-ssa-loop-niter.h"
#include "builtins.h"
+#include "fold-const-call.h"
#include "tree-ssa-sccvn.h"
/* This algorithm is based on the SCC algorithm presented by Keith
@@ -212,7 +213,8 @@ vn_reference_op_eq (const void *p1, const void *p2)
TYPE_MAIN_VARIANT (vro2->type))))
&& expressions_equal_p (vro1->op0, vro2->op0)
&& expressions_equal_p (vro1->op1, vro2->op1)
- && expressions_equal_p (vro1->op2, vro2->op2));
+ && expressions_equal_p (vro1->op2, vro2->op2)
+ && (vro1->opcode != CALL_EXPR || vro1->clique == vro2->clique));
}
/* Free a reference operation structure VP. */
@@ -264,15 +266,18 @@ print_vn_reference_ops (FILE *outfile, const vec<vn_reference_op_s> ops)
&& TREE_CODE_CLASS (vro->opcode) != tcc_declaration)
{
fprintf (outfile, "%s", get_tree_code_name (vro->opcode));
- if (vro->op0)
+ if (vro->op0 || vro->opcode == CALL_EXPR)
{
fprintf (outfile, "<");
closebrace = true;
}
}
- if (vro->op0)
+ if (vro->op0 || vro->opcode == CALL_EXPR)
{
- print_generic_expr (outfile, vro->op0);
+ if (!vro->op0)
+ fprintf (outfile, internal_fn_name ((internal_fn)vro->clique));
+ else
+ print_generic_expr (outfile, vro->op0);
if (vro->op1)
{
fprintf (outfile, ",");
@@ -684,6 +689,8 @@ static void
vn_reference_op_compute_hash (const vn_reference_op_t vro1, inchash::hash &hstate)
{
hstate.add_int (vro1->opcode);
+ if (vro1->opcode == CALL_EXPR && !vro1->op0)
+ hstate.add_int (vro1->clique);
if (vro1->op0)
inchash::add_expr (vro1->op0, hstate);
if (vro1->op1)
@@ -769,11 +776,16 @@ vn_reference_eq (const_vn_reference_t const vr1, const_vn_reference_t const vr2)
if (vr1->type != vr2->type)
return false;
}
+ else if (vr1->type == vr2->type)
+ ;
else if (COMPLETE_TYPE_P (vr1->type) != COMPLETE_TYPE_P (vr2->type)
|| (COMPLETE_TYPE_P (vr1->type)
&& !expressions_equal_p (TYPE_SIZE (vr1->type),
TYPE_SIZE (vr2->type))))
return false;
+ else if (vr1->operands[0].opcode == CALL_EXPR
+ && !types_compatible_p (vr1->type, vr2->type))
+ return false;
else if (INTEGRAL_TYPE_P (vr1->type)
&& INTEGRAL_TYPE_P (vr2->type))
{
@@ -1270,6 +1282,8 @@ copy_reference_ops_from_call (gcall *call,
temp.type = gimple_call_fntype (call);
temp.opcode = CALL_EXPR;
temp.op0 = gimple_call_fn (call);
+ if (gimple_call_internal_p (call))
+ temp.clique = gimple_call_internal_fn (call);
temp.op1 = gimple_call_chain (call);
if (stmt_could_throw_p (cfun, call) && (lr = lookup_stmt_eh_lp (call)) > 0)
temp.op2 = size_int (lr);
@@ -1459,9 +1473,11 @@ fully_constant_vn_reference_p (vn_reference_t ref)
a call to a builtin function with at most two arguments. */
op = &operands[0];
if (op->opcode == CALL_EXPR
- && TREE_CODE (op->op0) == ADDR_EXPR
- && TREE_CODE (TREE_OPERAND (op->op0, 0)) == FUNCTION_DECL
- && fndecl_built_in_p (TREE_OPERAND (op->op0, 0))
+ && (!op->op0
+ || (TREE_CODE (op->op0) == ADDR_EXPR
+ && TREE_CODE (TREE_OPERAND (op->op0, 0)) == FUNCTION_DECL
+ && fndecl_built_in_p (TREE_OPERAND (op->op0, 0),
+ BUILT_IN_NORMAL)))
&& operands.length () >= 2
&& operands.length () <= 3)
{
@@ -1481,13 +1497,17 @@ fully_constant_vn_reference_p (vn_reference_t ref)
anyconst = true;
if (anyconst)
{
- tree folded = build_call_expr (TREE_OPERAND (op->op0, 0),
- arg1 ? 2 : 1,
- arg0->op0,
- arg1 ? arg1->op0 : NULL);
- if (folded
- && TREE_CODE (folded) == NOP_EXPR)
- folded = TREE_OPERAND (folded, 0);
+ combined_fn fn;
+ if (op->op0)
+ fn = as_combined_fn (DECL_FUNCTION_CODE
+ (TREE_OPERAND (op->op0, 0)));
+ else
+ fn = as_combined_fn ((internal_fn) op->clique);
+ tree folded;
+ if (arg1)
+ folded = fold_const_call (fn, ref->type, arg0->op0, arg1->op0);
+ else
+ folded = fold_const_call (fn, ref->type, arg0->op0);
if (folded
&& is_gimple_min_invariant (folded))
return folded;
@@ -5648,28 +5668,27 @@ visit_stmt (gimple *stmt, bool backedges_varying_p = false)
&& TREE_CODE (TREE_OPERAND (fn, 0)) == FUNCTION_DECL)
extra_fnflags = flags_from_decl_or_type (TREE_OPERAND (fn, 0));
}
- if (!gimple_call_internal_p (call_stmt)
- && (/* Calls to the same function with the same vuse
- and the same operands do not necessarily return the same
- value, unless they're pure or const. */
- ((gimple_call_flags (call_stmt) | extra_fnflags)
- & (ECF_PURE | ECF_CONST))
- /* If calls have a vdef, subsequent calls won't have
- the same incoming vuse. So, if 2 calls with vdef have the
- same vuse, we know they're not subsequent.
- We can value number 2 calls to the same function with the
- same vuse and the same operands which are not subsequent
- the same, because there is no code in the program that can
- compare the 2 values... */
- || (gimple_vdef (call_stmt)
- /* ... unless the call returns a pointer which does
- not alias with anything else. In which case the
- information that the values are distinct are encoded
- in the IL. */
- && !(gimple_call_return_flags (call_stmt) & ERF_NOALIAS)
- /* Only perform the following when being called from PRE
- which embeds tail merging. */
- && default_vn_walk_kind == VN_WALK)))
+ if (/* Calls to the same function with the same vuse
+ and the same operands do not necessarily return the same
+ value, unless they're pure or const. */
+ ((gimple_call_flags (call_stmt) | extra_fnflags)
+ & (ECF_PURE | ECF_CONST))
+ /* If calls have a vdef, subsequent calls won't have
+ the same incoming vuse. So, if 2 calls with vdef have the
+ same vuse, we know they're not subsequent.
+ We can value number 2 calls to the same function with the
+ same vuse and the same operands which are not subsequent
+ the same, because there is no code in the program that can
+ compare the 2 values... */
+ || (gimple_vdef (call_stmt)
+ /* ... unless the call returns a pointer which does
+ not alias with anything else. In which case the
+ information that the values are distinct are encoded
+ in the IL. */
+ && !(gimple_call_return_flags (call_stmt) & ERF_NOALIAS)
+ /* Only perform the following when being called from PRE
+ which embeds tail merging. */
+ && default_vn_walk_kind == VN_WALK))
changed = visit_reference_op_call (lhs, call_stmt);
else
changed = defs_to_varying (call_stmt);
diff --git a/gcc/tree-ssa-sccvn.h b/gcc/tree-ssa-sccvn.h
index 96100596d2e..8a1b649c726 100644
--- a/gcc/tree-ssa-sccvn.h
+++ b/gcc/tree-ssa-sccvn.h
@@ -106,7 +106,8 @@ typedef const struct vn_phi_s *const_vn_phi_t;
typedef struct vn_reference_op_struct
{
ENUM_BITFIELD(tree_code) opcode : 16;
- /* Dependence info, used for [TARGET_]MEM_REF only. */
+ /* Dependence info, used for [TARGET_]MEM_REF only. For internal
+ function calls clique is also used for the internal function code. */
unsigned short clique;
unsigned short base;
unsigned reverse : 1;
</cut>
After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
aarch64: Improve size heuristic for cpymem expansion
the following benchmarks grew in size by more than 1%:
- 458.sjeng grew in size by 4% from 105780 to 109944 bytes
- 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date: Wed Sep 29 11:21:45 2021 +0100
aarch64: Improve size heuristic for cpymem expansion
Similar to my previous patch for setmem this one does the same for the cpymem expansion.
We count the number of ops emitted and compare it against the alternative of just calling
the library function when optimising for size.
For the code:
void
cpy_127 (char *out, char *in)
{
__builtin_memcpy (out, in, 127);
}
void
cpy_128 (char *out, char *in)
{
__builtin_memcpy (out, in, 128);
}
we now emit a call to memcpy (with an extra MOV-immediate instruction for the size) instead of:
cpy_127(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldr q0, [x1, 96]
str q0, [x0, 96]
ldr q0, [x1, 111]
str q0, [x0, 111]
ret
cpy_128(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldp q0, q1, [x1, 96]
stp q0, q1, [x0, 96]
ret
which is a clear code size win. Speed optimisation heuristics remain unchanged.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of
emitted operations and adjust heuristic for code size.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* gcc.target/aarch64/cpymem-size.c: New test.
---
gcc/config/aarch64/aarch64.c | 36 ++++++++++++++++++--------
gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++
2 files changed, 54 insertions(+), 11 deletions(-)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ac17c1c88fb..a9a1800af53 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
}
/* Expand cpymem, as if from a __builtin_memcpy. Return true if
- we succeed, otherwise return false. */
+ we succeed, otherwise return false, indicating that a libcall to
+ memcpy should be emitted. */
bool
aarch64_expand_cpymem (rtx *operands)
@@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands)
unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
- /* Inline up to 256 bytes when optimizing for speed. */
+ /* Try to inline up to 256 bytes. */
unsigned HOST_WIDE_INT max_copy_size = 256;
- if (optimize_function_for_size_p (cfun))
- max_copy_size = 128;
+ bool size_p = optimize_function_for_size_p (cfun);
+
+ if (size > max_copy_size)
+ return false;
int copy_bits = 256;
@@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands)
|| !TARGET_SIMD
|| (aarch64_tune_params.extra_tuning_flags
& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
- {
- copy_bits = 128;
- max_copy_size = max_copy_size / 2;
- }
+ copy_bits = 128;
- if (size > max_copy_size)
- return false;
+ /* Emit an inline load+store sequence and count the number of operations
+ involved. We use a simple count of just the loads and stores emitted
+ rather than rtx_insn count as all the pointer adjustments and reg copying
+ in this function will get optimized away later in the pipeline. */
+ start_sequence ();
+ unsigned nops = 0;
base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands)
cur_mode = V4SImode;
aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
-
+ /* A single block copy is 1 load + 1 store. */
+ nops += 2;
n -= mode_bits;
/* Emit trailing copies using overlapping unaligned accesses - this is
@@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands)
n = n_bits;
}
}
+ rtx_insn *seq = get_insns ();
+ end_sequence ();
+
+ /* A memcpy libcall in the worst case takes 3 instructions to prepare the
+ arguments + 1 for the call. */
+ unsigned libcall_cost = 4;
+ if (size_p && libcall_cost < nops)
+ return false;
+ emit_insn (seq);
return true;
}
diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
new file mode 100644
index 00000000000..4d488b74301
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#include <stdlib.h>
+
+/*
+** cpy_127:
+** mov x2, 127
+** b memcpy
+*/
+void
+cpy_127 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 127);
+}
+
+/*
+** cpy_128:
+** mov x2, 128
+** b memcpy
+*/
+void
+cpy_128 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 128);
+}
+
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
</cut>
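As a quick check of the heuristic described above (the cross-compiler name is illustrative), the commit's own examples can be compiled at -Os to confirm that a libcall is now emitted instead of the inline LDP/STP sequence:
<cut>
cat > cpymem-size.c <<'EOF'
void cpy_127 (char *out, char *in) { __builtin_memcpy (out, in, 127); }
void cpy_128 (char *out, char *in) { __builtin_memcpy (out, in, 128); }
EOF
# Expect "mov x2, 127; b memcpy" and "mov x2, 128; b memcpy" bodies.
aarch64-linux-gnu-gcc -Os -S -o - cpymem-size.c
</cut>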