Progress:
* UM-2 [QEMU upstream maintainership]
- More code review: now have a target-arm.next poised and ready to
send once 6.2 is released
* QEMU-420 [GICv4 emulation]
- Working on the ITS changes needed for GICv4 support (this turns
out to be a more tractable end to start than the redistributor)
- I have a preliminary set of 25 or so patches to the ITS which
clean up the code and fix some pre-existing bugs that I found
while working on the GICv4 changes
- have implemented the new VMAPI, VMAPTI, VMAPP ITS commands
-- PMM
After llvm commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea
Author: Sander de Smalen <sander.desmalen(a)arm.com>
[LV] Pass compare predicate to getCmpSelInstrCost.
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 7% from 11115 to 11846 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea
cd investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 3d549dddf75b6ff9e0ec8c053677750bde4226ea
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach ab31d003e16e483bff298ea2f28fec0f23e8eb79
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea
Author: Sander de Smalen <sander.desmalen(a)arm.com>
Date: Mon Dec 6 11:14:27 2021 +0000
[LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.
I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D114646
---
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 11 ++++++++---
llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll | 14 +++++++-------
2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 050879144afd..c03e506b7474 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7570,8 +7570,12 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
Type *CondTy = SI->getCondition()->getType();
if (!ScalarCond)
CondTy = VectorType::get(CondTy, VF);
- return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy,
- CmpInst::BAD_ICMP_PREDICATE, CostKind, I);
+
+ CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE;
+ if (auto *Cmp = dyn_cast<CmpInst>(SI->getCondition()))
+ Pred = Cmp->getPredicate();
+ return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, Pred,
+ CostKind, I);
}
case Instruction::ICmp:
case Instruction::FCmp: {
@@ -7581,7 +7585,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]);
VectorTy = ToVectorTy(ValTy, VF);
return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, nullptr,
- CmpInst::BAD_ICMP_PREDICATE, CostKind, I);
+ cast<CmpInst>(I)->getPredicate(), CostKind,
+ I);
}
case Instruction::Store:
case Instruction::Load: {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
index 62b18f44fbc5..20d2dc0b7cda 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
@@ -5,17 +5,17 @@ target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-ios5.0.0"
define void @selects_1(i32* nocapture %dst, i32 %A, i32 %B, i32 %C, i32 %N) {
-; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
-; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
-; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
-; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
-; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
-; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
; CHECK-LABEL: define void @selects_1(
; CHECK: vector.body:
-; CHECK: select <2 x i1>
+; CHECK: select <4 x i1>
entry:
%cmp26 = icmp sgt i32 %N, 0
</cut>
Dear Linaro Toolchain Working Group,
clang-thumbv7-full-2stage is red for 20 days.
Could you take it to the staging area and make it green again, please?
Thanks
Galina
After llvm commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d
Author: David Blaikie <dblaikie(a)gmail.com>
Add missing header
the following benchmarks slowed down by more than 2%:
- 433.milc slowed down by 4% from 12427 to 12916 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d
cd investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach bd4c6a476fd037fb07a1c484f75d93ee40713d3d
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 7d4da4e1ab7f79e51db0d5c2a0f5ef1711122dd7
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d
Author: David Blaikie <dblaikie(a)gmail.com>
Date: Mon Nov 29 16:29:25 2021 -0800
Add missing header
---
llvm/lib/Demangle/DLangDemangle.cpp | 1 +
1 file changed, 1 insertion(+)
diff --git a/llvm/lib/Demangle/DLangDemangle.cpp b/llvm/lib/Demangle/DLangDemangle.cpp
index faf91b239490..f380aa90035e 100644
--- a/llvm/lib/Demangle/DLangDemangle.cpp
+++ b/llvm/lib/Demangle/DLangDemangle.cpp
@@ -17,6 +17,7 @@
#include "llvm/Demangle/StringView.h"
#include "llvm/Demangle/Utility.h"
+#include <cctype>
#include <cstring>
#include <limits>
</cut>
VirtIO Initiative ([STR-9])
===========================
- synced up on [AX_XDP task with Akashi-san]
- synced on rust-vmm
[AX_XDP task with Akashi-san]
<https://linaro.atlassian.net/browse/STR-68>
vhost-device maintainer effort ([UM-196])
- started looking at https://github.com/rust-vmm/vhost-device/pull/4
QEMU Upstream Work ([UM-2])
===========================
- posted [PULL for 6.2 0/8] more tcg, plugin, test and build fixes
Message-Id: <20211129171449.4176301-1-alex.bennee(a)linaro.org>
- commented on Re: Follow-up on the CXL discussion at OFTC Message-Id:
<20211119015207.62fhk5mjmvaj5nz4(a)intel.com> to see if I can unblock
- posted [RFC PATCH] blog post: how to get your new feature
up-streamed Message-Id:
<20211126203319.3298089-1-alex.bennee(a)linaro.org>
- posted [PATCH for 6.2?] Revert "vga: don't abort when adding a
duplicate isa-vga device" Message-Id:
<20211202164929.1119036-1-alex.bennee(a)linaro.org>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>
[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
Other
=====
- wrote [RFC PATCH 0/2] insn plugin tweaks for measuring frequency
Message-Id: <20211203144421.1445232-1-alex.bennee(a)linaro.org>
- might make a good basis for a TCG plugins blog post
Completed Reviews [2/2]
=======================
[PATCH] tests/plugin/syscall.c: fix compiler warnings
Message-Id: <20211128011551.2115468-1-juro.bystricky(a)intel.com>
[PATCH for-6.2? 0/2] arm_gicv3: Fix handling of LPIs in list registers
Message-Id: <20211126163915.1048353-2-peter.maydell(a)linaro.org>
Current Review Queue
====================
TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64
Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
===============================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================
--
Alex Bennée
Progress:
* UM-2 [QEMU upstream maintainership]
- Code review: worked through some of the backlog and accumulated
a list of series to take once the tree reopens for 7.0
- Wrote and sent some cleanup patches relating to the qemu-common.h
header file
- Fixed a bug where we miscalculated the length for TLB range
invalidations
* QEMU-420 [GICv4 emulation]
- Found the problem with PCI passthrough in my nested test setup:
apparently virtio PCI devices need an extra command line argument
to get them to honour the presence of an IOMMU. Everything is
now working and I've put some notes about the setup into
https://linaro.atlassian.net/browse/QEMU-447
- started to implement the GICv4 redistributor changes
-- PMM
VirtIO Initiative ([STR-9])
===========================
- [this weeks sync], topics on AF_XDP, virtio-video and
virtio-watchdog
[upstream rust-vmm sync meeting]
<https://etherpad.opendev.org/p/rust-vmm-sync-2021&sa=D&source=calendar&ust=…>
QEMU Upstream Work ([UM-2])
===========================
- posted [PATCH for 6.2 v2 0/7] more tcg, plugin, test and build fixes
Message-Id: <20211125154144.2904741-1-alex.bennee(a)linaro.org>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v8 00/10] MTTCG sanity tests for ARM
Message-Id: <20211118184650.661575-1-alex.bennee(a)linaro.org>
[mttcg tests to current state and fixed up]
<https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>
Other
=====
- renewal feedback
Completed Reviews [2/2]
=======================
[PATCH v2 0/3] KVM: qemu patches for few KVM features I developed
Message-Id: <20211101132300.192584-1-mlevitsk(a)redhat.com>
[PATCH v2] hw/intc/arm_gicv3: Update cached state after LPI state changes
Message-Id: <20211124202005.989935-1-peter.maydell(a)linaro.org>
Absences
========
- off 2 days sick
Current Review Queue
====================
TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64
Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
===============================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================
--
Alex Bennée
Progress:
* QEMU-420 [GICv4 emulation]
- Tracked down and fixed a bug in our ITS emulation which would
(intermittently?) result in a Linux guest reporting "irq 54:
nobody cared" and hanging, because we were not correctly
recalculating the highest priority pending interrupt when the
guest acknowledged a pending LPI. This fix will go into 6.2.
- Set up a test environment for GICv4 work -- because the major
feature of GICv4 is support for directly injecting interrupts
into a VM, the test setup needs to be nested virtualization,
where an outer L1 guest runs on pure emulated QEMU, the inner
L2 guest uses KVM (as provided by L1), and we pass a PCI device
(emulated by QEMU) through from L1 to L2. I think I have this
correctly set up now, but...
- ...the L2 guest hangs because it apparently never sees an
interrupt from the passed-through PCI device. This implies a
bug in our current GICv3 emulation somewhere: need to track
this down before starting in on GICv4 work.
- Separately, I found through code inspection a bug where we
do the wrong thing in the non-passthrough case when the L1 guest
sets a virtual interrupt for the L2 guest in the GIC list
registers and that interrupt has an ID > 1023 (ie it is an LPI).
We got this wrong both for acknowledging and ending an interrupt,
so the two bugs cancel each other out except that we don't set
the vCPU priority and so the L2 guest might get an unexpected
interrupt while it was servicing the LPI. Patches sent.
-- PMM
VirtIO Initiative ([STR-9])
===========================
- posted Initial thoughts for test scenarios for AF_XDP epic
Message-Id: <87k0h5v6ju.fsf(a)linaro.org>
vhost-device maintainer effort ([UM-196])
- more review
- [did some more noodling with rust] to get comfortable with generics
[UM-196] <https://linaro.atlassian.net/browse/UM-196>
[vhost-device crate] <https://github.com/rust-vmm/vhost-device>
[did some more noodling with rust]
<https://gitlab.com/stsquad/softfloat.rs>
QEMU Upstream Work ([UM-2])
===========================
- posted [PULL for 6.2 0/7] misc build and test fixes Message-Id:
<20211116162515.4100231-1-alex.bennee(a)linaro.org>
- posted [RFC PATCH] tests/avocado: fix tcg_plugin mem access count
test Message-Id: <20211117095448.136558-1-alex.bennee(a)linaro.org>
- posted Re: [RFC PATCH] plugins/meson.build: fix linker issue with
weird paths (for v6.2?) Message-Id:
<20211117111924.179776-1-alex.bennee(a)linaro.org>
- posted Re: [PATCH v2 1/3] icount: preserve cflags when custom tb is
about to execute Message-Id: <87h7cbw1tx.fsf(a)linaro.org>
- posted [RFC PATCH] gdbstub: handle a potentially racing TaskState
Message-Id: <20211119145124.942390-1-alex.bennee(a)linaro.org>
[UM-2] <https://linaro.atlassian.net/browse/UM-2>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v3 0/3] GIC ITS tests Message-Id:
<20211112114734.3058678-1-alex.bennee(a)linaro.org>
- posted [kvm-unit-tests PATCH v8 00/10] MTTCG sanity tests for ARM
Message-Id: <20211118184650.661575-1-alex.bennee(a)linaro.org>
[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
[mttcg tests to current state and fixed up]
<https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>
Completed Reviews [2/2]
=======================
[PATCH v2 0/3] Some watchpoint-related patches
Message-Id: <163662450348.125458.5494710452733592356.stgit@pasha-ThinkPad-X280>
[PATCH 0/5] Update linux-headers + NOIRQ support for KVM gdbstub
Message-Id: <20211111110604.207376-1-pbonzini(a)redhat.com>
Absences
========
- none
Current Review Queue
====================
TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64
Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
===============================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================
--
Alex Bennée
Progress (short week, 3 days)
* UM-2 [QEMU upstream maintainership]
- Still trying to sort out the regression of booting EL3 guest
code on the imx7 board. I got most of the way through prototyping
a cleanup which would fix this, but then spotted that the
highbank board has a more awkward-to-fix similar problem.
We're going to revert the PSCI emulation change for 6.2 so we
can take the time to get the cleanup right and land it in 7.0.
- Usual patch accumulation, review, etc during release cycle
-- PMM
VirtIO Initiative ([STR-9])
===========================
- project admin
[STR-9] <https://linaro.atlassian.net/browse/STR-9>
[upstream rust-vmm sync meeting]
<https://etherpad.opendev.org/p/rust-vmm-sync-2021&sa=D&source=calendar&ust=…>
[proposal] <https://github.com/rust-vmm/vhost-device/pull/57>
vhost-device maintainer effort ([UM-196])
- did a bunch of review on [vhost-device crate]
[UM-196] <https://linaro.atlassian.net/browse/UM-196>
[vhost-device crate] <https://github.com/rust-vmm/vhost-device>
QEMU Upstream Work ([UM-2])
===========================
[UM-2] <https://linaro.atlassian.net/browse/UM-2>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v3 0/3] GIC ITS tests Message-Id:
<20211112114734.3058678-1-alex.bennee(a)linaro.org>
- might as well flush the tree state as I left it
- posted [RFC PATCH] hw/intc: clean-up error reporting for failed
ITS cmd Message-Id:
<20211112170454.3158925-1-alex.bennee(a)linaro.org>
- re-based [mttcg tests to current state and fixed up]
[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
[mttcg tests to current state and fixed up]
<https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>
Completed Reviews [2/2]
=======================
[PATCH v2 0/3] Some watchpoint-related patches
Message-Id: <163662450348.125458.5494710452733592356.stgit@pasha-ThinkPad-X280>
[PATCH 0/5] Update linux-headers + NOIRQ support for KVM gdbstub
Message-Id: <20211111110604.207376-1-pbonzini(a)redhat.com>
Absences
========
- none
Current Review Queue
====================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================
TODO [PATCH] softmmu: fix watchpoint processing in icount mode
Message-Id: <163101424137.678744.18360776310711795413.stgit@pasha-ThinkPad-X280>
==============================================================================================================================================
--
Alex Bennée
Progress (short week, 3 days)
* UM-2 [QEMU upstream maintainership]
- recent changes to QEMU's PSCI emulation broke booting of guest code
at EL3 on the imx7 board, which was previously accidentally
relying on PSCI-emulation-via-SMC not getting in its way despite
being enabled. We need to make this board disable PSCI when the
guest code is booting to EL3, as the virt board does, but it's
trickier here because the CPU-creation code is hidden inside a
model of an SoC object. After some on-list discussion I have a
plan for how to restructure this, and need to write some code...
* QEMU-420 [GICv4 emulation]
- re-read the GIC architecture specification, acquired a better
understanding of the required work, and broke this epic down into
stories
- discussed with Leif how the ITS support should be landed in the
sbsa-ref board
Misc:
* higher-than-usual amount of meetings and meeting-prep this week
-- PMM
After llvm commit f411c1dd95092139c8b992260705ac0b75c8583f
Author: Peter Klausler <pklausler(a)nvidia.com>
[flang] Fix crash in semantic error recovery situation
the following benchmarks slowed down by more than 2%:
- 456.hmmer slowed down by 3% from 7600 to 7806 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-f411c1dd95092139c8b992260705ac0b75c8583f
cd investigate-llvm-f411c1dd95092139c8b992260705ac0b75c8583f
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach f411c1dd95092139c8b992260705ac0b75c8583f
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach c0b298fc213c1b33e97ca72fba58597365375875
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit f411c1dd95092139c8b992260705ac0b75c8583f
Author: Peter Klausler <pklausler(a)nvidia.com>
Date: Tue Nov 2 16:41:15 2021 -0700
[flang] Fix crash in semantic error recovery situation
A CHECK() in semantics is triggering when analyzing a program
with an undefined derived type pointer because the CHECK is
expecting a new error message to have been issued in a function
but not allowing for the case that a diagnostic could have been
produced earlier. Adjust the predicate.
Differential Revision: https://reviews.llvm.org/D113307
---
flang/lib/Semantics/expression.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/flang/lib/Semantics/expression.cpp b/flang/lib/Semantics/expression.cpp
index 331b9b2cf5bc..8ee8c9a9c9ce 100644
--- a/flang/lib/Semantics/expression.cpp
+++ b/flang/lib/Semantics/expression.cpp
@@ -1916,7 +1916,7 @@ auto ExpressionAnalyzer::AnalyzeProcedureComponentRef(
"Base of procedure component reference is not a derived-type object"_err_en_US);
}
}
- CHECK(!GetContextualMessages().empty());
+ CHECK(context_.AnyFatalError());
return std::nullopt;
}
</cut>
Hello,
We have been using Linaro GCC 7.5-2019.12 for the A53.
As we move on to new tech there seems to be no support for "-
mcpu=cortex-a55".
Today, we use the aarch64-elf- toolchain.
What GCC do you suggest we start using for A55 ?
Thanks,
Stefan
VirtIO Initiative ([STR-9])
===========================
- various rust-vmm discussions
- [upstream rust-vmm sync meeting]
- how to deal with vhost-device/vm-virtio split: [proposal]
- synced with ARM on their interests
- got update on Fwd: FW: [App-services] Slides from the
hypervisor-less virtio status meeting Message-Id:
<CAHDbmO2G4hUyfxtaxwnbxsrMk+P41zbL-7VNe=Aa6DshxC-5zQ(a)mail.gmail.com>
[STR-9] <https://linaro.atlassian.net/browse/STR-9>
[upstream rust-vmm sync meeting]
<https://etherpad.opendev.org/p/rust-vmm-sync-2021&sa=D&source=calendar&ust=…>
[proposal] <https://github.com/rust-vmm/vhost-device/pull/57>
QEMU Upstream Work ([UM-2])
===========================
- did some bug triage and investigated [555] and [690] which might
intersect with earlier changes I made
- spent time on the PR from hell [PULL 00/30] testing, gdbstub and
semihosting Message-Id:
<20210115130828.23968-1-alex.bennee(a)linaro.org>
[UM-2] <https://linaro.atlassian.net/browse/UM-2>
[555] <https://gitlab.com/qemu-project/qemu/-/issues/555>
[690] <https://gitlab.com/qemu-project/qemu/-/issues/690>
Other
=====
- TSC report preparation for QEMU and Stratos
Completed Reviews [1/1]
=======================
[XEN PATCH v7 00/51] xen: Build system improvements, now with out-of-tree build!
Message-Id: <20210824105038.1257926-1-anthony.perard(a)citrix.com>
Absences
========
,----
| (save-excursion
| (goto-char (point-min))
| (when (re-search-forward "* Absences")
| (goto-char (match-beginning 0))
| (org-export-as 'ascii t nil t )))
`----
Current Review Queue
====================
TODO [PATCH v2 00/48] tcg: optimize redundant sign extensions
Message-Id: <20211007195456.1168070-1-richard.henderson(a)linaro.org>
================================================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================
--
Alex Bennée
Progress
* UM-2 [QEMU upstream maintainership]
+ worked through the big pile of email that had built up while
I was on holiday...
+ some long-delayed sysadmin tasks on my work machines now I have
an opportunity to go into the office and do things that would be
too risky with only remote access
+ triaged a bunch of Coverity issues
* QEMU-406 [QEMU support for MVE (M-profile Vector Extension; Helium)]
+ All work here has now gone upstream; closed!
-- PMM
After llvm commit fbc0c308d599fe3300ab6516650b65b41979446d
Author: Nikita Popov <nikita.ppv(a)gmail.com>
[BasicAA] Handle known bits as ranges
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 7% from 10899 to 11610 perf samples
- 464.h264ref:libc.so.6 slowed down by 11% from 3538 to 3922 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d
cd investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach fbc0c308d599fe3300ab6516650b65b41979446d
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 30a3652b6ade43504087f6e3acd8dc879055f501
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit fbc0c308d599fe3300ab6516650b65b41979446d
Author: Nikita Popov <nikita.ppv(a)gmail.com>
Date: Mon Oct 25 15:47:21 2021 +0200
[BasicAA] Handle known bits as ranges
BasicAA currently tries to determine that the offset is positive by
checking whether all variable indices are positive based on known
bits, multiplied by a positive scale. However, this is incorrect
if the scale multiplication might overflow. In the modified test
case the original value is positive, but may be negative after a
left shift.
Fix this by converting known bits into a constant range and reusing
the range-based logic, which handles overflow correctly.
Differential Revision: https://reviews.llvm.org/D112611
---
llvm/lib/Analysis/BasicAliasAnalysis.cpp | 51 +++++-----------------
.../test/Analysis/BasicAA/assume-index-positive.ll | 4 +-
2 files changed, 12 insertions(+), 43 deletions(-)
diff --git a/llvm/lib/Analysis/BasicAliasAnalysis.cpp b/llvm/lib/Analysis/BasicAliasAnalysis.cpp
index 0305732ca5d5..8cf947c43bf4 100644
--- a/llvm/lib/Analysis/BasicAliasAnalysis.cpp
+++ b/llvm/lib/Analysis/BasicAliasAnalysis.cpp
@@ -318,15 +318,6 @@ struct CastedValue {
return N;
}
- KnownBits evaluateWith(KnownBits N) const {
- assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() &&
- "Incompatible bit width");
- if (TruncBits) N = N.trunc(N.getBitWidth() - TruncBits);
- if (SExtBits) N = N.sext(N.getBitWidth() + SExtBits);
- if (ZExtBits) N = N.zext(N.getBitWidth() + ZExtBits);
- return N;
- }
-
ConstantRange evaluateWith(ConstantRange N) const {
assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() &&
"Incompatible bit width");
@@ -1250,8 +1241,6 @@ AliasResult BasicAAResult::aliasGEP(
if (!DecompGEP1.VarIndices.empty()) {
APInt GCD;
- bool AllNonNegative = DecompGEP1.Offset.isNonNegative();
- bool AllNonPositive = DecompGEP1.Offset.isNonPositive();
ConstantRange OffsetRange = ConstantRange(DecompGEP1.Offset);
for (unsigned i = 0, e = DecompGEP1.VarIndices.size(); i != e; ++i) {
const VariableGEPIndex &Index = DecompGEP1.VarIndices[i];
@@ -1266,24 +1255,19 @@ AliasResult BasicAAResult::aliasGEP(
else
GCD = APIntOps::GreatestCommonDivisor(GCD, ScaleForGCD.abs());
- if (AllNonNegative || AllNonPositive) {
- KnownBits Known = Index.Val.evaluateWith(
- computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT));
- bool SignKnownZero = Known.isNonNegative();
- bool SignKnownOne = Known.isNegative();
- AllNonNegative &= (SignKnownZero && Scale.isNonNegative()) ||
- (SignKnownOne && Scale.isNonPositive());
- AllNonPositive &= (SignKnownZero && Scale.isNonPositive()) ||
- (SignKnownOne && Scale.isNonNegative());
- }
+ ConstantRange CR =
+ computeConstantRange(Index.Val.V, true, &AC, Index.CxtI);
+ KnownBits Known =
+ computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT);
+ CR = CR.intersectWith(
+ ConstantRange::fromKnownBits(Known, /* Signed */ true),
+ ConstantRange::Signed);
assert(OffsetRange.getBitWidth() == Scale.getBitWidth() &&
"Bit widths are normalized to MaxPointerSize");
- OffsetRange = OffsetRange.add(Index.Val
- .evaluateWith(computeConstantRange(
- Index.Val.V, true, &AC, Index.CxtI))
- .sextOrTrunc(OffsetRange.getBitWidth())
- .smul_fast(ConstantRange(Scale)));
+ OffsetRange = OffsetRange.add(
+ Index.Val.evaluateWith(CR).sextOrTrunc(OffsetRange.getBitWidth())
+ .smul_fast(ConstantRange(Scale)));
}
// We now have accesses at two offsets from the same base:
@@ -1300,21 +1284,6 @@ AliasResult BasicAAResult::aliasGEP(
(GCD - ModOffset).uge(V1Size.getValue()))
return AliasResult::NoAlias;
- // If we know all the variables are non-negative, then the total offset is
- // also non-negative and >= DecompGEP1.Offset. We have the following layout:
- // [0, V2Size) ... [TotalOffset, TotalOffer+V1Size]
- // If DecompGEP1.Offset >= V2Size, the accesses don't alias.
- if (AllNonNegative && V2Size.hasValue() &&
- DecompGEP1.Offset.uge(V2Size.getValue()))
- return AliasResult::NoAlias;
- // Similarly, if the variables are non-positive, then the total offset is
- // also non-positive and <= DecompGEP1.Offset. We have the following layout:
- // [TotalOffset, TotalOffset+V1Size) ... [0, V2Size)
- // If -DecompGEP1.Offset >= V1Size, the accesses don't alias.
- if (AllNonPositive && V1Size.hasValue() &&
- (-DecompGEP1.Offset).uge(V1Size.getValue()))
- return AliasResult::NoAlias;
-
if (V1Size.hasValue() && V2Size.hasValue()) {
// Compute ranges of potentially accessed bytes for both accesses. If the
// interseciton is empty, there can be no overlap.
diff --git a/llvm/test/Analysis/BasicAA/assume-index-positive.ll b/llvm/test/Analysis/BasicAA/assume-index-positive.ll
index 451592067f4b..a53fff2c6009 100644
--- a/llvm/test/Analysis/BasicAA/assume-index-positive.ll
+++ b/llvm/test/Analysis/BasicAA/assume-index-positive.ll
@@ -130,12 +130,12 @@ define void @symmetry([0 x i8]* %ptr, i32 %a, i32 %b, i32 %c) {
ret void
}
-; TODO: %ptr.neg and %ptr.shl may alias, as the shl renders the previously
+; %ptr.neg and %ptr.shl may alias, as the shl renders the previously
; non-negative value potentially negative.
define void @shl_of_non_negative(i8* %ptr, i64 %a) {
; CHECK-LABEL: Function: shl_of_non_negative
; CHECK: NoAlias: i8* %ptr.a, i8* %ptr.neg
-; CHECK: NoAlias: i8* %ptr.neg, i8* %ptr.shl
+; CHECK: MayAlias: i8* %ptr.neg, i8* %ptr.shl
%a.cmp = icmp sge i64 %a, 0
call void @llvm.assume(i1 %a.cmp)
%ptr.neg = getelementptr i8, i8* %ptr, i64 -2
</cut>
After llvm commit adf55ac6657693f7bfbe3087b599b4031a765a44
Author: Lang Hames <lhames(a)gmail.com>
[ORC] Call ExecutorProcessControl::disconnect in unit tests that require it.
the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%):
- 400.perlbench:[.] S_find_byclass slowed down by 12% from 644 to 721 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-adf55ac6657693f7bfbe3087b599b4031a765a44
cd investigate-llvm-adf55ac6657693f7bfbe3087b599b4031a765a44
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach adf55ac6657693f7bfbe3087b599b4031a765a44
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach f526ee5b8517b60620cd03bb3e5945ed69d6bfaa
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit adf55ac6657693f7bfbe3087b599b4031a765a44
Author: Lang Hames <lhames(a)gmail.com>
Date: Tue Oct 12 14:55:49 2021 -0700
[ORC] Call ExecutorProcessControl::disconnect in unit tests that require it.
Another follow-up to 2815ed57e3c and 19b4e3cfc6a. For unit tests that don't use
an ExecutionSession we need to call ExecutorProcessControl::disconnect directly
to wait for the dispatcher to shut down.
https://llvm.org/PR52153
---
.../ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp | 2 ++
llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp | 2 ++
2 files changed, 4 insertions(+)
diff --git a/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp b/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp
index f2b157e424b6..a95435aec2a3 100644
--- a/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp
+++ b/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp
@@ -134,6 +134,8 @@ TEST(EPCGenericJITLinkMemoryManagerTest, AllocFinalizeFree) {
auto Err2 = MemMgr->deallocate(std::move(*FA));
EXPECT_THAT_ERROR(std::move(Err2), Succeeded());
+
+ cantFail(SelfEPC->disconnect());
}
} // namespace
diff --git a/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp b/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp
index 78024644ca8b..beb0fefa094a 100644
--- a/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp
+++ b/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp
@@ -93,6 +93,8 @@ TEST(EPCGenericMemoryAccessTest, MemWrites) {
{{pointerToJITTargetAddress(&Test_Buffer), TestMsg}});
EXPECT_THAT_ERROR(std::move(Err5), Succeeded());
EXPECT_EQ(StringRef(Test_Buffer, TestMsg.size()), TestMsg);
+
+ cantFail(SelfEPC->disconnect());
}
} // namespace
</cut>
== This Week ==
* GCC
- Committed a clean up patch to gimple-isel
- PR93183: Committed fix
- PR102376: Patch approved upstream
- PR83750: Patch approved upstream but it regresses one test-case.
== Next Week ==
- Continue with ongoing tasks
After llvm commit bc69dd62c04a70d29943c1c06c7effed150b70e1
Author: Alexey Bataev <a.bataev(a)outlook.com>
[SLP]Improve graph reordering.
the following benchmarks grew in size by more than 1%:
- 444.namd grew in size by 2% from 192302 to 195218 bytes
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-master-aarch64-spec2k6-Os
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
cd investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach bc69dd62c04a70d29943c1c06c7effed150b70e1
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 5661317f864abf750cf893c6a4cc7a977be0995a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit bc69dd62c04a70d29943c1c06c7effed150b70e1
Author: Alexey Bataev <a.bataev(a)outlook.com>
Date: Tue Aug 3 13:20:32 2021 -0700
[SLP]Improve graph reordering.
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.
Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
---
.../llvm/Transforms/Vectorize/SLPVectorizer.h | 3 +-
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 1364 ++++++++++++++------
.../AArch64/transpose-inseltpoison.ll | 84 +-
.../Transforms/SLPVectorizer/AArch64/transpose.ll | 84 +-
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll | 42 +-
.../Transforms/SLPVectorizer/X86/crash_cmpop.ll | 6 +-
llvm/test/Transforms/SLPVectorizer/X86/extract.ll | 6 +-
.../SLPVectorizer/X86/jumbled-load-multiuse.ll | 12 +-
.../Transforms/SLPVectorizer/X86/jumbled-load.ll | 22 +-
.../SLPVectorizer/X86/jumbled_store_crash.ll | 29 +-
.../SLPVectorizer/X86/reorder_repeated_ops.ll | 4 +-
.../SLPVectorizer/X86/split-load8_2-unord.ll | 4 +-
.../X86/vectorize-reorder-alt-shuffle.ll | 9 +-
.../SLPVectorizer/X86/vectorize-reorder-reuse.ll | 52 +-
14 files changed, 1119 insertions(+), 602 deletions(-)
diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
index f416a592d683..5e8c29913cad 100644
--- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
+++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
@@ -95,8 +95,7 @@ private:
/// Try to vectorize a list of operands.
/// \returns true if a value was vectorized.
- bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R,
- bool AllowReorder = false);
+ bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R);
/// Try to vectorize a chain that may start at the operands of \p I.
bool tryToVectorize(Instruction *I, slpvectorizer::BoUpSLP &R);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 9c0029484964..7400b3d8a503 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -21,6 +21,7 @@
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PostOrderIterator.h"
+#include "llvm/ADT/PriorityQueue.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetOperations.h"
#include "llvm/ADT/SetVector.h"
@@ -535,13 +536,68 @@ static bool isSimple(Instruction *I) {
return true;
}
+/// Shuffles \p Mask in accordance with the given \p SubMask.
+static void addMask(SmallVectorImpl<int> &Mask, ArrayRef<int> SubMask) {
+ if (SubMask.empty())
+ return;
+ if (Mask.empty()) {
+ Mask.append(SubMask.begin(), SubMask.end());
+ return;
+ }
+ SmallVector<int> NewMask(SubMask.size(), UndefMaskElem);
+ int TermValue = std::min(Mask.size(), SubMask.size());
+ for (int I = 0, E = SubMask.size(); I < E; ++I) {
+ if (SubMask[I] >= TermValue || SubMask[I] == UndefMaskElem ||
+ Mask[SubMask[I]] >= TermValue)
+ continue;
+ NewMask[I] = Mask[SubMask[I]];
+ }
+ Mask.swap(NewMask);
+}
+
+/// Order may have elements assigned special value (size) which is out of
+/// bounds. Such indices only appear on places which correspond to undef values
+/// (see canReuseExtract for details) and used in order to avoid undef values
+/// have effect on operands ordering.
+/// The first loop below simply finds all unused indices and then the next loop
+/// nest assigns these indices for undef values positions.
+/// As an example below Order has two undef positions and they have assigned
+/// values 3 and 7 respectively:
+/// before: 6 9 5 4 9 2 1 0
+/// after: 6 3 5 4 7 2 1 0
+/// \returns Fixed ordering.
+static void fixupOrderingIndices(SmallVectorImpl<unsigned> &Order) {
+ const unsigned Sz = Order.size();
+ SmallBitVector UsedIndices(Sz);
+ SmallVector<int> MaskedIndices;
+ for (unsigned I = 0; I < Sz; ++I) {
+ if (Order[I] < Sz)
+ UsedIndices.set(Order[I]);
+ else
+ MaskedIndices.push_back(I);
+ }
+ if (MaskedIndices.empty())
+ return;
+ SmallVector<int> AvailableIndices(MaskedIndices.size());
+ unsigned Cnt = 0;
+ int Idx = UsedIndices.find_first();
+ do {
+ AvailableIndices[Cnt] = Idx;
+ Idx = UsedIndices.find_next(Idx);
+ ++Cnt;
+ } while (Idx > 0);
+ assert(Cnt == MaskedIndices.size() && "Non-synced masked/available indices.");
+ for (int I = 0, E = MaskedIndices.size(); I < E; ++I)
+ Order[MaskedIndices[I]] = AvailableIndices[I];
+}
+
namespace llvm {
static void inversePermutation(ArrayRef<unsigned> Indices,
SmallVectorImpl<int> &Mask) {
Mask.clear();
const unsigned E = Indices.size();
- Mask.resize(E, E + 1);
+ Mask.resize(E, UndefMaskElem);
for (unsigned I = 0; I < E; ++I)
Mask[Indices[I]] = I;
}
@@ -581,6 +637,22 @@ static Optional<int> getInsertIndex(Value *InsertInst, unsigned Offset) {
return Index;
}
+/// Reorders the list of scalars in accordance with the given \p Order and then
+/// the \p Mask. \p Order - is the original order of the scalars, need to
+/// reorder scalars into an unordered state at first according to the given
+/// order. Then the ordered scalars are shuffled once again in accordance with
+/// the provided mask.
+static void reorderScalars(SmallVectorImpl<Value *> &Scalars,
+ ArrayRef<int> Mask) {
+ assert(!Mask.empty() && "Expected non-empty mask.");
+ SmallVector<Value *> Prev(Scalars.size(),
+ UndefValue::get(Scalars.front()->getType()));
+ Prev.swap(Scalars);
+ for (unsigned I = 0, E = Prev.size(); I < E; ++I)
+ if (Mask[I] != UndefMaskElem)
+ Scalars[Mask[I]] = Prev[I];
+}
+
namespace slpvectorizer {
/// Bottom Up SLP Vectorizer.
@@ -645,13 +717,12 @@ public:
void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);
- /// Construct a vectorizable tree that starts at \p Roots, ignoring users for
- /// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
- /// into account (and updating it, if required) list of externally used
- /// values stored in \p ExternallyUsedValues.
- void buildTree(ArrayRef<Value *> Roots,
- ExtraValueToDebugLocsMap &ExternallyUsedValues,
- ArrayRef<Value *> UserIgnoreLst = None);
+ /// Builds external uses of the vectorized scalars, i.e. the list of
+ /// vectorized scalars to be extracted, their lanes and their scalar users. \p
+ /// ExternallyUsedValues contains additional list of external uses to handle
+ /// vectorization of reductions.
+ void
+ buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});
/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {
@@ -659,8 +730,6 @@ public:
ScalarToTreeEntry.clear();
MustGather.clear();
ExternalUses.clear();
- NumOpsWantToKeepOrder.clear();
- NumOpsWantToKeepOriginalOrder = 0;
for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();
BS->clear();
@@ -674,103 +743,22 @@ public:
/// Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();
- /// \returns The best order of instructions for vectorization.
- Optional<ArrayRef<unsigned>> bestOrder() const {
- assert(llvm::all_of(
- NumOpsWantToKeepOrder,
- [this](const decltype(NumOpsWantToKeepOrder)::value_type &D) {
- return D.getFirst().size() ==
- VectorizableTree[0]->Scalars.size();
- }) &&
- "All orders must have the same size as number of instructions in "
- "tree node.");
- auto I = std::max_element(
- NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
- [](const decltype(NumOpsWantToKeepOrder)::value_type &D1,
- const decltype(NumOpsWantToKeepOrder)::value_type &D2) {
- return D1.second < D2.second;
- });
- if (I == NumOpsWantToKeepOrder.end() ||
- I->getSecond() <= NumOpsWantToKeepOriginalOrder)
- return None;
-
- return makeArrayRef(I->getFirst());
- }
-
- /// Builds the correct order for root instructions.
- /// If some leaves have the same instructions to be vectorized, we may
- /// incorrectly evaluate the best order for the root node (it is built for the
- /// vector of instructions without repeated instructions and, thus, has less
- /// elements than the root node). This function builds the correct order for
- /// the root node.
- /// For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
- /// are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
- /// leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
- /// be reordered, the best order will be \<1, 0\>. We need to extend this
- /// order for the root node. For the root node this order should look like
- /// \<3, 0, 1, 2\>. This function extends the order for the reused
- /// instructions.
- void findRootOrder(OrdersType &Order) {
- // If the leaf has the same number of instructions to vectorize as the root
- // - order must be set already.
- unsigned RootSize = VectorizableTree[0]->Scalars.size();
- if (Order.size() == RootSize)
- return;
- SmallVector<unsigned, 4> RealOrder(Order.size());
- std::swap(Order, RealOrder);
- SmallVector<int, 4> Mask;
- inversePermutation(RealOrder, Mask);
- Order.assign(Mask.begin(), Mask.end());
- // The leaf has less number of instructions - need to find the true order of
- // the root.
- // Scan the nodes starting from the leaf back to the root.
- const TreeEntry *PNode = VectorizableTree.back().get();
- SmallVector<const TreeEntry *, 4> Nodes(1, PNode);
- SmallPtrSet<const TreeEntry *, 4> Visited;
- while (!Nodes.empty() && Order.size() != RootSize) {
- const TreeEntry *PNode = Nodes.pop_back_val();
- if (!Visited.insert(PNode).second)
- continue;
- const TreeEntry &Node = *PNode;
- for (const EdgeInfo &EI : Node.UserTreeIndices)
- if (EI.UserTE)
- Nodes.push_back(EI.UserTE);
- if (Node.ReuseShuffleIndices.empty())
- continue;
- // Build the order for the parent node.
- OrdersType NewOrder(Node.ReuseShuffleIndices.size(), RootSize);
- SmallVector<unsigned, 4> OrderCounter(Order.size(), 0);
- // The algorithm of the order extension is:
- // 1. Calculate the number of the same instructions for the order.
- // 2. Calculate the index of the new order: total number of instructions
- // with order less than the order of the current instruction + reuse
- // number of the current instruction.
- // 3. The new order is just the index of the instruction in the original
- // vector of the instructions.
- for (unsigned I : Node.ReuseShuffleIndices)
- ++OrderCounter[Order[I]];
- SmallVector<unsigned, 4> CurrentCounter(Order.size(), 0);
- for (unsigned I = 0, E = Node.ReuseShuffleIndices.size(); I < E; ++I) {
- unsigned ReusedIdx = Node.ReuseShuffleIndices[I];
- unsigned OrderIdx = Order[ReusedIdx];
- unsigned NewIdx = 0;
- for (unsigned J = 0; J < OrderIdx; ++J)
- NewIdx += OrderCounter[J];
- NewIdx += CurrentCounter[OrderIdx];
- ++CurrentCounter[OrderIdx];
- assert(NewOrder[NewIdx] == RootSize &&
- "The order index should not be written already.");
- NewOrder[NewIdx] = I;
- }
- std::swap(Order, NewOrder);
- }
- assert(Order.size() == RootSize &&
- "Root node is expected or the size of the order must be the same as "
- "the number of elements in the root node.");
- assert(llvm::all_of(Order,
- [RootSize](unsigned Val) { return Val != RootSize; }) &&
- "All indices must be initialized");
- }
+ /// Reorders the current graph to the most profitable order starting from the
+ /// root node to the leaf nodes. The best order is chosen only from the nodes
+ /// of the same size (vectorization factor). Smaller nodes are considered
+ /// parts of subgraph with smaller VF and they are reordered independently. We
+ /// can make it because we still need to extend smaller nodes to the wider VF
+ /// and we can merge reordering shuffles with the widening shuffles.
+ void reorderTopToBottom();
+
+ /// Reorders the current graph to the most profitable order starting from
+ /// leaves to the root. It allows to rotate small subgraphs and reduce the
+ /// number of reshuffles if the leaf nodes use the same order. In this case we
+ /// can merge the orders and just shuffle user node instead of shuffling its
+ /// operands. Plus, even the leaf nodes have different orders, it allows to
+ /// sink reordering in the graph closer to the root node and merge it later
+ /// during analysis.
+ void reorderBottomToTop();
/// \return The vector element size in bits to use when vectorizing the
/// expression tree ending at \p V. If V is a store, the size is the width of
@@ -793,6 +781,10 @@ public:
return MinVecRegSize;
}
+ unsigned getMinVF(unsigned Sz) const {
+ return std::max(2U, getMinVecRegSize() / Sz);
+ }
+
unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {
unsigned MaxVF = MaxVFOption.getNumOccurrences() ?
MaxVFOption : TTI->getMaximumVF(ElemWidth, Opcode);
@@ -1621,12 +1613,29 @@ private:
/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {
- if (VL.size() == Scalars.size())
- return std::equal(VL.begin(), VL.end(), Scalars.begin());
- return VL.size() == ReuseShuffleIndices.size() &&
- std::equal(
- VL.begin(), VL.end(), ReuseShuffleIndices.begin(),
- [this](Value *V, int Idx) { return V == Scalars[Idx]; });
+ auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) {
+ if (Mask.size() != VL.size() && VL.size() == Scalars.size())
+ return std::equal(VL.begin(), VL.end(), Scalars.begin());
+ return VL.size() == Mask.size() &&
+ std::equal(
+ VL.begin(), VL.end(), Mask.begin(),
+ [Scalars](Value *V, int Idx) { return V == Scalars[Idx]; });
+ };
+ if (!ReorderIndices.empty()) {
+ // TODO: implement matching if the nodes are just reordered, still can
+ // treat the vector as the same if the list of scalars matches VL
+ // directly, without reordering.
+ SmallVector<int> Mask;
+ inversePermutation(ReorderIndices, Mask);
+ if (VL.size() == Scalars.size())
+ return IsSame(Scalars, Mask);
+ if (VL.size() == ReuseShuffleIndices.size()) {
+ ::addMask(Mask, ReuseShuffleIndices);
+ return IsSame(Scalars, Mask);
+ }
+ return false;
+ }
+ return IsSame(Scalars, ReuseShuffleIndices);
}
/// A vector of scalars.
@@ -1701,6 +1710,12 @@ private:
}
}
+ /// Reorders operands of the node to the given mask \p Mask.
+ void reorderOperands(ArrayRef<int> Mask) {
+ for (ValueList &Operand : Operands)
+ reorderScalars(Operand, Mask);
+ }
+
/// \returns the \p OpIdx operand of this TreeEntry.
ValueList &getOperand(unsigned OpIdx) {
assert(OpIdx < Operands.size() && "Off bounds");
@@ -1760,19 +1775,14 @@ private:
return AltOp ? AltOp->getOpcode() : 0;
}
- /// Update operations state of this entry if reorder occurred.
- bool updateStateIfReorder() {
- if (ReorderIndices.empty())
- return false;
- InstructionsState S = getSameOpcode(Scalars, ReorderIndices.front());
- setOperations(S);
- return true;
- }
- /// When ReuseShuffleIndices is empty it just returns position of \p V
- /// within vector of Scalars. Otherwise, try to remap on its reuse index.
+ /// When ReuseReorderShuffleIndices is empty it just returns position of \p
+ /// V within vector of Scalars. Otherwise, try to remap on its reuse index.
int findLaneForValue(Value *V) const {
unsigned FoundLane = std::distance(Scalars.begin(), find(Scalars, V));
assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
+ if (!ReorderIndices.empty())
+ FoundLane = ReorderIndices[FoundLane];
+ assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
if (!ReuseShuffleIndices.empty()) {
FoundLane = std::distance(ReuseShuffleIndices.begin(),
find(ReuseShuffleIndices, FoundLane));
@@ -1856,7 +1866,7 @@ private:
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,
const InstructionsState &S,
const EdgeInfo &UserTreeIdx,
- ArrayRef<unsigned> ReuseShuffleIndices = None,
+ ArrayRef<int> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {
TreeEntry::EntryState EntryState =
Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather;
@@ -1869,7 +1879,7 @@ private:
Optional<ScheduleData *> Bundle,
const InstructionsState &S,
const EdgeInfo &UserTreeIdx,
- ArrayRef<unsigned> ReuseShuffleIndices = None,
+ ArrayRef<int> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {
assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
(Bundle && EntryState != TreeEntry::NeedToGather)) &&
@@ -1877,12 +1887,25 @@ private:
VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));
TreeEntry *Last = VectorizableTree.back().get();
Last->Idx = VectorizableTree.size() - 1;
- Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->State = EntryState;
Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
ReuseShuffleIndices.end());
- Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
- Last->setOperations(S);
+ if (ReorderIndices.empty()) {
+ Last->Scalars.assign(VL.begin(), VL.end());
+ Last->setOperations(S);
+ } else {
+ // Reorder scalars and build final mask.
+ Last->Scalars.assign(VL.size(), nullptr);
+ transform(ReorderIndices, Last->Scalars.begin(),
+ [VL](unsigned Idx) -> Value * {
+ if (Idx >= VL.size())
+ return UndefValue::get(VL.front()->getType());
+ return VL[Idx];
+ });
+ InstructionsState S = getSameOpcode(Last->Scalars);
+ Last->setOperations(S);
+ Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
+ }
if (Last->State != TreeEntry::NeedToGather) {
for (Value *V : VL) {
assert(!getTreeEntry(V) && "Scalar already in tree!");
@@ -2431,14 +2454,6 @@ private:
}
};
- /// Contains orders of operations along with the number of bundles that have
- /// operations in this order. It stores only those orders that require
- /// reordering, if reordering is not required it is counted using \a
- /// NumOpsWantToKeepOriginalOrder.
- DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
- /// Number of bundles that do not require reordering.
- unsigned NumOpsWantToKeepOriginalOrder = 0;
-
// Analysis and block reference.
Function *F;
ScalarEvolution *SE;
@@ -2591,21 +2606,439 @@ void BoUpSLP::eraseInstructions(ArrayRef<Value *> AV) {
};
}
-void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
- ArrayRef<Value *> UserIgnoreLst) {
- ExtraValueToDebugLocsMap ExternallyUsedValues;
- buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);
+/// Reorders the given \p Reuses mask according to the given \p Mask. \p Reuses
+/// contains original mask for the scalars reused in the node. Procedure
+/// transform this mask in accordance with the given \p Mask.
+static void reorderReuses(SmallVectorImpl<int> &Reuses, ArrayRef<int> Mask) {
+ assert(!Mask.empty() && Reuses.size() == Mask.size() &&
+ "Expected non-empty mask.");
+ SmallVector<int> Prev(Reuses.begin(), Reuses.end());
+ Prev.swap(Reuses);
+ for (unsigned I = 0, E = Prev.size(); I < E; ++I)
+ if (Mask[I] != UndefMaskElem)
+ Reuses[Mask[I]] = Prev[I];
}
-void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
- ExtraValueToDebugLocsMap &ExternallyUsedValues,
- ArrayRef<Value *> UserIgnoreLst) {
- deleteTree();
- UserIgnoreList = UserIgnoreLst;
- if (!allSameType(Roots))
+/// Reorders the given \p Order according to the given \p Mask. \p Order - is
+/// the original order of the scalars. Procedure transforms the provided order
+/// in accordance with the given \p Mask. If the resulting \p Order is just an
+/// identity order, \p Order is cleared.
+static void reorderOrder(SmallVectorImpl<unsigned> &Order, ArrayRef<int> Mask) {
+ assert(!Mask.empty() && "Expected non-empty mask.");
+ SmallVector<int> MaskOrder;
+ if (Order.empty()) {
+ MaskOrder.resize(Mask.size());
+ std::iota(MaskOrder.begin(), MaskOrder.end(), 0);
+ } else {
+ inversePermutation(Order, MaskOrder);
+ }
+ reorderReuses(MaskOrder, Mask);
+ if (ShuffleVectorInst::isIdentityMask(MaskOrder)) {
+ Order.clear();
return;
- buildTree_rec(Roots, 0, EdgeInfo());
+ }
+ Order.assign(Mask.size(), Mask.size());
+ for (unsigned I = 0, E = Mask.size(); I < E; ++I)
+ if (MaskOrder[I] != UndefMaskElem)
+ Order[MaskOrder[I]] = I;
+ fixupOrderingIndices(Order);
+}
+
+void BoUpSLP::reorderTopToBottom() {
+ // Maps VF to the graph nodes.
+ DenseMap<unsigned, SmallPtrSet<TreeEntry *, 4>> VFToOrderedEntries;
+ // ExtractElement gather nodes which can be vectorized and need to handle
+ // their ordering.
+ DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
+ // Find all reorderable nodes with the given VF.
+ // Currently the are vectorized loads,extracts + some gathering of extracts.
+ for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders](
+ const std::unique_ptr<TreeEntry> &TE) {
+ // No need to reorder if need to shuffle reuses, still need to shuffle the
+ // node.
+ if (!TE->ReuseShuffleIndices.empty())
+ return;
+ if (TE->State == TreeEntry::Vectorize &&
+ isa<LoadInst, ExtractElementInst, ExtractValueInst, StoreInst,
+ InsertElementInst>(TE->getMainOp()) &&
+ !TE->isAltShuffle()) {
+ VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
+ } else if (TE->State == TreeEntry::NeedToGather &&
+ TE->getOpcode() == Instruction::ExtractElement &&
+ !TE->isAltShuffle() &&
+ isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
+ ->getVectorOperandType()) &&
+ allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
+ // Check that gather of extractelements can be represented as
+ // just a shuffle of a single vector.
+ OrdersType CurrentOrder;
+ bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
+ if (Reuse || !CurrentOrder.empty()) {
+ VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
+ GathersToOrders.try_emplace(TE.get(), CurrentOrder);
+ }
+ }
+ });
+
+ // Reorder the graph nodes according to their vectorization factor.
+ for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1;
+ VF /= 2) {
+ auto It = VFToOrderedEntries.find(VF);
+ if (It == VFToOrderedEntries.end())
+ continue;
+ // Try to find the most profitable order. We just are looking for the most
+ // used order and reorder scalar elements in the nodes according to this
+ // mostly used order.
+ const SmallPtrSetImpl<TreeEntry *> &OrderedEntries = It->getSecond();
+ // All operands are reordered and used only in this node - propagate the
+ // most used order to the user node.
+ DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
+ SmallPtrSet<const TreeEntry *, 4> VisitedOps;
+ for (const TreeEntry *OpTE : OrderedEntries) {
+ // No need to reorder this nodes, still need to extend and to use shuffle,
+ // just need to merge reordering shuffle and the reuse shuffle.
+ if (!OpTE->ReuseShuffleIndices.empty())
+ continue;
+ // Count number of orders uses.
+ const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
+ if (OpTE->State == TreeEntry::NeedToGather)
+ return GathersToOrders.find(OpTE)->second;
+ return OpTE->ReorderIndices;
+ }();
+ // Stores actually store the mask, not the order, need to invert.
+ if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
+ OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
+ SmallVector<int> Mask;
+ inversePermutation(Order, Mask);
+ unsigned E = Order.size();
+ OrdersType CurrentOrder(E, E);
+ transform(Mask, CurrentOrder.begin(), [E](int Idx) {
+ return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
+ });
+ fixupOrderingIndices(CurrentOrder);
+ ++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
+ } else {
+ ++OrdersUses.try_emplace(Order).first->getSecond();
+ }
+ }
+ // Set order of the user node.
+ if (OrdersUses.empty())
+ continue;
+ // Choose the most used order.
+ ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
+ unsigned Cnt = OrdersUses.begin()->second;
+ for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
+ if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) {
+ BestOrder = Pair.first;
+ Cnt = Pair.second;
+ }
+ }
+ // Set order of the user node.
+ if (BestOrder.empty())
+ continue;
+ SmallVector<int> Mask;
+ inversePermutation(BestOrder, Mask);
+ SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
+ unsigned E = BestOrder.size();
+ transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
+ return I < E ? static_cast<int>(I) : UndefMaskElem;
+ });
+ // Do an actual reordering, if profitable.
+ for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
+ // Just do the reordering for the nodes with the given VF.
+ if (TE->Scalars.size() != VF) {
+ if (TE->ReuseShuffleIndices.size() == VF) {
+ // Need to reorder the reuses masks of the operands with smaller VF to
+ // be able to find the match between the graph nodes and scalar
+ // operands of the given node during vectorization/cost estimation.
+ assert(all_of(TE->UserTreeIndices,
+ [VF, &TE](const EdgeInfo &EI) {
+ return EI.UserTE->Scalars.size() == VF ||
+ EI.UserTE->Scalars.size() ==
+ TE->Scalars.size();
+ }) &&
+ "All users must be of VF size.");
+ // Update ordering of the operands with the smaller VF than the given
+ // one.
+ reorderReuses(TE->ReuseShuffleIndices, Mask);
+ }
+ continue;
+ }
+ if (TE->State == TreeEntry::Vectorize &&
+ isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
+ InsertElementInst>(TE->getMainOp()) &&
+ !TE->isAltShuffle()) {
+ // Build correct orders for extract{element,value}, loads and
+ // stores.
+ reorderOrder(TE->ReorderIndices, Mask);
+ if (isa<InsertElementInst, StoreInst>(TE->getMainOp()))
+ TE->reorderOperands(Mask);
+ } else {
+ // Reorder the node and its operands.
+ TE->reorderOperands(Mask);
+ assert(TE->ReorderIndices.empty() &&
+ "Expected empty reorder sequence.");
+ reorderScalars(TE->Scalars, Mask);
+ }
+ if (!TE->ReuseShuffleIndices.empty()) {
+ // Apply reversed order to keep the original ordering of the reused
+ // elements to avoid extra reorder indices shuffling.
+ OrdersType CurrentOrder;
+ reorderOrder(CurrentOrder, MaskOrder);
+ SmallVector<int> NewReuses;
+ inversePermutation(CurrentOrder, NewReuses);
+ addMask(NewReuses, TE->ReuseShuffleIndices);
+ TE->ReuseShuffleIndices.swap(NewReuses);
+ }
+ }
+ }
+}
+
+void BoUpSLP::reorderBottomToTop() {
+ SetVector<TreeEntry *> OrderedEntries;
+ DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
+ // Find all reorderable leaf nodes with the given VF.
+ // Currently the are vectorized loads,extracts without alternate operands +
+ // some gathering of extracts.
+ SmallVector<TreeEntry *> NonVectorized;
+ for_each(VectorizableTree, [this, &OrderedEntries, &GathersToOrders,
+ &NonVectorized](
+ const std::unique_ptr<TreeEntry> &TE) {
+ // No need to reorder if need to shuffle reuses, still need to shuffle the
+ // node.
+ if (!TE->ReuseShuffleIndices.empty())
+ return;
+ if (TE->State == TreeEntry::Vectorize &&
+ isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp()) &&
+ !TE->isAltShuffle()) {
+ OrderedEntries.insert(TE.get());
+ } else if (TE->State == TreeEntry::NeedToGather &&
+ TE->getOpcode() == Instruction::ExtractElement &&
+ !TE->isAltShuffle() &&
+ isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
+ ->getVectorOperandType()) &&
+ allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
+ // Check that gather of extractelements can be represented as
+ // just a shuffle of a single vector with a single user only.
+ OrdersType CurrentOrder;
+ bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
+ if ((Reuse || !CurrentOrder.empty()) &&
+ !any_of(
+ VectorizableTree, [&TE](const std::unique_ptr<TreeEntry> &Entry) {
+ return Entry->State == TreeEntry::NeedToGather &&
+ Entry.get() != TE.get() && Entry->isSame(TE->Scalars);
+ })) {
+ OrderedEntries.insert(TE.get());
+ GathersToOrders.try_emplace(TE.get(), CurrentOrder);
+ }
+ }
+ if (TE->State != TreeEntry::Vectorize)
+ NonVectorized.push_back(TE.get());
+ });
+
+ // Checks if the operands of the users are reordarable and have only single
+ // use.
+ auto &&CheckOperands =
+ [this, &NonVectorized](const auto &Data,
+ SmallVectorImpl<TreeEntry *> &GatherOps) {
+ for (unsigned I = 0, E = Data.first->getNumOperands(); I < E; ++I) {
+ if (any_of(Data.second,
+ [I](const std::pair<unsigned, TreeEntry *> &OpData) {
+ return OpData.first == I &&
+ OpData.second->State == TreeEntry::Vectorize;
+ }))
+ continue;
+ ArrayRef<Value *> VL = Data.first->getOperand(I);
+ const TreeEntry *TE = nullptr;
+ const auto *It = find_if(VL, [this, &TE](Value *V) {
+ TE = getTreeEntry(V);
+ return TE;
+ });
+ if (It != VL.end() && TE->isSame(VL))
+ return false;
+ TreeEntry *Gather = nullptr;
+ if (count_if(NonVectorized, [VL, &Gather](TreeEntry *TE) {
+ assert(TE->State != TreeEntry::Vectorize &&
+ "Only non-vectorized nodes are expected.");
+ if (TE->isSame(VL)) {
+ Gather = TE;
+ return true;
+ }
+ return false;
+ }) > 1)
+ return false;
+ if (Gather)
+ GatherOps.push_back(Gather);
+ }
+ return true;
+ };
+ // 1. Propagate order to the graph nodes, which use only reordered nodes.
+ // I.e., if the node has operands, that are reordered, try to make at least
+ // one operand order in the natural order and reorder others + reorder the
+ // user node itself.
+ SmallPtrSet<const TreeEntry *, 4> Visited;
+ while (!OrderedEntries.empty()) {
+ // 1. Filter out only reordered nodes.
+ // 2. If the entry has multiple uses - skip it and jump to the next node.
+ MapVector<TreeEntry *, SmallVector<std::pair<unsigned, TreeEntry *>>> Users;
+ SmallVector<TreeEntry *> Filtered;
+ for (TreeEntry *TE : OrderedEntries) {
+ if (!(TE->State == TreeEntry::Vectorize ||
+ (TE->State == TreeEntry::NeedToGather &&
+ TE->getOpcode() == Instruction::ExtractElement)) ||
+ TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
+ !all_of(drop_begin(TE->UserTreeIndices),
+ [TE](const EdgeInfo &EI) {
+ return EI.UserTE == TE->UserTreeIndices.front().UserTE;
+ }) ||
+ !Visited.insert(TE).second) {
+ Filtered.push_back(TE);
+ continue;
+ }
+ // Build a map between user nodes and their operands order to speedup
+ // search. The graph currently does not provide this dependency directly.
+ for (EdgeInfo &EI : TE->UserTreeIndices) {
+ TreeEntry *UserTE = EI.UserTE;
+ auto It = Users.find(UserTE);
+ if (It == Users.end())
+ It = Users.insert({UserTE, {}}).first;
+ It->second.emplace_back(EI.EdgeIdx, TE);
+ }
+ }
+ // Erase filtered entries.
+ for_each(Filtered,
+ [&OrderedEntries](TreeEntry *TE) { OrderedEntries.remove(TE); });
+ for (const auto &Data : Users) {
+ // Check that operands are used only in the User node.
+ SmallVector<TreeEntry *> GatherOps;
+ if (!CheckOperands(Data, GatherOps)) {
+ for_each(Data.second,
+ [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
+ OrderedEntries.remove(Op.second);
+ });
+ continue;
+ }
+ // All operands are reordered and used only in this node - propagate the
+ // most used order to the user node.
+ DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
+ SmallPtrSet<const TreeEntry *, 4> VisitedOps;
+ for (const auto &Op : Data.second) {
+ TreeEntry *OpTE = Op.second;
+ if (!OpTE->ReuseShuffleIndices.empty())
+ continue;
+ const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
+ if (OpTE->State == TreeEntry::NeedToGather)
+ return GathersToOrders.find(OpTE)->second;
+ return OpTE->ReorderIndices;
+ }();
+ // Stores actually store the mask, not the order, need to invert.
+ if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
+ OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
+ SmallVector<int> Mask;
+ inversePermutation(Order, Mask);
+ unsigned E = Order.size();
+ OrdersType CurrentOrder(E, E);
+ transform(Mask, CurrentOrder.begin(), [E](int Idx) {
+ return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
+ });
+ fixupOrderingIndices(CurrentOrder);
+ ++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
+ } else {
+ ++OrdersUses.try_emplace(Order).first->getSecond();
+ }
+ if (VisitedOps.insert(OpTE).second)
+ OrdersUses.try_emplace({}, 0).first->getSecond() +=
+ OpTE->UserTreeIndices.size();
+ --OrdersUses[{}];
+ }
+ // If no orders - skip current nodes and jump to the next one, if any.
+ if (OrdersUses.empty()) {
+ for_each(Data.second,
+ [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
+ OrderedEntries.remove(Op.second);
+ });
+ continue;
+ }
+ // Choose the best order.
+ ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
+ unsigned Cnt = OrdersUses.begin()->second;
+ for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
+ if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) {
+ BestOrder = Pair.first;
+ Cnt = Pair.second;
+ }
+ }
+ // Set order of the user node (reordering of operands and user nodes).
+ if (BestOrder.empty()) {
+ for_each(Data.second,
+ [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
+ OrderedEntries.remove(Op.second);
+ });
+ continue;
+ }
+ // Erase operands from OrderedEntries list and adjust their orders.
+ VisitedOps.clear();
+ SmallVector<int> Mask;
+ inversePermutation(BestOrder, Mask);
+ SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
+ unsigned E = BestOrder.size();
+ transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
+ return I < E ? static_cast<int>(I) : UndefMaskElem;
+ });
+ for (const std::pair<unsigned, TreeEntry *> &Op : Data.second) {
+ TreeEntry *TE = Op.second;
+ OrderedEntries.remove(TE);
+ if (!VisitedOps.insert(TE).second)
+ continue;
+ if (!TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty()) {
+ // Just reorder reuses indices.
+ reorderReuses(TE->ReuseShuffleIndices, Mask);
+ continue;
+ }
+ // Gathers are processed separately.
+ if (TE->State != TreeEntry::Vectorize)
+ continue;
+ assert((BestOrder.size() == TE->ReorderIndices.size() ||
+ TE->ReorderIndices.empty()) &&
+ "Non-matching sizes of user/operand entries.");
+ reorderOrder(TE->ReorderIndices, Mask);
+ }
+ // For gathers just need to reorder its scalars.
+ for (TreeEntry *Gather : GatherOps) {
+ if (!Gather->ReuseShuffleIndices.empty())
+ continue;
+ assert(Gather->ReorderIndices.empty() &&
+ "Unexpected reordering of gathers.");
+ reorderScalars(Gather->Scalars, Mask);
+ OrderedEntries.remove(Gather);
+ }
+ // Reorder operands of the user node and set the ordering for the user
+ // node itself.
+ if (Data.first->State != TreeEntry::Vectorize ||
+ !isa<ExtractElementInst, ExtractValueInst, LoadInst>(
+ Data.first->getMainOp()) ||
+ Data.first->isAltShuffle())
+ Data.first->reorderOperands(Mask);
+ if (!isa<InsertElementInst, StoreInst>(Data.first->getMainOp()) ||
+ Data.first->isAltShuffle()) {
+ reorderScalars(Data.first->Scalars, Mask);
+ reorderOrder(Data.first->ReorderIndices, MaskOrder);
+ if (Data.first->ReuseShuffleIndices.empty() &&
+ !Data.first->ReorderIndices.empty() &&
+ !Data.first->isAltShuffle()) {
+ // Insert user node to the list to try to sink reordering deeper in
+ // the graph.
+ OrderedEntries.insert(Data.first);
+ }
+ } else {
+ reorderOrder(Data.first->ReorderIndices, Mask);
+ }
+ }
+ }
+}
+void BoUpSLP::buildExternalUses(
+ const ExtraValueToDebugLocsMap &ExternallyUsedValues) {
// Collect the values that we need to extract from the tree.
for (auto &TEPtr : VectorizableTree) {
TreeEntry *Entry = TEPtr.get();
@@ -2664,6 +3097,80 @@ void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
}
}
+void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
+ ArrayRef<Value *> UserIgnoreLst) {
+ deleteTree();
+ UserIgnoreList = UserIgnoreLst;
+ if (!allSameType(Roots))
+ return;
+ buildTree_rec(Roots, 0, EdgeInfo());
+}
+
+namespace {
+/// Tracks the state we can represent the loads in the given sequence.
+enum class LoadsState { Gather, Vectorize, ScatterVectorize };
+} // anonymous namespace
+
+/// Checks if the given array of loads can be represented as a vectorized,
+/// scatter or just simple gather.
+static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
+ const TargetTransformInfo &TTI,
+ const DataLayout &DL, ScalarEvolution &SE,
+ SmallVectorImpl<unsigned> &Order,
+ SmallVectorImpl<Value *> &PointerOps) {
+ // Check that a vectorized load would load the same memory as a scalar
+ // load. For example, we don't want to vectorize loads that are smaller
+ // than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
+ // treats loading/storing it as an i8 struct. If we vectorize loads/stores
+ // from such a struct, we read/write packed bits disagreeing with the
+ // unvectorized version.
+ Type *ScalarTy = VL0->getType();
+
+ if (DL.getTypeSizeInBits(ScalarTy) != DL.getTypeAllocSizeInBits(ScalarTy))
+ return LoadsState::Gather;
+
+ // Make sure all loads in the bundle are simple - we can't vectorize
+ // atomic or volatile loads.
+ PointerOps.clear();
+ PointerOps.resize(VL.size());
+ auto *POIter = PointerOps.begin();
+ for (Value *V : VL) {
+ auto *L = cast<LoadInst>(V);
+ if (!L->isSimple())
+ return LoadsState::Gather;
+ *POIter = L->getPointerOperand();
+ ++POIter;
+ }
+
+ Order.clear();
+ // Check the order of pointer operands.
+ if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, Order)) {
+ Value *Ptr0;
+ Value *PtrN;
+ if (Order.empty()) {
+ Ptr0 = PointerOps.front();
+ PtrN = PointerOps.back();
+ } else {
+ Ptr0 = PointerOps[Order.front()];
+ PtrN = PointerOps[Order.back()];
+ }
+ Optional<int> Diff =
+ getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
+ // Check that the sorted loads are consecutive.
+ if (static_cast<unsigned>(*Diff) == VL.size() - 1)
+ return LoadsState::Vectorize;
+ Align CommonAlignment = cast<LoadInst>(VL0)->getAlign();
</cut>
And the rest of the week I flushed my maintainer queues ;-)
Other
=====
[update-ticket] <file:~/org/team.org::update-ticket>
Update [update-ticket] to work with cloud JIRA
Completed Reviews [8/8]
=======================
[PATCH 0/7] tests: docker images for hexagon, nios2, microblaze
Message-Id: <20211014224435.2539547-1-richard.henderson(a)linaro.org>
[PATCH] gdbstub: Switch to the thread receiving a signal
Message-Id: <20210930095111.23205-1-pavel(a)labath.sk>
[PATCH] replay: improve determinism of virtio-net
Message-Id: <162125666020.1252655.9997723318921206001.stgit@pasha-ThinkPad-X280>
[PATCH RESEND v3 0/2] add APIs to handle alternative sNaN propagation for fmax/fmin
Message-Id: <20211015065500.3850513-1-frank.chang(a)sifive.com>
[PATCH v3 0/5] plugins/cache: multicore cache modelling and minor tweaks
Message-Id: <20210722065428.134608-1-ma.mandourr(a)gmail.com>
[PATCH v2 0/2] plugins: add a drcov plugin
Message-Id: <163429165642.439576.16356288759891202632.stgit@pc-System-Product-Name>
[PATCH v2 0/2] plugins: add a drcov plugin
Message-Id: <163429165642.439576.16356288759891202632.stgit@pc-System-Product-Name>
[PATCH 0/3] KVM: qemu patches for few KVM features I developed
Message-Id: <20210914155214.105415-1-mlevitsk(a)redhat.com>
Absences
========
- Off Friday next week
Current Review Queue
====================
TODO [PATCH v2 00/48] tcg: optimize redundant sign extensions
Message-Id: <20211007195456.1168070-1-richard.henderson(a)linaro.org>
================================================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================
--
Alex Bennée
OK I've fixed up my JIRA and email tooling so this is a bit of a flush
of stale data from my org-mode.
VirtIO Initiative ([STR-9])
===========================
- posted Enabling hypervisor agnosticism for VirtIO backends
Message-Id: <87v94ldrqq.fsf(a)linaro.org>
- posted [a PR to do some cleanups to vm-virtio]
[STR-9] <https://projects.linaro.org/browse/STR-9>
[a PR to do some cleanups to vm-virtio]
<https://github.com/rust-vmm/vm-virtio/pull/103>
VirtIO RPMB ([STR-5])
=====================
- made more progress and now have PROGRAM_KEY/WRITE_COUNTER done -
feels like it's getting faster
[STR-5] <https://projects.linaro.org/browse/STR-5>
[Rust version of virtio-rpmb] <https://github.com/stsquad/virtio-rpmb>
[fixes for the C daemon]
<https://github.com/ruchi393/qemu/tree/vhost-user-rpmb-fixes>
[hacking branch] <https://github.com/stsquad/virtio-rpmb/tree/hacking>
Fix VirtIO spec as per Rucha's email
QEMU Upstream Work ([UM-2])
===========================
- posted [PATCH for 6.1-rc3 v1 0/4] gitlab and plugins pre-PR
Message-Id: <20210806141015.2487502-1-alex.bennee(a)linaro.org>
- prepared a potential [pull request for testing issues] but looks
like it will wait for 6.2
[UM-2] <https://projects.linaro.org/browse/UM-2>
[this is the last iteration before Monday]
<https://patchew.org/QEMU/20210709143005.1554-1-alex.bennee@linaro.org/>
[pull request for testing issues]
<https://github.com/stsquad/qemu/tree/pr/120821-for-6.1-rc4-1>
Completed Reviews [4/4]
=======================
[RFC PATCH 0/1] QEMU TCG plugin interface extensions
Message-Id: <20210821094527.491232-1-florian.hauschild(a)fs.ei.tum.de>
[PATCH 0/8] tcg: support 32-bit guest addresses as signed
Message-Id: <20211010174401.141339-1-richard.henderson(a)linaro.org>
[PATCH 0/3] KVM: qemu patches for few KVM features I developed
Message-Id: <20210914155214.105415-1-mlevitsk(a)redhat.com>
[PATCH 0/6] More record/replay acceptance tests
Message-Id: <162332427732.194926.7555369160312506539.stgit@pasha-ThinkPad-X280>
[PATCH 0/3] Gitlab-CI improvements
Message-Id: <20210730143809.717079-1-thuth(a)redhat.com>
========================================================================================
[PATCH v3 00/13] new plugin argument passing scheme
Message-Id: <20210722071236.139520-1-ma.mandourr(a)gmail.com>
==============================================================================================================
[PATCH] contrib/plugins: add a drcov plugin
Message-Id: <20211011111130.170178-1-arkaisp2021(a)gmail.com>
======================================================================================================
[RFC PATCH v2] Add a post for the new TCG cache modelling plugin
Message-Id: <20210617121707.764126-1-ma.mandourr(a)gmail.com>
===========================================================================================================================
Current Review Queue
====================
TODO [PATCH v2 00/48] tcg: optimize redundant sign extensions
Message-Id: <20211007195456.1168070-1-richard.henderson(a)linaro.org>
================================================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/48]
Message-Id: <20211013024607.731881-1-richard.henderson(a)linaro.org>
=======================================================================================
--
Alex Bennée
After llvm commit 75127bce6de78b83b70b898a04473f213451f13e
Author: Qiongsi Wu <qwu(a)ibm.com>
[AIX][ZOS] Excluding merge-objc-interface.m from Tests
the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%):
- 433.milc:[.] mult_su3_mat_vec slowed down by 16% from 1615 to 1871 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-75127bce6de78b83b70b898a04473f213451f13e
cd investigate-llvm-75127bce6de78b83b70b898a04473f213451f13e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 75127bce6de78b83b70b898a04473f213451f13e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d01ae990e1fd6561ed86dc8004a7147dd09fb13c
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 75127bce6de78b83b70b898a04473f213451f13e
Author: Qiongsi Wu <qwu(a)ibm.com>
Date: Fri Oct 8 13:58:32 2021 +0000
[AIX][ZOS] Excluding merge-objc-interface.m from Tests
Objective C is not supported on AIX or ZOS. This patch excludes the newly added `clang/test/Modules/merge-objc-interface.m` (added by https://reviews.llvm.org/D110280) from AIX and ZOS testing.
Many existing tests are already disabled by https://reviews.llvm.org/D109060.
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D111406
---
clang/test/Modules/merge-objc-interface.m | 1 +
1 file changed, 1 insertion(+)
diff --git a/clang/test/Modules/merge-objc-interface.m b/clang/test/Modules/merge-objc-interface.m
index fba06294a26a..f62f541c1a29 100644
--- a/clang/test/Modules/merge-objc-interface.m
+++ b/clang/test/Modules/merge-objc-interface.m
@@ -1,3 +1,4 @@
+// UNSUPPORTED: -zos, -aix
// RUN: rm -rf %t
// RUN: split-file %s %t
// RUN: %clang_cc1 -emit-llvm -o %t/test.bc -F%t/Frameworks %t/test.m \
</cut>
After llvm commit 483db1c706864d0940206228dfe64bdcd17faa4e
Author: Muhammad Omair Javaid <omair.javaid(a)linaro.org>
[LLDB] Remove xfail decorator TestInferiorAssert.py AArch64/Linux
the following benchmarks slowed down by more than 2%:
- 433.milc slowed down by 4% from 13309 to 13838 perf samples
- 433.milc:[.] mult_su3_mat_vec slowed down by 17% from 2058 to 2409 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-483db1c706864d0940206228dfe64bdcd17faa4e
cd investigate-llvm-483db1c706864d0940206228dfe64bdcd17faa4e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 483db1c706864d0940206228dfe64bdcd17faa4e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d11ec6f67e45c630ab87bfb6010dcc93e89542fc
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 483db1c706864d0940206228dfe64bdcd17faa4e
Author: Muhammad Omair Javaid <omair.javaid(a)linaro.org>
Date: Mon Oct 11 14:34:41 2021 +0500
[LLDB] Remove xfail decorator TestInferiorAssert.py AArch64/Linux
TestInferiorAssert.py test_inferior_asserting_disassemble passes after
upgrading LLDB AArch64/Linux buildbot to Ubuntu Focal.
---
lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py b/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
index c533a1e29a12..5ac4eeb0514a 100644
--- a/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
+++ b/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
@@ -45,9 +45,7 @@ class AssertingInferiorTestCase(TestBase):
bugnumber="llvm.org/pr21793: need to implement support for detecting assertion / abort on Windows")
@expectedFailureAll(
oslist=["linux"],
- archs=[
- "aarch64",
- "arm"],
+ archs=["arm"],
triple=no_match(".*-android"),
bugnumber="llvm.org/pr25338")
@expectedFailureAll(bugnumber="llvm.org/pr26592", triple='^mips')
</cut>
After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Avoid invalid loop transformations in jump threading registry.
the following benchmarks slowed down by more than 2%:
- 471.omnetpp slowed down by 8% from 6348 to 6828 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: arm-linux-gnueabihf
- Compiler flags: -O3 -marm
- Hardware: NVidia TK1 4x Cortex-A15
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Reproduce builds:
<cut>
mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Date: Thu Sep 23 10:59:24 2021 +0200
Avoid invalid loop transformations in jump threading registry.
My upcoming improvements to the forward jump threader make it thread
more aggressively. In investigating some "regressions", I noticed
that it has always allowed threading through empty latches and across
loop boundaries. As we have discussed recently, this should be avoided
until after loop optimizations have run their course.
Note that this wasn't much of a problem before because DOM/VRP
couldn't find these opportunities, but with a smarter solver, we trip
over them more easily.
Because the forward threader doesn't have an independent localized cost
model like the new threader (profitable_path_p), it is difficult to
catch these things at discovery. However, we can catch them at
registration time, with the added benefit that all the threaders
(forward and backward) can share the handcuffs.
This patch is an adaptation of what we do in the backward threader, but
it is not meant to catch everything we do there, as some of the
restrictions there are due to limitations of the different block
copiers (for example, the generic copier does not re-use existing
threading paths).
We could ideally remove the now redundant bits in profitable_path_p, but
I would prefer not to for two reasons. First, the backward threader uses
profitable_path_p as it discovers paths to avoid discovering paths in
unprofitable directions. Second, I would like to merge all the forward
cost restrictions into the profitability class in the backward threader,
not the other way around. Alas, that reshuffling will have to wait for
the next release.
As usual, there are quite a few tests that needed adjustments. It seems
we were quite happily threading improper scenarios. With most of them,
as can be seen in pr77445-2.c, we're merely shifting the threading to
after loop optimizations.
Tested on x86-64 Linux.
gcc/ChangeLog:
* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
New.
(jt_path_registry::register_jump_thread): Call
cancel_invalid_paths.
* tree-ssa-threadupdate.h (class jt_path_registry): Add
cancel_invalid_paths.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/20030714-2.c: Adjust.
* gcc.dg/tree-ssa/pr66752-3.c: Adjust.
* gcc.dg/tree-ssa/pr77445-2.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
* gcc.dg/vect/bb-slp-16.c: Adjust.
---
gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c | 7 ++-
gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c | 19 ++++---
gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 4 +-
gcc/testsuite/gcc.dg/vect/bb-slp-16.c | 7 ---
gcc/tree-ssa-threadupdate.c | 67 ++++++++++++++++++-----
gcc/tree-ssa-threadupdate.h | 1 +
8 files changed, 78 insertions(+), 35 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
index eb663f2ff5b..9585ff11307 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
@@ -32,7 +32,8 @@ get_alias_set (t)
}
}
-/* There should be exactly three IF conditionals if we thread jumps
- properly. */
-/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */
+/* There should be exactly 4 IF conditionals if we thread jumps
+ properly. There used to be 3, but one thread was crossing
+ loops. */
+/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index e1464e21170..922a331b217 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */
extern int status, pt;
extern int count;
@@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a)
pt--;
}
-/* There are 4 jump threading opportunities, all of which will be
- realized, which will eliminate testing of FLAG, completely. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */
+/* There are 2 jump threading opportunities (which don't cross loops),
+ all of which will be realized, which will eliminate testing of
+ FLAG, completely. */
+/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */
-/* There should be no assignments or references to FLAG, verify they're
- eliminated as early as possible. */
-/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */
+/* We used to remove references to FLAG by DCE2, but this was
+ depending on early threaders threading through loop boundaries
+ (which we shouldn't do). However, the late threading passes, which
+ run after loop optimizations , can successfully eliminate the
+ references to FLAG. Verify that ther are no references by the late
+ threading passes. */
+/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index f9fc212f49e..01a0f1f197d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) {
aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
to change decisions in switch expansion which in turn can expose new
jump threading opportunities. Skip the later tests on aarch64. */
-/* { dg-final { scan-tree-dump "Jumps threaded: 1\[1-9\]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
index 60d4f76f076..2d78d045516 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
@@ -21,5 +21,7 @@
condition.
All the cases are picked up by VRP1 as jump threads. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
+
+/* There used to be 6 jump threads found by thread1, but they all
+ depended on threading through distinct loops in ethread. */
/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index e3d4b311c03..16abcde5053 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,8 @@
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */
/* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index 664e93e9b60..e68a9b62535 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -1,8 +1,5 @@
/* { dg-require-effective-target vect_int } */
-/* See note below as to why we disable threading. */
-/* { dg-additional-options "-fdisable-tree-thread1" } */
-
#include <stdarg.h>
#include "tree-vect.h"
@@ -30,10 +27,6 @@ main1 (int dummy)
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
- /* In some architectures like ppc64, jump threading may thread
- the iteration where i==0 such that we no longer optimize the
- BB. Another alternative to disable jump threading would be
- to wrap the read from `i' into a function returning i. */
if (arr[i] = i)
a = i;
else
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index baac11280fa..2b9b8f81274 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool may_peel_loop_headers)
return retval;
}
+bool
+jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
+{
+ gcc_checking_assert (!path.is_empty ());
+ edge taken_edge = path[path.length () - 1]->e;
+ loop_p loop = taken_edge->src->loop_father;
+ bool seen_latch = false;
+ bool path_crosses_loops = false;
+
+ for (unsigned int i = 0; i < path.length (); i++)
+ {
+ edge e = path[i]->e;
+
+ if (e == NULL)
+ {
+ // NULL outgoing edges on a path can happen for jumping to a
+ // constant address.
+ cancel_thread (&path, "Found NULL edge in jump threading path");
+ return true;
+ }
+
+ if (loop->latch == e->src || loop->latch == e->dest)
+ seen_latch = true;
+
+ // The first entry represents the block with an outgoing edge
+ // that we will redirect to the jump threading path. Thus we
+ // don't care about that block's loop father.
+ if ((i > 0 && e->src->loop_father != loop)
+ || e->dest->loop_father != loop)
+ path_crosses_loops = true;
+
+ if (flag_checking && !m_backedge_threads)
+ gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
+ }
+
+ if (cfun->curr_properties & PROP_loop_opts_done)
+ return false;
+
+ if (seen_latch && empty_block_p (loop->latch))
+ {
+ cancel_thread (&path, "Threading through latch before loop opts "
+ "would create non-empty latch");
+ return true;
+ }
+ if (path_crosses_loops)
+ {
+ cancel_thread (&path, "Path crosses loops");
+ return true;
+ }
+ return false;
+}
+
/* Register a jump threading opportunity. We queue up all the jump
threading opportunities discovered by a pass and update the CFG
and SSA form all at once.
@@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread (vec<jump_thread_edge *> *path)
return false;
}
- /* First make sure there are no NULL outgoing edges on the jump threading
- path. That can happen for jumping to a constant address. */
- for (unsigned int i = 0; i < path->length (); i++)
- {
- if ((*path)[i]->e == NULL)
- {
- cancel_thread (path, "Found NULL edge in jump threading path");
- return false;
- }
-
- if (flag_checking && !m_backedge_threads)
- gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0);
- }
+ if (cancel_invalid_paths (*path))
+ return false;
if (dump_file && (dump_flags & TDF_DETAILS))
dump_jump_thread_path (dump_file, *path, true);
diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
index 8b48a671212..d68795c9f27 100644
--- a/gcc/tree-ssa-threadupdate.h
+++ b/gcc/tree-ssa-threadupdate.h
@@ -75,6 +75,7 @@ protected:
unsigned long m_num_threaded_edges;
private:
virtual bool update_cfg (bool peel_loop_headers) = 0;
+ bool cancel_invalid_paths (vec<jump_thread_edge *> &path);
jump_thread_path_allocator m_allocator;
// True if threading through back edges is allowed. This is only
// allowed in the generic copier in the backward threader.
</cut>
[TCWG CI] Regression caused by gcc: tree-optimization/102570 - teach VN about internal functions:
commit 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
Author: Richard Biener <rguenther(a)suse.de>
tree-optimization/102570 - teach VN about internal functions
Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
18360
# First few build errors in logs:
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 make[2]: *** [scripts/Makefile.build:288: arch/arm64/hyperv/hv_core.o] Error 1
# 00:01:22 crypto/asymmetric_keys/asymmetric_type.c:481:15: error: ‘restrict_method’ is used uninitialized [-Werror=uninitialized]
# 00:01:22 make[2]: *** [scripts/Makefile.build:288: crypto/asymmetric_keys/asymmetric_type.o] Error 1
# 00:01:22 ./include/trace/perf.h:38:25: error: ‘entry’ is used uninitialized [-Werror=uninitialized]
# 00:01:22 ./include/trace/perf.h:44:13: error: ‘__entry_size’ is used uninitialized [-Werror=uninitialized]
# 00:01:23 security/keys/encrypted-keys/encrypted.c:660:19: error: ‘mkey’ is used uninitialized [-Werror=uninitialized]
# 00:01:23 security/keys/encrypted-keys/encrypted.c:905:19: error: ‘epayload’ is used uninitialized [-Werror=uninitialized]
from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
21404
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_kernel/gnu-master-aarch64-next-allmodconfig
First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Reproduce builds:
<cut>
mkdir investigate-gcc-55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
cd investigate-gcc-55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 22d34a2a50651d01669b6fbcdb9677c18d2197c5
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
Author: Richard Biener <rguenther(a)suse.de>
Date: Mon Oct 4 10:57:45 2021 +0200
tree-optimization/102570 - teach VN about internal functions
We're now using internal functions for a lot of stuff but there's
still missing VN support out of laziness. The following instantiates
support and adds testcases for FRE and PRE (hoisting).
2021-10-04 Richard Biener <rguenther(a)suse.de>
PR tree-optimization/102570
* tree-ssa-sccvn.h (vn_reference_op_struct): Document
we are using clique for the internal function code.
* tree-ssa-sccvn.c (vn_reference_op_eq): Compare the
internal function code.
(print_vn_reference_ops): Print the internal function code.
(vn_reference_op_compute_hash): Hash it.
(copy_reference_ops_from_call): Record it.
(visit_stmt): Remove the restriction around internal function
calls.
(fully_constant_vn_reference_p): Use fold_const_call and handle
internal functions.
(vn_reference_eq): Compare call return types.
* tree-ssa-pre.c (create_expression_by_pieces): Handle
generating calls to internal functions.
(compute_avail): Remove the restriction around internal function
calls.
* gcc.dg/tree-ssa/ssa-fre-96.c: New testcase.
* gcc.dg/tree-ssa/ssa-pre-33.c: Likewise.
---
gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c | 14 +++++
gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c | 15 +++++
gcc/tree-ssa-pre.c | 27 +++++----
gcc/tree-ssa-sccvn.c | 91 ++++++++++++++++++------------
gcc/tree-ssa-sccvn.h | 3 +-
5 files changed, 103 insertions(+), 47 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c
new file mode 100644
index 00000000000..fd1d5713b5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-fre1" } */
+
+_Bool f1(unsigned x, unsigned y, unsigned *res)
+{
+ _Bool t = __builtin_add_overflow(x, y, res);
+ unsigned res1;
+ _Bool t1 = __builtin_add_overflow(x, y, &res1);
+ *res -= res1;
+ return t==t1;
+}
+
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 1 "fre1" } } */
+/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c
new file mode 100644
index 00000000000..3b3bd629bc2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+_Bool f1(unsigned x, unsigned y, unsigned *res, int flag, _Bool *t)
+{
+ if (flag)
+ *t = __builtin_add_overflow(x, y, res);
+ unsigned res1;
+ _Bool t1 = __builtin_add_overflow(x, y, &res1);
+ *res -= res1;
+ return *t==t1;
+}
+
+/* We should hoist the .ADD_OVERFLOW to before the check. */
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 1 "pre" } } */
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 08755847f66..1cc1aae694f 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -2855,9 +2855,13 @@ create_expression_by_pieces (basic_block block, pre_expr expr,
unsigned int operand = 1;
vn_reference_op_t currop = &ref->operands[0];
tree sc = NULL_TREE;
- tree fn = find_or_generate_expression (block, currop->op0, stmts);
- if (!fn)
- return NULL_TREE;
+ tree fn = NULL_TREE;
+ if (currop->op0)
+ {
+ fn = find_or_generate_expression (block, currop->op0, stmts);
+ if (!fn)
+ return NULL_TREE;
+ }
if (currop->op1)
{
sc = find_or_generate_expression (block, currop->op1, stmts);
@@ -2873,12 +2877,19 @@ create_expression_by_pieces (basic_block block, pre_expr expr,
return NULL_TREE;
args.quick_push (arg);
}
- gcall *call = gimple_build_call_vec (fn, args);
+ gcall *call;
+ if (currop->op0)
+ {
+ call = gimple_build_call_vec (fn, args);
+ gimple_call_set_fntype (call, currop->type);
+ }
+ else
+ call = gimple_build_call_internal_vec ((internal_fn)currop->clique,
+ args);
gimple_set_location (call, expr->loc);
- gimple_call_set_fntype (call, currop->type);
if (sc)
gimple_call_set_chain (call, sc);
- tree forcedname = make_ssa_name (TREE_TYPE (currop->type));
+ tree forcedname = make_ssa_name (ref->type);
gimple_call_set_lhs (call, forcedname);
/* There's no CCP pass after PRE which would re-compute alignment
information so make sure we re-materialize this here. */
@@ -4004,10 +4015,6 @@ compute_avail (function *fun)
vn_reference_s ref1;
pre_expr result = NULL;
- /* We can value number only calls to real functions. */
- if (gimple_call_internal_p (stmt))
- continue;
-
vn_reference_lookup_call (as_a <gcall *> (stmt), &ref, &ref1);
/* There is no point to PRE a call without a value. */
if (!ref || !ref->result)
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 416a5252144..0d942218279 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -70,6 +70,7 @@ along with GCC; see the file COPYING3. If not see
#include "tree-scalar-evolution.h"
#include "tree-ssa-loop-niter.h"
#include "builtins.h"
+#include "fold-const-call.h"
#include "tree-ssa-sccvn.h"
/* This algorithm is based on the SCC algorithm presented by Keith
@@ -212,7 +213,8 @@ vn_reference_op_eq (const void *p1, const void *p2)
TYPE_MAIN_VARIANT (vro2->type))))
&& expressions_equal_p (vro1->op0, vro2->op0)
&& expressions_equal_p (vro1->op1, vro2->op1)
- && expressions_equal_p (vro1->op2, vro2->op2));
+ && expressions_equal_p (vro1->op2, vro2->op2)
+ && (vro1->opcode != CALL_EXPR || vro1->clique == vro2->clique));
}
/* Free a reference operation structure VP. */
@@ -264,15 +266,18 @@ print_vn_reference_ops (FILE *outfile, const vec<vn_reference_op_s> ops)
&& TREE_CODE_CLASS (vro->opcode) != tcc_declaration)
{
fprintf (outfile, "%s", get_tree_code_name (vro->opcode));
- if (vro->op0)
+ if (vro->op0 || vro->opcode == CALL_EXPR)
{
fprintf (outfile, "<");
closebrace = true;
}
}
- if (vro->op0)
+ if (vro->op0 || vro->opcode == CALL_EXPR)
{
- print_generic_expr (outfile, vro->op0);
+ if (!vro->op0)
+ fprintf (outfile, internal_fn_name ((internal_fn)vro->clique));
+ else
+ print_generic_expr (outfile, vro->op0);
if (vro->op1)
{
fprintf (outfile, ",");
@@ -684,6 +689,8 @@ static void
vn_reference_op_compute_hash (const vn_reference_op_t vro1, inchash::hash &hstate)
{
hstate.add_int (vro1->opcode);
+ if (vro1->opcode == CALL_EXPR && !vro1->op0)
+ hstate.add_int (vro1->clique);
if (vro1->op0)
inchash::add_expr (vro1->op0, hstate);
if (vro1->op1)
@@ -769,11 +776,16 @@ vn_reference_eq (const_vn_reference_t const vr1, const_vn_reference_t const vr2)
if (vr1->type != vr2->type)
return false;
}
+ else if (vr1->type == vr2->type)
+ ;
else if (COMPLETE_TYPE_P (vr1->type) != COMPLETE_TYPE_P (vr2->type)
|| (COMPLETE_TYPE_P (vr1->type)
&& !expressions_equal_p (TYPE_SIZE (vr1->type),
TYPE_SIZE (vr2->type))))
return false;
+ else if (vr1->operands[0].opcode == CALL_EXPR
+ && !types_compatible_p (vr1->type, vr2->type))
+ return false;
else if (INTEGRAL_TYPE_P (vr1->type)
&& INTEGRAL_TYPE_P (vr2->type))
{
@@ -1270,6 +1282,8 @@ copy_reference_ops_from_call (gcall *call,
temp.type = gimple_call_fntype (call);
temp.opcode = CALL_EXPR;
temp.op0 = gimple_call_fn (call);
+ if (gimple_call_internal_p (call))
+ temp.clique = gimple_call_internal_fn (call);
temp.op1 = gimple_call_chain (call);
if (stmt_could_throw_p (cfun, call) && (lr = lookup_stmt_eh_lp (call)) > 0)
temp.op2 = size_int (lr);
@@ -1459,9 +1473,11 @@ fully_constant_vn_reference_p (vn_reference_t ref)
a call to a builtin function with at most two arguments. */
op = &operands[0];
if (op->opcode == CALL_EXPR
- && TREE_CODE (op->op0) == ADDR_EXPR
- && TREE_CODE (TREE_OPERAND (op->op0, 0)) == FUNCTION_DECL
- && fndecl_built_in_p (TREE_OPERAND (op->op0, 0))
+ && (!op->op0
+ || (TREE_CODE (op->op0) == ADDR_EXPR
+ && TREE_CODE (TREE_OPERAND (op->op0, 0)) == FUNCTION_DECL
+ && fndecl_built_in_p (TREE_OPERAND (op->op0, 0),
+ BUILT_IN_NORMAL)))
&& operands.length () >= 2
&& operands.length () <= 3)
{
@@ -1481,13 +1497,17 @@ fully_constant_vn_reference_p (vn_reference_t ref)
anyconst = true;
if (anyconst)
{
- tree folded = build_call_expr (TREE_OPERAND (op->op0, 0),
- arg1 ? 2 : 1,
- arg0->op0,
- arg1 ? arg1->op0 : NULL);
- if (folded
- && TREE_CODE (folded) == NOP_EXPR)
- folded = TREE_OPERAND (folded, 0);
+ combined_fn fn;
+ if (op->op0)
+ fn = as_combined_fn (DECL_FUNCTION_CODE
+ (TREE_OPERAND (op->op0, 0)));
+ else
+ fn = as_combined_fn ((internal_fn) op->clique);
+ tree folded;
+ if (arg1)
+ folded = fold_const_call (fn, ref->type, arg0->op0, arg1->op0);
+ else
+ folded = fold_const_call (fn, ref->type, arg0->op0);
if (folded
&& is_gimple_min_invariant (folded))
return folded;
@@ -5648,28 +5668,27 @@ visit_stmt (gimple *stmt, bool backedges_varying_p = false)
&& TREE_CODE (TREE_OPERAND (fn, 0)) == FUNCTION_DECL)
extra_fnflags = flags_from_decl_or_type (TREE_OPERAND (fn, 0));
}
- if (!gimple_call_internal_p (call_stmt)
- && (/* Calls to the same function with the same vuse
- and the same operands do not necessarily return the same
- value, unless they're pure or const. */
- ((gimple_call_flags (call_stmt) | extra_fnflags)
- & (ECF_PURE | ECF_CONST))
- /* If calls have a vdef, subsequent calls won't have
- the same incoming vuse. So, if 2 calls with vdef have the
- same vuse, we know they're not subsequent.
- We can value number 2 calls to the same function with the
- same vuse and the same operands which are not subsequent
- the same, because there is no code in the program that can
- compare the 2 values... */
- || (gimple_vdef (call_stmt)
- /* ... unless the call returns a pointer which does
- not alias with anything else. In which case the
- information that the values are distinct are encoded
- in the IL. */
- && !(gimple_call_return_flags (call_stmt) & ERF_NOALIAS)
- /* Only perform the following when being called from PRE
- which embeds tail merging. */
- && default_vn_walk_kind == VN_WALK)))
+ if (/* Calls to the same function with the same vuse
+ and the same operands do not necessarily return the same
+ value, unless they're pure or const. */
+ ((gimple_call_flags (call_stmt) | extra_fnflags)
+ & (ECF_PURE | ECF_CONST))
+ /* If calls have a vdef, subsequent calls won't have
+ the same incoming vuse. So, if 2 calls with vdef have the
+ same vuse, we know they're not subsequent.
+ We can value number 2 calls to the same function with the
+ same vuse and the same operands which are not subsequent
+ the same, because there is no code in the program that can
+ compare the 2 values... */
+ || (gimple_vdef (call_stmt)
+ /* ... unless the call returns a pointer which does
+ not alias with anything else. In which case the
+ information that the values are distinct are encoded
+ in the IL. */
+ && !(gimple_call_return_flags (call_stmt) & ERF_NOALIAS)
+ /* Only perform the following when being called from PRE
+ which embeds tail merging. */
+ && default_vn_walk_kind == VN_WALK))
changed = visit_reference_op_call (lhs, call_stmt);
else
changed = defs_to_varying (call_stmt);
diff --git a/gcc/tree-ssa-sccvn.h b/gcc/tree-ssa-sccvn.h
index 96100596d2e..8a1b649c726 100644
--- a/gcc/tree-ssa-sccvn.h
+++ b/gcc/tree-ssa-sccvn.h
@@ -106,7 +106,8 @@ typedef const struct vn_phi_s *const_vn_phi_t;
typedef struct vn_reference_op_struct
{
ENUM_BITFIELD(tree_code) opcode : 16;
- /* Dependence info, used for [TARGET_]MEM_REF only. */
+ /* Dependence info, used for [TARGET_]MEM_REF only. For internal
+ function calls clique is also used for the internal function code. */
unsigned short clique;
unsigned short base;
unsigned reverse : 1;
</cut>
After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
aarch64: Improve size heuristic for cpymem expansion
the following benchmarks grew in size by more than 1%:
- 458.sjeng grew in size by 4% from 105780 to 109944 bytes
- 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
Date: Wed Sep 29 11:21:45 2021 +0100
aarch64: Improve size heuristic for cpymem expansion
Similar to my previous patch for setmem this one does the same for the cpymem expansion.
We count the number of ops emitted and compare it against the alternative of just calling
the library function when optimising for size.
For the code:
void
cpy_127 (char *out, char *in)
{
__builtin_memcpy (out, in, 127);
}
void
cpy_128 (char *out, char *in)
{
__builtin_memcpy (out, in, 128);
}
we now emit a call to memcpy (with an extra MOV-immediate instruction for the size) instead of:
cpy_127(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldr q0, [x1, 96]
str q0, [x0, 96]
ldr q0, [x1, 111]
str q0, [x0, 111]
ret
cpy_128(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldp q0, q1, [x1, 96]
stp q0, q1, [x0, 96]
ret
which is a clear code size win. Speed optimisation heuristics remain unchanged.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
* config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of
emitted operations and adjust heuristic for code size.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
* gcc.target/aarch64/cpymem-size.c: New test.
---
gcc/config/aarch64/aarch64.c | 36 ++++++++++++++++++--------
gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++
2 files changed, 54 insertions(+), 11 deletions(-)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ac17c1c88fb..a9a1800af53 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
}
/* Expand cpymem, as if from a __builtin_memcpy. Return true if
- we succeed, otherwise return false. */
+ we succeed, otherwise return false, indicating that a libcall to
+ memcpy should be emitted. */
bool
aarch64_expand_cpymem (rtx *operands)
@@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands)
unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
- /* Inline up to 256 bytes when optimizing for speed. */
+ /* Try to inline up to 256 bytes. */
unsigned HOST_WIDE_INT max_copy_size = 256;
- if (optimize_function_for_size_p (cfun))
- max_copy_size = 128;
+ bool size_p = optimize_function_for_size_p (cfun);
+
+ if (size > max_copy_size)
+ return false;
int copy_bits = 256;
@@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands)
|| !TARGET_SIMD
|| (aarch64_tune_params.extra_tuning_flags
& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
- {
- copy_bits = 128;
- max_copy_size = max_copy_size / 2;
- }
+ copy_bits = 128;
- if (size > max_copy_size)
- return false;
+ /* Emit an inline load+store sequence and count the number of operations
+ involved. We use a simple count of just the loads and stores emitted
+ rather than rtx_insn count as all the pointer adjustments and reg copying
+ in this function will get optimized away later in the pipeline. */
+ start_sequence ();
+ unsigned nops = 0;
base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands)
cur_mode = V4SImode;
aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
-
+ /* A single block copy is 1 load + 1 store. */
+ nops += 2;
n -= mode_bits;
/* Emit trailing copies using overlapping unaligned accesses - this is
@@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands)
n = n_bits;
}
}
+ rtx_insn *seq = get_insns ();
+ end_sequence ();
+
+ /* A memcpy libcall in the worst case takes 3 instructions to prepare the
+ arguments + 1 for the call. */
+ unsigned libcall_cost = 4;
+ if (size_p && libcall_cost < nops)
+ return false;
+ emit_insn (seq);
return true;
}
diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
new file mode 100644
index 00000000000..4d488b74301
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#include <stdlib.h>
+
+/*
+** cpy_127:
+** mov x2, 127
+** b memcpy
+*/
+void
+cpy_127 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 127);
+}
+
+/*
+** cpy_128:
+** mov x2, 128
+** b memcpy
+*/
+void
+cpy_128 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 128);
+}
+
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
</cut>
Hi Arthur,
Thanks for looking into this!
The flags to compile regexec.c were:
-O3 --target=aarch64-linux-gnu -fgnu89-inline
Clang was configured with (on x86_64-linux-gnu host):
cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=AArch64
Please let me know if the above doesn’t work for you.
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 29 Sep 2021, at 20:47, Arthur Eubanks <aeubanks(a)google.com> wrote:
>
> Do you know the flags passed to Clang to compile the sources? I tried compiling the preprocessed sources but ran into the below, and couldn't find the flags in any of the logs.
>
> In file included from regexec.c:93:
> In file included from ./perl.h:384:
> In file included from /home/tcwg-buildslave/workspace/tcwg_bmk_0/abe/builds/destdir/x86_64-pc-linux-gnu/aarch64-linux-gnu/libc/usr/include/sys/types.h:144:
> /home/tcwg-buildslave/workspace/tcwg_bmk_0/llvm-install/lib/clang/14.0.0/include/stddef.h:46:27: error: typedef redefinition with different types ('unsigned long' vs 'unsigned long long')
> typedef long unsigned int size_t;
> ^
> 1 error generated.
>
>
>
> And yeah just moving the code around could cause major performance regressions, I've had other patches do the same for various benchmarks, there's not much we can do about that if that's actually the root cause. If I can compile the file I can check if the optimization actually created worse IR or not.
>
>
> On Wed, Sep 29, 2021 at 5:59 AM Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org> wrote:
> Hi Arthur,
>
> Pre-processed source is in the save-temps tarballs linked below; S_regmatch() is in regexec.i .
>
> The save-temps also have .s assembly file for before and after your patch, and the only code-gen difference is in S_reginclass() function — see the attached screenshot #1.
>
> Looking into profile of S_regmatch(), some of the extra cycles come from hot loop starting with “cbz w19,...” getting misaligned — before your patch it was starting at "2bce10", and after it starts at "2bce6c”.
>
> Maybe the added instructions in S_reginclass() pushed the loop in S_regmatch() in an unfortunate way?
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
>> On 27 Sep 2021, at 20:05, Arthur Eubanks <aeubanks(a)google.com> wrote:
>>
>> Could I get the source file with S_regmatch()?
>>
>> On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org> wrote:
>> Hi Arthur,
>>
>> Your patch seems to be slowing down 400.perlbench by 6% — due to slow down of its hot function S_regmatch() by 14%.
>>
>> Could you take a look if this is easily fixable, please?
>>
>> Regards,
>>
>> --
>> Maxim Kuvyrkov
>> https://www.linaro.org
>>
>> > On 24 Sep 2021, at 15:07, ci_notify(a)linaro.org wrote:
>> >
>> > After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
>> > Author: Arthur Eubanks <aeubanks(a)google.com>
>> >
>> > [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
>> >
>> > the following benchmarks slowed down by more than 2%:
>> > - 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
>> > - 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf samples
>> >
>> > Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
>> >
>> > For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
>> > - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> > - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> > - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> >
>> > Configuration:
>> > - Benchmark: SPEC CPU2006
>> > - Toolchain: Clang + Glibc + LLVM Linker
>> > - Version: all components were built from their tip of trunk
>> > - Target: aarch64-linux-gnu
>> > - Compiler flags: -O3
>> > - Hardware: NVidia TX1 4x Cortex-A57
>> >
>> > This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
>
> <2021-09-29_15-44-27.png><2021-09-29_15-53-20.png>
Progress
* UM-2 [QEMU upstream maintainership]
+ Worked through my code-review backlog
+ Noticed that we never got round to making our emulated GICv3
support having redistributors in more than one contiguous region;
this prevents using more than 123 CPUs with the virt board. Sent
out a patchset which adds the necessary handling.
+ Generally trying to tie off loose ends pre-holiday :-)
-- PMM
Identified regression caused by *linux:30f349097897c115345beabeecc5e710b479ff1e*:
commit 30f349097897c115345beabeecc5e710b479ff1e
Merge: 9c566611ac5c f76c87e8c337
Author: Linus Torvalds <torvalds(a)linux-foundation.org>
Merge tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Results regressed to (for first_bad == 30f349097897c115345beabeecc5e710b479ff1e)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
21782
# First few build errors in logs:
from (for last_good == 9c566611ac5cc7b45af943632f7a9b1b6a642991)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
29893
# linux build successful:
all
This commit has regressed these CI configurations:
- tcwg_kernel/gnu-release-arm-mainline-allmodconfig
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a…
Reproduce builds:
<cut>
mkdir investigate-linux-30f349097897c115345beabeecc5e710b479ff1e
cd investigate-linux-30f349097897c115345beabeecc5e710b479ff1e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /linux/ ./ ./bisect/baseline/
cd linux
# Reproduce first_bad build
git checkout --detach 30f349097897c115345beabeecc5e710b479ff1e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 9c566611ac5cc7b45af943632f7a9b1b6a642991
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 30f349097897c115345beabeecc5e710b479ff1e
Merge: 9c566611ac5c f76c87e8c337
Author: Linus Torvalds <torvalds(a)linux-foundation.org>
Date: Wed Sep 8 16:38:25 2021 -0700
Merge tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly ARM cpufreq driver updates, including one new
MediaTek driver that has just passed all of the reviews, with the
addition of a revert of a recent intel_pstate commit, some core
cpufreq changes and a DT-related update of the operating performance
points (OPP) support code.
Specifics:
- Add new cpufreq driver for the MediaTek MT6779 platform called
mediatek-hw along with corresponding DT bindings (Hector.Yuan).
- Add DCVS interrupt support to the qcom-cpufreq-hw driver (Thara
Gopinath).
- Make the qcom-cpufreq-hw driver set the dvfs_possible_from_any_cpu
policy flag (Taniya Das).
- Blocklist more Qualcomm platforms in cpufreq-dt-platdev (Bjorn
Andersson).
- Make the vexpress cpufreq driver set the CPUFREQ_IS_COOLING_DEV
flag (Viresh Kumar).
- Add new cpufreq driver callback to allow drivers to register with
the Energy Model in a consistent way and make several drivers use
it (Viresh Kumar).
- Change the remaining users of the .ready() cpufreq driver callback
to move the code from it elsewhere and drop it from the cpufreq
core (Viresh Kumar).
- Revert recent intel_pstate change adding HWP guaranteed performance
change notification support to it that led to problems, because the
notification in question is triggered prematurely on some systems
(Rafael Wysocki).
- Convert the OPP DT bindings to DT schema and clean them up while at
it (Rob Herring)"
* tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
cpufreq: Remove ready() callback
cpufreq: sh: Remove sh_cpufreq_cpu_ready()
cpufreq: acpi: Remove acpi_cpufreq_cpu_ready()
cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag
cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support
cpufreq: scmi: Use .register_em() to register with energy model
cpufreq: vexpress: Use .register_em() to register with energy model
cpufreq: scpi: Use .register_em() to register with energy model
dt-bindings: opp: Convert to DT schema
dt-bindings: Clean-up OPP binding node names in examples
ARM: dts: omap: Drop references to opp.txt
cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model
cpufreq: omap: Use .register_em() to register with energy model
cpufreq: mediatek: Use .register_em() to register with energy model
cpufreq: imx6q: Use .register_em() to register with energy model
...
Documentation/cpu-freq/cpu-drivers.rst | 3 -
.../devicetree/bindings/cpufreq/cpufreq-dt.txt | 2 +-
.../bindings/cpufreq/cpufreq-mediatek-hw.yaml | 70 +++
.../bindings/cpufreq/cpufreq-mediatek.txt | 2 +-
.../devicetree/bindings/cpufreq/cpufreq-st.txt | 6 +-
.../bindings/cpufreq/nvidia,tegra20-cpufreq.txt | 2 +-
.../devicetree/bindings/devfreq/rk3399_dmc.txt | 2 +-
.../devicetree/bindings/gpu/arm,mali-bifrost.yaml | 2 +-
.../devicetree/bindings/gpu/arm,mali-midgard.yaml | 2 +-
.../bindings/interconnect/fsl,imx8m-noc.yaml | 4 +-
.../opp/allwinner,sun50i-h6-operating-points.yaml | 4 +
Documentation/devicetree/bindings/opp/opp-v1.yaml | 51 ++
.../devicetree/bindings/opp/opp-v2-base.yaml | 214 +++++++
Documentation/devicetree/bindings/opp/opp-v2.yaml | 475 ++++++++++++++++
Documentation/devicetree/bindings/opp/opp.txt | 622 ---------------------
Documentation/devicetree/bindings/opp/qcom-opp.txt | 2 +-
.../bindings/opp/ti-omap5-opp-supply.txt | 2 +-
.../devicetree/bindings/power/power-domain.yaml | 2 +-
.../translations/zh_CN/cpu-freq/cpu-drivers.rst | 2 -
arch/arm/boot/dts/omap34xx.dtsi | 1 -
arch/arm/boot/dts/omap36xx.dtsi | 1 -
drivers/base/arch_topology.c | 2 +
drivers/cpufreq/Kconfig.arm | 12 +
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/acpi-cpufreq.c | 14 +-
drivers/cpufreq/cpufreq-dt-platdev.c | 4 +
drivers/cpufreq/cpufreq-dt.c | 3 +-
drivers/cpufreq/cpufreq.c | 17 +-
drivers/cpufreq/imx6q-cpufreq.c | 2 +-
drivers/cpufreq/intel_pstate.c | 39 --
drivers/cpufreq/mediatek-cpufreq-hw.c | 308 ++++++++++
drivers/cpufreq/mediatek-cpufreq.c | 3 +-
drivers/cpufreq/omap-cpufreq.c | 2 +-
drivers/cpufreq/qcom-cpufreq-hw.c | 151 ++++-
drivers/cpufreq/scmi-cpufreq.c | 65 ++-
drivers/cpufreq/scpi-cpufreq.c | 3 +-
drivers/cpufreq/sh-cpufreq.c | 11 -
drivers/cpufreq/vexpress-spc-cpufreq.c | 25 +-
include/linux/cpufreq.h | 75 ++-
39 files changed, 1441 insertions(+), 767 deletions(-)
</cut>
Successfully identified regression in *linux* in CI configuration tcwg_kernel/llvm-release-aarch64-next-allnoconfig. So far, this commit has regressed CI configurations:
- tcwg_kernel/llvm-release-aarch64-next-allnoconfig
Culprit:
<cut>
commit 8633ef82f101c040427b57d4df7b706261420b94
Author: Javier Martinez Canillas <javierm(a)redhat.com>
Date: Fri Jun 25 15:13:59 2021 +0200
drivers/firmware: consolidate EFI framebuffer setup for all arches
The register_gop_device() function registers an "efi-framebuffer" platform
device to match against the efifb driver, to have an early framebuffer for
EFI platforms.
But there is already support to do exactly the same by the Generic System
Framebuffers (sysfb) driver. This used to be only for X86 but it has been
moved to drivers/firmware and could be reused by other architectures.
Also, besides supporting registering an "efi-framebuffer", this driver can
register a "simple-framebuffer" allowing to use the siple{fb,drm} drivers
on non-X86 EFI platforms. For example, on aarch64 these drivers can only
be used with DT and doesn't have code to register a "simple-frambuffer"
platform device when booting with EFI.
For these reasons, let's remove the register_gop_device() duplicated code
and instead move the platform specific logic that's there to sysfb driver.
Signed-off-by: Javier Martinez Canillas <javierm(a)redhat.com>
Acked-by: Borislav Petkov <bp(a)suse.de>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625131359.1804394-1-javi…
</cut>
Results regressed to (for first_bad == 8633ef82f101c040427b57d4df7b706261420b94)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_llvm:
-5
# build_abe qemu:
-2
# linux_n_obj:
600
# First few build errors in logs:
# 00:00:38 ld.lld: error: undefined symbol: screen_info
# 00:00:38 make: *** [vmlinux] Error 1
from (for last_good == d391c58271072d0b0fad93c82018d495b2633448)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_llvm:
-5
# build_abe qemu:
-2
# linux_n_obj:
601
# linux build successful:
all
# linux boot successful:
boot
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Build top page/logs: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Configuration details:
rr[linux_git]="https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git#ff11764…"
Reproduce builds:
<cut>
mkdir investigate-linux-8633ef82f101c040427b57d4df7b706261420b94
cd investigate-linux-8633ef82f101c040427b57d4df7b706261420b94
git clone https://git.linaro.org/toolchain/jenkins-scripts
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /linux/ ./ ./bisect/baseline/
cd linux
# Reproduce first_bad build
git checkout --detach 8633ef82f101c040427b57d4df7b706261420b94
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d391c58271072d0b0fad93c82018d495b2633448
../artifacts/test.sh
cd ..
</cut>
History of pending regressions and results: https://git.linaro.org/toolchain/ci/base-artifacts.git/log/?h=linaro-local/…
Artifacts: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Build log: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Full commit (up to 1000 lines):
<cut>
commit 8633ef82f101c040427b57d4df7b706261420b94
Author: Javier Martinez Canillas <javierm(a)redhat.com>
Date: Fri Jun 25 15:13:59 2021 +0200
drivers/firmware: consolidate EFI framebuffer setup for all arches
The register_gop_device() function registers an "efi-framebuffer" platform
device to match against the efifb driver, to have an early framebuffer for
EFI platforms.
But there is already support to do exactly the same by the Generic System
Framebuffers (sysfb) driver. This used to be only for X86 but it has been
moved to drivers/firmware and could be reused by other architectures.
Also, besides supporting registering an "efi-framebuffer", this driver can
register a "simple-framebuffer" allowing to use the siple{fb,drm} drivers
on non-X86 EFI platforms. For example, on aarch64 these drivers can only
be used with DT and doesn't have code to register a "simple-frambuffer"
platform device when booting with EFI.
For these reasons, let's remove the register_gop_device() duplicated code
and instead move the platform specific logic that's there to sysfb driver.
Signed-off-by: Javier Martinez Canillas <javierm(a)redhat.com>
Acked-by: Borislav Petkov <bp(a)suse.de>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625131359.1804394-1-javi…
---
arch/arm/include/asm/efi.h | 5 +--
arch/arm64/include/asm/efi.h | 5 +--
arch/riscv/include/asm/efi.h | 5 +--
drivers/firmware/Kconfig | 8 ++--
drivers/firmware/Makefile | 2 +-
drivers/firmware/efi/efi-init.c | 90 ---------------------------------------
drivers/firmware/efi/sysfb_efi.c | 76 ++++++++++++++++++++++++++++++++-
drivers/firmware/sysfb.c | 35 ++++++++++-----
drivers/firmware/sysfb_simplefb.c | 31 ++++++++++----
drivers/gpu/drm/tiny/Kconfig | 4 +-
include/linux/sysfb.h | 26 +++++------
11 files changed, 143 insertions(+), 144 deletions(-)
diff --git a/arch/arm/include/asm/efi.h b/arch/arm/include/asm/efi.h
index 9de7ab2ce05d..a6f3b179e8a9 100644
--- a/arch/arm/include/asm/efi.h
+++ b/arch/arm/include/asm/efi.h
@@ -17,6 +17,7 @@
#ifdef CONFIG_EFI
void efi_init(void);
+extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
int efi_create_mapping(struct mm_struct *mm, efi_memory_desc_t *md);
int efi_set_mapping_permissions(struct mm_struct *mm, efi_memory_desc_t *md);
@@ -52,10 +53,6 @@ void efi_virtmap_unload(void);
struct screen_info *alloc_screen_info(void);
void free_screen_info(struct screen_info *si);
-static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
-{
-}
-
/*
* A reasonable upper bound for the uncompressed kernel size is 32 MBytes,
* so we will reserve that amount of memory. We have no easy way to tell what
diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h
index 3578aba9c608..42d673a011c8 100644
--- a/arch/arm64/include/asm/efi.h
+++ b/arch/arm64/include/asm/efi.h
@@ -14,6 +14,7 @@
#ifdef CONFIG_EFI
extern void efi_init(void);
+extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
#else
#define efi_init()
#endif
@@ -85,10 +86,6 @@ static inline void free_screen_info(struct screen_info *si)
{
}
-static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
-{
-}
-
#define EFI_ALLOC_ALIGN SZ_64K
/*
diff --git a/arch/riscv/include/asm/efi.h b/arch/riscv/include/asm/efi.h
index 6d98cd999680..7a8f0d45b13a 100644
--- a/arch/riscv/include/asm/efi.h
+++ b/arch/riscv/include/asm/efi.h
@@ -13,6 +13,7 @@
#ifdef CONFIG_EFI
extern void efi_init(void);
+extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
#else
#define efi_init()
#endif
@@ -39,10 +40,6 @@ static inline void free_screen_info(struct screen_info *si)
{
}
-static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
-{
-}
-
void efi_virtmap_load(void);
void efi_virtmap_unload(void);
diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index 71f3d97f0c39..af6719cc576b 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -254,9 +254,9 @@ config QCOM_SCM_DOWNLOAD_MODE_DEFAULT
config SYSFB
bool
default y
- depends on X86 || COMPILE_TEST
+ depends on X86 || ARM || ARM64 || RISCV || COMPILE_TEST
-config X86_SYSFB
+config SYSFB_SIMPLEFB
bool "Mark VGA/VBE/EFI FB as generic system framebuffer"
depends on SYSFB
help
@@ -264,10 +264,10 @@ config X86_SYSFB
bootloader or kernel can show basic video-output during boot for
user-guidance and debugging. Historically, x86 used the VESA BIOS
Extensions and EFI-framebuffers for this, which are mostly limited
- to x86.
+ to x86 BIOS or EFI systems.
This option, if enabled, marks VGA/VBE/EFI framebuffers as generic
framebuffers so the new generic system-framebuffer drivers can be
- used on x86. If the framebuffer is not compatible with the generic
+ used instead. If the framebuffer is not compatible with the generic
modes, it is advertised as fallback platform framebuffer so legacy
drivers like efifb, vesafb and uvesafb can pick it up.
If this option is not selected, all system framebuffers are always
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index ad78f78ffa8d..6ac637e422b9 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -19,7 +19,7 @@ obj-$(CONFIG_RASPBERRYPI_FIRMWARE) += raspberrypi.o
obj-$(CONFIG_FW_CFG_SYSFS) += qemu_fw_cfg.o
obj-$(CONFIG_QCOM_SCM) += qcom_scm.o qcom_scm-smc.o qcom_scm-legacy.o
obj-$(CONFIG_SYSFB) += sysfb.o
-obj-$(CONFIG_X86_SYSFB) += sysfb_simplefb.o
+obj-$(CONFIG_SYSFB_SIMPLEFB) += sysfb_simplefb.o
obj-$(CONFIG_TI_SCI_PROTOCOL) += ti_sci.o
obj-$(CONFIG_TRUSTED_FOUNDATIONS) += trusted_foundations.o
obj-$(CONFIG_TURRIS_MOX_RWTM) += turris-mox-rwtm.o
diff --git a/drivers/firmware/efi/efi-init.c b/drivers/firmware/efi/efi-init.c
index a552a08a1741..b19ce1a83f91 100644
--- a/drivers/firmware/efi/efi-init.c
+++ b/drivers/firmware/efi/efi-init.c
@@ -275,93 +275,3 @@ void __init efi_init(void)
}
#endif
}
-
-static bool efifb_overlaps_pci_range(const struct of_pci_range *range)
-{
- u64 fb_base = screen_info.lfb_base;
-
- if (screen_info.capabilities & VIDEO_CAPABILITY_64BIT_BASE)
- fb_base |= (u64)(unsigned long)screen_info.ext_lfb_base << 32;
-
- return fb_base >= range->cpu_addr &&
- fb_base < (range->cpu_addr + range->size);
-}
-
-static struct device_node *find_pci_overlap_node(void)
-{
- struct device_node *np;
-
- for_each_node_by_type(np, "pci") {
- struct of_pci_range_parser parser;
- struct of_pci_range range;
- int err;
-
- err = of_pci_range_parser_init(&parser, np);
- if (err) {
- pr_warn("of_pci_range_parser_init() failed: %d\n", err);
- continue;
- }
-
- for_each_of_pci_range(&parser, &range)
- if (efifb_overlaps_pci_range(&range))
- return np;
- }
- return NULL;
-}
-
-/*
- * If the efifb framebuffer is backed by a PCI graphics controller, we have
- * to ensure that this relation is expressed using a device link when
- * running in DT mode, or the probe order may be reversed, resulting in a
- * resource reservation conflict on the memory window that the efifb
- * framebuffer steals from the PCIe host bridge.
- */
-static int efifb_add_links(struct fwnode_handle *fwnode)
-{
- struct device_node *sup_np;
-
- sup_np = find_pci_overlap_node();
-
- /*
- * If there's no PCI graphics controller backing the efifb, we are
- * done here.
- */
- if (!sup_np)
- return 0;
-
- fwnode_link_add(fwnode, of_fwnode_handle(sup_np));
- of_node_put(sup_np);
-
- return 0;
-}
-
-static const struct fwnode_operations efifb_fwnode_ops = {
- .add_links = efifb_add_links,
-};
-
-static struct fwnode_handle efifb_fwnode;
-
-static int __init register_gop_device(void)
-{
- struct platform_device *pd;
- int err;
-
- if (screen_info.orig_video_isVGA != VIDEO_TYPE_EFI)
- return 0;
-
- pd = platform_device_alloc("efi-framebuffer", 0);
- if (!pd)
- return -ENOMEM;
-
- if (IS_ENABLED(CONFIG_PCI)) {
- fwnode_init(&efifb_fwnode, &efifb_fwnode_ops);
- pd->dev.fwnode = &efifb_fwnode;
- }
-
- err = platform_device_add_data(pd, &screen_info, sizeof(screen_info));
- if (err)
- return err;
-
- return platform_device_add(pd);
-}
-subsys_initcall(register_gop_device);
diff --git a/drivers/firmware/efi/sysfb_efi.c b/drivers/firmware/efi/sysfb_efi.c
index 9f035b15501c..f51865e1b876 100644
--- a/drivers/firmware/efi/sysfb_efi.c
+++ b/drivers/firmware/efi/sysfb_efi.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * Generic System Framebuffers on x86
+ * Generic System Framebuffers
* Copyright (c) 2012-2013 David Herrmann <dh.herrmann(a)gmail.com>
*
* EFI Quirks Copyright (c) 2006 Edgar Hucek <gimli(a)dark-green.com>
@@ -19,7 +19,9 @@
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/of_address.h>
#include <linux/pci.h>
+#include <linux/platform_device.h>
#include <linux/screen_info.h>
#include <linux/sysfb.h>
#include <video/vga.h>
@@ -267,7 +269,72 @@ static const struct dmi_system_id efifb_dmi_swap_width_height[] __initconst = {
{},
};
-__init void sysfb_apply_efi_quirks(void)
+static bool efifb_overlaps_pci_range(const struct of_pci_range *range)
+{
+ u64 fb_base = screen_info.lfb_base;
+
+ if (screen_info.capabilities & VIDEO_CAPABILITY_64BIT_BASE)
+ fb_base |= (u64)(unsigned long)screen_info.ext_lfb_base << 32;
+
+ return fb_base >= range->cpu_addr &&
+ fb_base < (range->cpu_addr + range->size);
+}
+
+static struct device_node *find_pci_overlap_node(void)
+{
+ struct device_node *np;
+
+ for_each_node_by_type(np, "pci") {
+ struct of_pci_range_parser parser;
+ struct of_pci_range range;
+ int err;
+
+ err = of_pci_range_parser_init(&parser, np);
+ if (err) {
+ pr_warn("of_pci_range_parser_init() failed: %d\n", err);
+ continue;
+ }
+
+ for_each_of_pci_range(&parser, &range)
+ if (efifb_overlaps_pci_range(&range))
+ return np;
+ }
+ return NULL;
+}
+
+/*
+ * If the efifb framebuffer is backed by a PCI graphics controller, we have
+ * to ensure that this relation is expressed using a device link when
+ * running in DT mode, or the probe order may be reversed, resulting in a
+ * resource reservation conflict on the memory window that the efifb
+ * framebuffer steals from the PCIe host bridge.
+ */
+static int efifb_add_links(struct fwnode_handle *fwnode)
+{
+ struct device_node *sup_np;
+
+ sup_np = find_pci_overlap_node();
+
+ /*
+ * If there's no PCI graphics controller backing the efifb, we are
+ * done here.
+ */
+ if (!sup_np)
+ return 0;
+
+ fwnode_link_add(fwnode, of_fwnode_handle(sup_np));
+ of_node_put(sup_np);
+
+ return 0;
+}
+
+static const struct fwnode_operations efifb_fwnode_ops = {
+ .add_links = efifb_add_links,
+};
+
+static struct fwnode_handle efifb_fwnode;
+
+__init void sysfb_apply_efi_quirks(struct platform_device *pd)
{
if (screen_info.orig_video_isVGA != VIDEO_TYPE_EFI ||
!(screen_info.capabilities & VIDEO_CAPABILITY_SKIP_QUIRKS))
@@ -281,4 +348,9 @@ __init void sysfb_apply_efi_quirks(void)
screen_info.lfb_height = temp;
screen_info.lfb_linelength = 4 * screen_info.lfb_width;
}
+
+ if (screen_info.orig_video_isVGA == VIDEO_TYPE_EFI && IS_ENABLED(CONFIG_PCI)) {
+ fwnode_init(&efifb_fwnode, &efifb_fwnode_ops);
+ pd->dev.fwnode = &efifb_fwnode;
+ }
}
diff --git a/drivers/firmware/sysfb.c b/drivers/firmware/sysfb.c
index 1337515963d5..2bfbb05f7d89 100644
--- a/drivers/firmware/sysfb.c
+++ b/drivers/firmware/sysfb.c
@@ -1,11 +1,11 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * Generic System Framebuffers on x86
+ * Generic System Framebuffers
* Copyright (c) 2012-2013 David Herrmann <dh.herrmann(a)gmail.com>
*/
/*
- * Simple-Framebuffer support for x86 systems
+ * Simple-Framebuffer support
* Create a platform-device for any available boot framebuffer. The
* simple-framebuffer platform device is already available on DT systems, so
* this module parses the global "screen_info" object and creates a suitable
@@ -16,12 +16,12 @@
* to pick these devices up without messing with simple-framebuffer drivers.
* The global "screen_info" is still valid at all times.
*
- * If CONFIG_X86_SYSFB is not selected, we never register "simple-framebuffer"
+ * If CONFIG_SYSFB_SIMPLEFB is not selected, never register "simple-framebuffer"
* platform devices, but only use legacy framebuffer devices for
* backwards compatibility.
*
* TODO: We set the dev_id field of all platform-devices to 0. This allows
- * other x86 OF/DT parsers to create such devices, too. However, they must
+ * other OF/DT parsers to create such devices, too. However, they must
* start at offset 1 for this to work.
*/
@@ -43,12 +43,10 @@ static __init int sysfb_init(void)
bool compatible;
int ret;
- sysfb_apply_efi_quirks();
-
/* try to create a simple-framebuffer device */
- compatible = parse_mode(si, &mode);
+ compatible = sysfb_parse_mode(si, &mode);
if (compatible) {
- ret = create_simplefb(si, &mode);
+ ret = sysfb_create_simplefb(si, &mode);
if (!ret)
return 0;
}
@@ -61,9 +59,24 @@ static __init int sysfb_init(void)
else
name = "platform-framebuffer";
- pd = platform_device_register_resndata(NULL, name, 0,
- NULL, 0, si, sizeof(*si));
- return PTR_ERR_OR_ZERO(pd);
+ pd = platform_device_alloc(name, 0);
+ if (!pd)
+ return -ENOMEM;
+
+ sysfb_apply_efi_quirks(pd);
+
+ ret = platform_device_add_data(pd, si, sizeof(*si));
+ if (ret)
+ goto err;
+
+ ret = platform_device_add(pd);
+ if (ret)
+ goto err;
+
+ return 0;
+err:
+ platform_device_put(pd);
+ return ret;
}
/* must execute after PCI subsystem for EFI quirks */
diff --git a/drivers/firmware/sysfb_simplefb.c b/drivers/firmware/sysfb_simplefb.c
index df892444ea17..b86761904949 100644
--- a/drivers/firmware/sysfb_simplefb.c
+++ b/drivers/firmware/sysfb_simplefb.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * Generic System Framebuffers on x86
+ * Generic System Framebuffers
* Copyright (c) 2012-2013 David Herrmann <dh.herrmann(a)gmail.com>
*/
@@ -23,9 +23,9 @@
static const char simplefb_resname[] = "BOOTFB";
static const struct simplefb_format formats[] = SIMPLEFB_FORMATS;
-/* try parsing x86 screen_info into a simple-framebuffer mode struct */
-__init bool parse_mode(const struct screen_info *si,
- struct simplefb_platform_data *mode)
+/* try parsing screen_info into a simple-framebuffer mode struct */
+__init bool sysfb_parse_mode(const struct screen_info *si,
+ struct simplefb_platform_data *mode)
{
const struct simplefb_format *f;
__u8 type;
@@ -57,13 +57,14 @@ __init bool parse_mode(const struct screen_info *si,
return false;
}
-__init int create_simplefb(const struct screen_info *si,
- const struct simplefb_platform_data *mode)
+__init int sysfb_create_simplefb(const struct screen_info *si,
+ const struct simplefb_platform_data *mode)
{
struct platform_device *pd;
struct resource res;
u64 base, size;
u32 length;
+ int ret;
/*
* If the 64BIT_BASE capability is set, ext_lfb_base will contain the
@@ -105,7 +106,19 @@ __init int create_simplefb(const struct screen_info *si,
if (res.end <= res.start)
return -EINVAL;
- pd = platform_device_register_resndata(NULL, "simple-framebuffer", 0,
- &res, 1, mode, sizeof(*mode));
- return PTR_ERR_OR_ZERO(pd);
+ pd = platform_device_alloc("simple-framebuffer", 0);
+ if (!pd)
+ return -ENOMEM;
+
+ sysfb_apply_efi_quirks(pd);
+
+ ret = platform_device_add_resources(pd, &res, 1);
+ if (ret)
+ return ret;
+
+ ret = platform_device_add_data(pd, mode, sizeof(*mode));
+ if (ret)
+ return ret;
+
+ return platform_device_add(pd);
}
diff --git a/drivers/gpu/drm/tiny/Kconfig b/drivers/gpu/drm/tiny/Kconfig
index 5593128eeff9..d31be274a2bd 100644
--- a/drivers/gpu/drm/tiny/Kconfig
+++ b/drivers/gpu/drm/tiny/Kconfig
@@ -64,8 +64,8 @@ config DRM_SIMPLEDRM
buffer, size, and display format must be provided via device tree,
UEFI, VESA, etc.
- On x86 and compatible, you should also select CONFIG_X86_SYSFB to
- use UEFI and VESA framebuffers.
+ On x86 BIOS or UEFI systems, you should also select SYSFB_SIMPLEFB
+ to use UEFI and VESA framebuffers.
config TINYDRM_HX8357D
tristate "DRM support for HX8357D display panels"
diff --git a/include/linux/sysfb.h b/include/linux/sysfb.h
index 3e5355769dc3..b0dcfa26d07b 100644
--- a/include/linux/sysfb.h
+++ b/include/linux/sysfb.h
@@ -58,37 +58,37 @@ struct efifb_dmi_info {
#ifdef CONFIG_EFI
extern struct efifb_dmi_info efifb_dmi_list[];
-void sysfb_apply_efi_quirks(void);
+void sysfb_apply_efi_quirks(struct platform_device *pd);
#else /* CONFIG_EFI */
-static inline void sysfb_apply_efi_quirks(void)
+static inline void sysfb_apply_efi_quirks(struct platform_device *pd)
{
}
#endif /* CONFIG_EFI */
-#ifdef CONFIG_X86_SYSFB
+#ifdef CONFIG_SYSFB_SIMPLEFB
-bool parse_mode(const struct screen_info *si,
- struct simplefb_platform_data *mode);
-int create_simplefb(const struct screen_info *si,
- const struct simplefb_platform_data *mode);
+bool sysfb_parse_mode(const struct screen_info *si,
+ struct simplefb_platform_data *mode);
+int sysfb_create_simplefb(const struct screen_info *si,
+ const struct simplefb_platform_data *mode);
-#else /* CONFIG_X86_SYSFB */
+#else /* CONFIG_SYSFB_SIMPLE */
-static inline bool parse_mode(const struct screen_info *si,
- struct simplefb_platform_data *mode)
+static inline bool sysfb_parse_mode(const struct screen_info *si,
+ struct simplefb_platform_data *mode)
{
return false;
}
-static inline int create_simplefb(const struct screen_info *si,
- const struct simplefb_platform_data *mode)
+static inline int sysfb_create_simplefb(const struct screen_info *si,
+ const struct simplefb_platform_data *mode)
{
return -EINVAL;
}
-#endif /* CONFIG_X86_SYSFB */
+#endif /* CONFIG_SYSFB_SIMPLE */
#endif /* _LINUX_SYSFB_H */
</cut>
Hi Greg,
This appears to have been a fluke. Boot-testing succeeded before the merge and failed after. Boot-testing on allmodconfig doesn’t seem to be stable, so we are going to disable it.
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 18 Aug 2021, at 08:38, Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> wrote:
>
> On Wed, Aug 18, 2021 at 05:22:07AM +0000, ci_notify(a)linaro.org wrote:
>> Successfully identified regression in *linux* in CI configuration tcwg_kernel/llvm-master-aarch64-lts-allmodconfig. So far, this commit has regressed CI configurations:
>> - tcwg_kernel/llvm-master-aarch64-lts-allmodconfig
>>
>> Culprit:
>> <cut>
>> commit 132a8267adabd645476b542b3b132c1b91988fe8
>> Author: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
>> Date: Thu Aug 12 13:22:21 2021 +0200
>>
>> Linux 5.10.58
>
> <snip>
>
> And what am I supposed to do with this information?
>
> --
> You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe(a)googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/YRyczv2OCq51edQh%40kroa….
[TCWG CI] Regression caused by binutils: [gdb/testsuite] Add gdb.testsuite/dump-system-info.exp:
commit b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
Author: Tom de Vries <tdevries(a)suse.de>
[gdb/testsuite] Add gdb.testsuite/dump-system-info.exp
Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-8
# build_abe newlib:
-6
# build_abe stage2 -- --patch linaro-local/vect-metric-branch --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-5
# true:
0
# benchmark -- -O3_VECT_mthumb artifacts/build-b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1/results_id:
1
from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-8
# build_abe newlib:
-6
# build_abe stage2 -- --patch linaro-local/vect-metric-branch --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-5
# true:
0
# benchmark -- -O3_VECT_mthumb artifacts/build-baseline/results_id:
1
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_eabi_stm32/gnu_eabi-master-arm_eabi-coremark-O3_VECT
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Reproduce builds:
<cut>
mkdir investigate-binutils-b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
cd investigate-binutils-b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /binutils/ ./ ./bisect/baseline/
cd binutils
# Reproduce first_bad build
git checkout --detach b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 3814a9e1fe77c01c7e872c25afa198537d4ac780
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
Author: Tom de Vries <tdevries(a)suse.de>
Date: Fri Sep 24 12:39:14 2021 +0200
[gdb/testsuite] Add gdb.testsuite/dump-system-info.exp
When interpreting the testsuite results, it's often relevant what kind of
machine the testsuite ran on. On a local machine one can just do
/proc/cpuinfo, but in case of running tests using a remote system
that distributes test runs to other remote systems that are not directly
accessible, that's not possible.
Fix this by dumping /proc/cpuinfo into the gdb.log, as well as lsb_release -a
and uname -a.
We could do this at the start of each test run, by putting it into unix.exp
or some such. However, this might be too verbose, so we choose to put it into
its own test-case, such that it get triggered in a full testrun, but not when
running one or a subset of tests.
We put the test-case into the gdb.testsuite directory, which is currently the
only place in the testsuite where we do not test gdb. [ Though perhaps this
could be put into a new gdb.info directory, since the test-case doesn't
actually test the testsuite. ]
Tested on x86_64-linux.
---
gdb/testsuite/gdb.testsuite/dump-system-info.exp | 48 ++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/gdb/testsuite/gdb.testsuite/dump-system-info.exp b/gdb/testsuite/gdb.testsuite/dump-system-info.exp
new file mode 100644
index 00000000000..bf181469bd5
--- /dev/null
+++ b/gdb/testsuite/gdb.testsuite/dump-system-info.exp
@@ -0,0 +1,48 @@
+# Copyright 2021 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+# The purpose of this test-case is to dump /proc/cpuinfo and similar system
+# info into gdb.log.
+
+# Check if /proc/cpuinfo is available.
+set res [remote_exec target "test -r /proc/cpuinfo"]
+set status [lindex $res 0]
+set output [lindex $res 1]
+
+if { $status == 0 && $output == "" } {
+ verbose -log "Cpuinfo available, dumping:"
+ remote_exec target "cat /proc/cpuinfo"
+} else {
+ verbose -log "Cpuinfo not available"
+}
+
+set res [remote_exec target "lsb_release -a"]
+set status [lindex $res 0]
+set output [lindex $res 1]
+
+if { $status == 0 } {
+ verbose -log "lsb_release -a availabe, dumping:\n$output"
+} else {
+ verbose -log "lsb_release -a not available"
+}
+
+set res [remote_exec target "uname -a"]
+set status [lindex $res 0]
+set output [lindex $res 1]
+
+if { $status == 0 } {
+ verbose -log "uname -a availabe, dumping:\n$output"
+} else {
+ verbose -log "uname -a not available"
+}
</cut>
After llvm commit e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
Author: Erich Keane <erich.keane(a)intel.com>
Fix test from 8dd42f, capitalization in test
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 3% from 10973 to 11249 perf samples
- 464.h264ref:[.] FastFullPelBlockMotionSearch slowed down by 12% from 1446 to 1619 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
cd investigate-llvm-e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 77d200a546136c2855063613ff4bca1f682fb23a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
Author: Erich Keane <erich.keane(a)intel.com>
Date: Fri Sep 24 10:24:17 2021 -0700
Fix test from 8dd42f, capitalization in test
---
clang/test/CXX/drs/dr17xx.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/clang/test/CXX/drs/dr17xx.cpp b/clang/test/CXX/drs/dr17xx.cpp
index 42303c83ae3c..c8648908ebda 100644
--- a/clang/test/CXX/drs/dr17xx.cpp
+++ b/clang/test/CXX/drs/dr17xx.cpp
@@ -129,7 +129,7 @@ namespace dr1778 { // dr1778: 9
namespace dr1762 { // dr1762: 14
#if __cplusplus >= 201103L
float operator ""_E(const char *);
- // expected-error@+2 {{invalid suffix on literal; c++11 requires a space between literal and identifier}}
+ // expected-error@+2 {{invalid suffix on literal; C++11 requires a space between literal and identifier}}
// expected-warning@+1 {{user-defined literal suffixes not starting with '_' are reserved; no literal will invoke this operator}}
float operator ""E(const char *);
#endif
</cut>
Thanks, Stanislav,
FWIW, it will be, probably, easier for you to just rebuild the compiler, it is an x86_64-linux-gnu -> arm-linux-gnueabihf cross. This link has the build log [1].
cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=ARM
Then compile the pre-processed source with plain -O2 or -O3 optimisation settings.
[1] https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-…
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 24 Sep 2021, at 20:30, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
>
> [AMD Official Use Only]
>
> I have reverted the whole change. There was yet another perf regression report.
>
> Stas
>
> From: Mekhanoshin, Stanislav
> Sent: Thursday, September 23, 2021 11:48
> To: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> Subject: RE: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
>
> Thanks. I see the reload. There shall not be extra pressure since that is the whole idea, make pressure less. However, I see more spills in that specific file, fast_algorithms.s if I get it right.
> Can I get the IR for it? Something to feed llc.
>
> Stas
>
> From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> Sent: Thursday, September 23, 2021 2:31
> To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
>
> [CAUTION: External Email]
>
> Thanks, Stanislav.
>
> I’ve looked into profile dumps, and 456.hmmer’s hot loop get several additional reloads. E.g., "ldr r1, [sp, #84]” generates 203 additional samples, which translates into 20 seconds of time just for that one instruction.
>
> See the attached profile dumps and the the screenshot with the hot loop highlighted.
>
> Maybe your patch increases register pressure too much?
>
> Regards,
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
> > On 22 Sep 2021, at 22:35, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
> >
> > [AMD Official Use Only]
> >
> > There are actually couple things worth to try if that is easy:
> >
> > https://reviews.llvm.org/D109077
> > https://reviews.llvm.org/differential/diff/374324/
> >
> > Both may slightly change spill weights and then spilling pattern.
> >
> > Stas
> >
> > -----Original Message-----
> > From: Mekhanoshin, Stanislav
> > Sent: Wednesday, September 22, 2021 12:09
> > To: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> > Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> > Subject: RE: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >
> > I assume some of the newly rematerialized instructions caused perf drops. Probably some very specific ones. I would appreciate if you could point them to me.
> > In addition I believe I would need to have a linked or optimized bitcode to feed into llc.
> >
> > Stas
> >
> > -----Original Message-----
> > From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> > Sent: Wednesday, September 22, 2021 12:06
> > To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> > Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> > Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >
> > [CAUTION: External Email]
> >
> > Hi Stanislav,
> >
> > That's fair; I or someone from Linaro will try to analyze this and follow up here.
> >
> > On a more general note, what info would you like to see in these benchmarking regression reports?
> >
> > Thanks,
> >
> > --
> > Maxim Kuvyrkov
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linar…
> >
> >
> >> On Sep 22, 2021, at 9:40 PM, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
> >>
> >> [AMD Official Use Only]
> >>
> >> Hm... I'd really like to help, but I do not think I can do anything with megabytes of code in an asm which I do not understand and tons of differences in 48 asm files.
> >> What I can see there is overall less spilling code which was the intent in the first place: hmmer has 4 less spill opcodes overall and sphinx has 27 less of them.
> >> I doubt I could say much more without someone pointing to the actual root cause.
> >>
> >> Stas
> >>
> >> -----Original Message-----
> >> From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> >> Sent: Wednesday, September 22, 2021 5:16
> >> To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> >> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> >> Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >>
> >> [CAUTION: External Email]
> >>
> >> Hi Stanislav,
> >>
> >> Attached is a tarball with -save-temps output (pre-processed source and generated assembly) for first-bad run (your commit) and last-good run (immediate parent of your commit).
> >>
> >> --
> >> Maxim Kuvyrkov
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linar…
> >>
> >>> On 20 Sep 2021, at 23:15, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
> >>>
> >>> [AMD Official Use Only]
> >>>
> >>> Thanks for letting me know. Some regressions are inevitable, however do you happen to have any analysis and dumps? I myself do not understand ARM ISA well...
> >>>
> >>> Stas
> >>>
> >>> -----Original Message-----
> >>> From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> >>> Sent: Wednesday, September 15, 2021 5:52
> >>> To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> >>> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> >>> Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >>>
> >>> [CAUTION: External Email]
> >>>
> >>> Hi Stanislav,
> >>>
> >>> FYI, your patch seems to be slowing down two of SPEC CPU2006 tests on 32-bit ARM at -O2 and -O3 optimization levels.
> >>>
> >>> --
> >>> Maxim Kuvyrkov
> >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linar…
> >>>
> >>
> >>
> >
> <image001.png>
I’m trying to import these files into Ds-5. After unzipping files, it still will not show up in ds-5 search. Below is the error that I keep receiving:
Sent from my iPhone
After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Avoid invalid loop transformations in jump threading registry.
the following benchmarks grew in size by more than 1%:
- 450.soplex grew in size by 2% from 207260 to 211436 bytes
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Date: Thu Sep 23 10:59:24 2021 +0200
Avoid invalid loop transformations in jump threading registry.
My upcoming improvements to the forward jump threader make it thread
more aggressively. In investigating some "regressions", I noticed
that it has always allowed threading through empty latches and across
loop boundaries. As we have discussed recently, this should be avoided
until after loop optimizations have run their course.
Note that this wasn't much of a problem before because DOM/VRP
couldn't find these opportunities, but with a smarter solver, we trip
over them more easily.
Because the forward threader doesn't have an independent localized cost
model like the new threader (profitable_path_p), it is difficult to
catch these things at discovery. However, we can catch them at
registration time, with the added benefit that all the threaders
(forward and backward) can share the handcuffs.
This patch is an adaptation of what we do in the backward threader, but
it is not meant to catch everything we do there, as some of the
restrictions there are due to limitations of the different block
copiers (for example, the generic copier does not re-use existing
threading paths).
We could ideally remove the now redundant bits in profitable_path_p, but
I would prefer not to for two reasons. First, the backward threader uses
profitable_path_p as it discovers paths to avoid discovering paths in
unprofitable directions. Second, I would like to merge all the forward
cost restrictions into the profitability class in the backward threader,
not the other way around. Alas, that reshuffling will have to wait for
the next release.
As usual, there are quite a few tests that needed adjustments. It seems
we were quite happily threading improper scenarios. With most of them,
as can be seen in pr77445-2.c, we're merely shifting the threading to
after loop optimizations.
Tested on x86-64 Linux.
gcc/ChangeLog:
* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
New.
(jt_path_registry::register_jump_thread): Call
cancel_invalid_paths.
* tree-ssa-threadupdate.h (class jt_path_registry): Add
cancel_invalid_paths.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/20030714-2.c: Adjust.
* gcc.dg/tree-ssa/pr66752-3.c: Adjust.
* gcc.dg/tree-ssa/pr77445-2.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
* gcc.dg/vect/bb-slp-16.c: Adjust.
---
gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c | 7 ++-
gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c | 19 ++++---
gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 4 +-
gcc/testsuite/gcc.dg/vect/bb-slp-16.c | 7 ---
gcc/tree-ssa-threadupdate.c | 67 ++++++++++++++++++-----
gcc/tree-ssa-threadupdate.h | 1 +
8 files changed, 78 insertions(+), 35 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
index eb663f2ff5b..9585ff11307 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
@@ -32,7 +32,8 @@ get_alias_set (t)
}
}
-/* There should be exactly three IF conditionals if we thread jumps
- properly. */
-/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */
+/* There should be exactly 4 IF conditionals if we thread jumps
+ properly. There used to be 3, but one thread was crossing
+ loops. */
+/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index e1464e21170..922a331b217 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */
extern int status, pt;
extern int count;
@@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a)
pt--;
}
-/* There are 4 jump threading opportunities, all of which will be
- realized, which will eliminate testing of FLAG, completely. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */
+/* There are 2 jump threading opportunities (which don't cross loops),
+ all of which will be realized, which will eliminate testing of
+ FLAG, completely. */
+/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */
-/* There should be no assignments or references to FLAG, verify they're
- eliminated as early as possible. */
-/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */
+/* We used to remove references to FLAG by DCE2, but this was
+ depending on early threaders threading through loop boundaries
+ (which we shouldn't do). However, the late threading passes, which
+ run after loop optimizations , can successfully eliminate the
+ references to FLAG. Verify that ther are no references by the late
+ threading passes. */
+/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index f9fc212f49e..01a0f1f197d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) {
aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
to change decisions in switch expansion which in turn can expose new
jump threading opportunities. Skip the later tests on aarch64. */
-/* { dg-final { scan-tree-dump "Jumps threaded: 1\[1-9\]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
index 60d4f76f076..2d78d045516 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
@@ -21,5 +21,7 @@
condition.
All the cases are picked up by VRP1 as jump threads. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
+
+/* There used to be 6 jump threads found by thread1, but they all
+ depended on threading through distinct loops in ethread. */
/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index e3d4b311c03..16abcde5053 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,8 @@
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */
/* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index 664e93e9b60..e68a9b62535 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -1,8 +1,5 @@
/* { dg-require-effective-target vect_int } */
-/* See note below as to why we disable threading. */
-/* { dg-additional-options "-fdisable-tree-thread1" } */
-
#include <stdarg.h>
#include "tree-vect.h"
@@ -30,10 +27,6 @@ main1 (int dummy)
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
- /* In some architectures like ppc64, jump threading may thread
- the iteration where i==0 such that we no longer optimize the
- BB. Another alternative to disable jump threading would be
- to wrap the read from `i' into a function returning i. */
if (arr[i] = i)
a = i;
else
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index baac11280fa..2b9b8f81274 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool may_peel_loop_headers)
return retval;
}
+bool
+jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
+{
+ gcc_checking_assert (!path.is_empty ());
+ edge taken_edge = path[path.length () - 1]->e;
+ loop_p loop = taken_edge->src->loop_father;
+ bool seen_latch = false;
+ bool path_crosses_loops = false;
+
+ for (unsigned int i = 0; i < path.length (); i++)
+ {
+ edge e = path[i]->e;
+
+ if (e == NULL)
+ {
+ // NULL outgoing edges on a path can happen for jumping to a
+ // constant address.
+ cancel_thread (&path, "Found NULL edge in jump threading path");
+ return true;
+ }
+
+ if (loop->latch == e->src || loop->latch == e->dest)
+ seen_latch = true;
+
+ // The first entry represents the block with an outgoing edge
+ // that we will redirect to the jump threading path. Thus we
+ // don't care about that block's loop father.
+ if ((i > 0 && e->src->loop_father != loop)
+ || e->dest->loop_father != loop)
+ path_crosses_loops = true;
+
+ if (flag_checking && !m_backedge_threads)
+ gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
+ }
+
+ if (cfun->curr_properties & PROP_loop_opts_done)
+ return false;
+
+ if (seen_latch && empty_block_p (loop->latch))
+ {
+ cancel_thread (&path, "Threading through latch before loop opts "
+ "would create non-empty latch");
+ return true;
+ }
+ if (path_crosses_loops)
+ {
+ cancel_thread (&path, "Path crosses loops");
+ return true;
+ }
+ return false;
+}
+
/* Register a jump threading opportunity. We queue up all the jump
threading opportunities discovered by a pass and update the CFG
and SSA form all at once.
@@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread (vec<jump_thread_edge *> *path)
return false;
}
- /* First make sure there are no NULL outgoing edges on the jump threading
- path. That can happen for jumping to a constant address. */
- for (unsigned int i = 0; i < path->length (); i++)
- {
- if ((*path)[i]->e == NULL)
- {
- cancel_thread (path, "Found NULL edge in jump threading path");
- return false;
- }
-
- if (flag_checking && !m_backedge_threads)
- gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0);
- }
+ if (cancel_invalid_paths (*path))
+ return false;
if (dump_file && (dump_flags & TDF_DETAILS))
dump_jump_thread_path (dump_file, *path, true);
diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
index 8b48a671212..d68795c9f27 100644
--- a/gcc/tree-ssa-threadupdate.h
+++ b/gcc/tree-ssa-threadupdate.h
@@ -75,6 +75,7 @@ protected:
unsigned long m_num_threaded_edges;
private:
virtual bool update_cfg (bool peel_loop_headers) = 0;
+ bool cancel_invalid_paths (vec<jump_thread_edge *> &path);
jump_thread_path_allocator m_allocator;
// True if threading through back edges is allowed. This is only
// allowed in the generic copier in the backward threader.
</cut>
Progress
* UM-2 [QEMU upstream maintainership]
+ Still looking at the mess that is non-unique bus names. Worked
through exactly which devices and machine types are affected for
the i2c bus.
+ Sent a patchset which tries to make the "create a bus" function
names a bit more regular across different bus types.
* QEMU-406 [QEMU support for MVE (M-profile Vector Extension; Helium)]
+ Luis figured out why GDB was crashing when fed the MVE XML by
QEMU's gdbstub; this was a combination of QEMU giving GDB some
non-standard extra registers in its "vfp" XML feature and GDB
not being robust enough against those unexpected extras. Sent
out a patchset which cleans up QEMU's XML in this area and also
implements the extra XML for MVE. (This will only go into QEMU
once the GDB patches have landed and the XML format is nailed down.)
-- PMM
After gcc commit f92901a508305f291fcf2acae0825379477724de
Author: Richard Biener <rguenther(a)suse.de>
tree-optimization/65206 - dependence analysis on mixed pointer/array
the following benchmarks slowed down by more than 2%:
- 482.sphinx3 slowed down by 4% from 20816 to 21661 perf samples
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tx1/gnu-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-f92901a508305f291fcf2acae0825379477724de
cd investigate-gcc-f92901a508305f291fcf2acae0825379477724de
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach f92901a508305f291fcf2acae0825379477724de
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach abdf63d782cba82b5ecf264248518cbb065650ed
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit f92901a508305f291fcf2acae0825379477724de
Author: Richard Biener <rguenther(a)suse.de>
Date: Wed Sep 8 14:42:31 2021 +0200
tree-optimization/65206 - dependence analysis on mixed pointer/array
This adds the capability to analyze the dependence of mixed
pointer/array accesses. The example is from where using a masked
load/store creates the pointer-based access when an otherwise
unconditional access is array based. Other examples would include
accesses to an array mixed with accesses from inlined helpers
that work on pointers.
The idea is quite simple and old - analyze the data-ref indices
as if the reference was pointer-based. The following change does
this by changing dr_analyze_indices to work on the indices
sub-structure and storing an alternate indices substructure in
each data reference. That alternate set of indices is analyzed
lazily by initialize_data_dependence_relation when it fails to
match-up the main set of indices of two data references.
initialize_data_dependence_relation is refactored into a head
and a tail worker and changed to work on one of the indices
structures and thus away from using DR_* access macros which
continue to reference the main indices substructure.
There are quite some vectorization and loop distribution opportunities
unleashed in SPEC CPU 2017, notably 520.omnetpp_r, 548.exchange2_r,
510.parest_r, 511.povray_r, 521.wrf_r, 526.blender_r, 527.cam4_r and
544.nab_r see amendments in what they report with -fopt-info-loop while
the rest of the specrate set sees no changes there. Measuring runtime
for the set where changes were reported reveals nothing off-noise
besides 511.povray_r which seems to regress slightly for me
(on a Zen2 machine with -Ofast -march=native).
2021-09-08 Richard Biener <rguenther(a)suse.de>
PR tree-optimization/65206
* tree-data-ref.h (struct data_reference): Add alt_indices,
order it last.
* tree-data-ref.c (free_data_ref): Release alt_indices.
(dr_analyze_indices): Work on struct indices and get DR_REF as tree.
(create_data_ref): Adjust.
(initialize_data_dependence_relation): Split into head
and tail. When the base objects fail to match up try
again with pointer-based analysis of indices.
* tree-vectorizer.c (vec_info_shared::check_datarefs): Do
not compare the lazily computed alternate set of indices.
* gcc.dg/torture/20210916.c: New testcase.
* gcc.dg/vect/pr65206.c: Likewise.
---
gcc/testsuite/gcc.dg/torture/20210916.c | 20 ++++
gcc/testsuite/gcc.dg/vect/pr65206.c | 22 ++++
gcc/tree-data-ref.c | 174 +++++++++++++++++++++-----------
gcc/tree-data-ref.h | 9 +-
gcc/tree-vectorizer.c | 3 +-
5 files changed, 168 insertions(+), 60 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/torture/20210916.c b/gcc/testsuite/gcc.dg/torture/20210916.c
new file mode 100644
index 00000000000..0ea6d45e463
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/20210916.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+typedef union tree_node *tree;
+struct tree_base {
+ unsigned : 1;
+ unsigned lang_flag_2 : 1;
+};
+struct tree_type {
+ tree main_variant;
+};
+union tree_node {
+ struct tree_base base;
+ struct tree_type type;
+};
+tree finish_struct_t, finish_struct_x;
+void finish_struct()
+{
+ for (; finish_struct_t->type.main_variant;)
+ finish_struct_x->base.lang_flag_2 = 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr65206.c b/gcc/testsuite/gcc.dg/vect/pr65206.c
new file mode 100644
index 00000000000..3b6262622c0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr65206.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-additional-options "-fno-trapping-math -fno-allow-store-data-races" } */
+/* { dg-additional-options "-mavx" { target avx } } */
+
+#define N 1024
+
+double a[N], b[N];
+
+void foo ()
+{
+ for (int i = 0; i < N; ++i)
+ if (b[i] < 3.)
+ a[i] += b[i];
+}
+
+/* We get a .MASK_STORE because while the load of a[i] does not trap
+ the store would introduce store data races. Make sure we still
+ can handle the data dependence with zero distance. */
+
+/* { dg-final { scan-tree-dump-not "versioning for alias required" "vect" { target { vect_masked_store || avx } } } } */
+/* { dg-final { scan-tree-dump "vectorized 1 loops in function" "vect" { target { vect_masked_store || avx } } } } */
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index e061baa7c20..18307a554fc 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -99,6 +99,7 @@ along with GCC; see the file COPYING3. If not see
#include "internal-fn.h"
#include "vr-values.h"
#include "range-op.h"
+#include "tree-ssa-loop-ivopts.h"
static struct datadep_stats
{
@@ -1300,22 +1301,18 @@ base_supports_access_fn_components_p (tree base)
DR, analyzed in LOOP and instantiated before NEST. */
static void
-dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
+dr_analyze_indices (struct indices *dri, tree ref, edge nest, loop_p loop)
{
- vec<tree> access_fns = vNULL;
- tree ref, op;
- tree base, off, access_fn;
-
/* If analyzing a basic-block there are no indices to analyze
and thus no access functions. */
if (!nest)
{
- DR_BASE_OBJECT (dr) = DR_REF (dr);
- DR_ACCESS_FNS (dr).create (0);
+ dri->base_object = ref;
+ dri->access_fns.create (0);
return;
}
- ref = DR_REF (dr);
+ vec<tree> access_fns = vNULL;
/* REALPART_EXPR and IMAGPART_EXPR can be handled like accesses
into a two element array with a constant index. The base is
@@ -1338,8 +1335,8 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
{
if (TREE_CODE (ref) == ARRAY_REF)
{
- op = TREE_OPERAND (ref, 1);
- access_fn = analyze_scalar_evolution (loop, op);
+ tree op = TREE_OPERAND (ref, 1);
+ tree access_fn = analyze_scalar_evolution (loop, op);
access_fn = instantiate_scev (nest, loop, access_fn);
access_fns.safe_push (access_fn);
}
@@ -1370,16 +1367,16 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
analyzed nest, add it as an additional independent access-function. */
if (TREE_CODE (ref) == MEM_REF)
{
- op = TREE_OPERAND (ref, 0);
- access_fn = analyze_scalar_evolution (loop, op);
+ tree op = TREE_OPERAND (ref, 0);
+ tree access_fn = analyze_scalar_evolution (loop, op);
access_fn = instantiate_scev (nest, loop, access_fn);
if (TREE_CODE (access_fn) == POLYNOMIAL_CHREC)
{
- tree orig_type;
tree memoff = TREE_OPERAND (ref, 1);
- base = initial_condition (access_fn);
- orig_type = TREE_TYPE (base);
+ tree base = initial_condition (access_fn);
+ tree orig_type = TREE_TYPE (base);
STRIP_USELESS_TYPE_CONVERSION (base);
+ tree off;
split_constant_offset (base, &base, &off);
STRIP_USELESS_TYPE_CONVERSION (base);
/* Fold the MEM_REF offset into the evolutions initial
@@ -1424,7 +1421,7 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
base, memoff);
MR_DEPENDENCE_CLIQUE (ref) = MR_DEPENDENCE_CLIQUE (old);
MR_DEPENDENCE_BASE (ref) = MR_DEPENDENCE_BASE (old);
- DR_UNCONSTRAINED_BASE (dr) = true;
+ dri->unconstrained_base = true;
access_fns.safe_push (access_fn);
}
}
@@ -1436,8 +1433,8 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
build_int_cst (reference_alias_ptr_type (ref), 0));
}
- DR_BASE_OBJECT (dr) = ref;
- DR_ACCESS_FNS (dr) = access_fns;
+ dri->base_object = ref;
+ dri->access_fns = access_fns;
}
/* Extracts the alias analysis information from the memory reference DR. */
@@ -1463,6 +1460,8 @@ void
free_data_ref (data_reference_p dr)
{
DR_ACCESS_FNS (dr).release ();
+ if (dr->alt_indices.base_object)
+ dr->alt_indices.access_fns.release ();
free (dr);
}
@@ -1497,7 +1496,7 @@ create_data_ref (edge nest, loop_p loop, tree memref, gimple *stmt,
dr_analyze_innermost (&DR_INNERMOST (dr), memref,
nest != NULL ? loop : NULL, stmt);
- dr_analyze_indices (dr, nest, loop);
+ dr_analyze_indices (&dr->indices, DR_REF (dr), nest, loop);
dr_analyze_alias (dr);
if (dump_file && (dump_flags & TDF_DETAILS))
@@ -3066,41 +3065,30 @@ access_fn_components_comparable_p (tree ref_a, tree ref_b)
TREE_TYPE (TREE_OPERAND (ref_b, 0)));
}
-/* Initialize a data dependence relation between data accesses A and
- B. NB_LOOPS is the number of loops surrounding the references: the
- size of the classic distance/direction vectors. */
+/* Initialize a data dependence relation RES in LOOP_NEST. USE_ALT_INDICES
+ is true when the main indices of A and B were not comparable so we try again
+ with alternate indices computed on an indirect reference. */
struct data_dependence_relation *
-initialize_data_dependence_relation (struct data_reference *a,
- struct data_reference *b,
- vec<loop_p> loop_nest)
+initialize_data_dependence_relation (struct data_dependence_relation *res,
+ vec<loop_p> loop_nest,
+ bool use_alt_indices)
{
- struct data_dependence_relation *res;
+ struct data_reference *a = DDR_A (res);
+ struct data_reference *b = DDR_B (res);
unsigned int i;
- res = XCNEW (struct data_dependence_relation);
- DDR_A (res) = a;
- DDR_B (res) = b;
- DDR_LOOP_NEST (res).create (0);
- DDR_SUBSCRIPTS (res).create (0);
- DDR_DIR_VECTS (res).create (0);
- DDR_DIST_VECTS (res).create (0);
-
- if (a == NULL || b == NULL)
+ struct indices *indices_a = &a->indices;
+ struct indices *indices_b = &b->indices;
+ if (use_alt_indices)
{
- DDR_ARE_DEPENDENT (res) = chrec_dont_know;
- return res;
+ if (TREE_CODE (DR_REF (a)) != MEM_REF)
+ indices_a = &a->alt_indices;
+ if (TREE_CODE (DR_REF (b)) != MEM_REF)
+ indices_b = &b->alt_indices;
}
-
- /* If the data references do not alias, then they are independent. */
- if (!dr_may_alias_p (a, b, loop_nest.exists () ? loop_nest[0] : NULL))
- {
- DDR_ARE_DEPENDENT (res) = chrec_known;
- return res;
- }
-
- unsigned int num_dimensions_a = DR_NUM_DIMENSIONS (a);
- unsigned int num_dimensions_b = DR_NUM_DIMENSIONS (b);
+ unsigned int num_dimensions_a = indices_a->access_fns.length ();
+ unsigned int num_dimensions_b = indices_b->access_fns.length ();
if (num_dimensions_a == 0 || num_dimensions_b == 0)
{
DDR_ARE_DEPENDENT (res) = chrec_dont_know;
@@ -3125,9 +3113,9 @@ initialize_data_dependence_relation (struct data_reference *a,
the a and b accesses have a single ARRAY_REF component reference [0]
but have two subscripts. */
- if (DR_UNCONSTRAINED_BASE (a))
+ if (indices_a->unconstrained_base)
num_dimensions_a -= 1;
- if (DR_UNCONSTRAINED_BASE (b))
+ if (indices_b->unconstrained_base)
num_dimensions_b -= 1;
/* These structures describe sequences of component references in
@@ -3210,6 +3198,10 @@ initialize_data_dependence_relation (struct data_reference *a,
B: [3, 4] (i.e. s.e) */
while (index_a < num_dimensions_a && index_b < num_dimensions_b)
{
+ /* The alternate indices form always has a single dimension
+ with unconstrained base. */
+ gcc_assert (!use_alt_indices);
+
/* REF_A and REF_B must be one of the component access types
allowed by dr_analyze_indices. */
gcc_checking_assert (access_fn_component_p (ref_a));
@@ -3280,11 +3272,12 @@ initialize_data_dependence_relation (struct data_reference *a,
/* See whether FULL_SEQ ends at the base and whether the two bases
are equal. We do not care about TBAA or alignment info so we can
use OEP_ADDRESS_OF to avoid false negatives. */
- tree base_a = DR_BASE_OBJECT (a);
- tree base_b = DR_BASE_OBJECT (b);
+ tree base_a = indices_a->base_object;
+ tree base_b = indices_b->base_object;
bool same_base_p = (full_seq.start_a + full_seq.length == num_dimensions_a
&& full_seq.start_b + full_seq.length == num_dimensions_b
- && DR_UNCONSTRAINED_BASE (a) == DR_UNCONSTRAINED_BASE (b)
+ && (indices_a->unconstrained_base
+ == indices_b->unconstrained_base)
&& operand_equal_p (base_a, base_b, OEP_ADDRESS_OF)
&& (types_compatible_p (TREE_TYPE (base_a),
TREE_TYPE (base_b))
@@ -3323,7 +3316,7 @@ initialize_data_dependence_relation (struct data_reference *a,
both lvalues are distinct from the object's declared type. */
if (same_base_p)
{
- if (DR_UNCONSTRAINED_BASE (a))
+ if (indices_a->unconstrained_base)
full_seq.length += 1;
}
else
@@ -3332,8 +3325,41 @@ initialize_data_dependence_relation (struct data_reference *a,
/* Punt if we didn't find a suitable sequence. */
if (full_seq.length == 0)
{
- DDR_ARE_DEPENDENT (res) = chrec_dont_know;
- return res;
+ if (use_alt_indices
+ || (TREE_CODE (DR_REF (a)) == MEM_REF
+ && TREE_CODE (DR_REF (b)) == MEM_REF)
+ || may_be_nonaddressable_p (DR_REF (a))
+ || may_be_nonaddressable_p (DR_REF (b)))
+ {
+ /* Fully exhausted possibilities. */
+ DDR_ARE_DEPENDENT (res) = chrec_dont_know;
+ return res;
+ }
+
+ /* Try evaluating both DRs as dereferences of pointers. */
+ if (!a->alt_indices.base_object
+ && TREE_CODE (DR_REF (a)) != MEM_REF)
+ {
+ tree alt_ref = build2 (MEM_REF, TREE_TYPE (DR_REF (a)),
+ build1 (ADDR_EXPR, ptr_type_node, DR_REF (a)),
+ build_int_cst
+ (reference_alias_ptr_type (DR_REF (a)), 0));
+ dr_analyze_indices (&a->alt_indices, alt_ref,
+ loop_preheader_edge (loop_nest[0]),
+ loop_containing_stmt (DR_STMT (a)));
+ }
+ if (!b->alt_indices.base_object
+ && TREE_CODE (DR_REF (b)) != MEM_REF)
+ {
+ tree alt_ref = build2 (MEM_REF, TREE_TYPE (DR_REF (b)),
+ build1 (ADDR_EXPR, ptr_type_node, DR_REF (b)),
+ build_int_cst
+ (reference_alias_ptr_type (DR_REF (b)), 0));
+ dr_analyze_indices (&b->alt_indices, alt_ref,
+ loop_preheader_edge (loop_nest[0]),
+ loop_containing_stmt (DR_STMT (b)));
+ }
+ return initialize_data_dependence_relation (res, loop_nest, true);
}
if (!same_base_p)
@@ -3381,8 +3407,8 @@ initialize_data_dependence_relation (struct data_reference *a,
struct subscript *subscript;
subscript = XNEW (struct subscript);
- SUB_ACCESS_FN (subscript, 0) = DR_ACCESS_FN (a, full_seq.start_a + i);
- SUB_ACCESS_FN (subscript, 1) = DR_ACCESS_FN (b, full_seq.start_b + i);
+ SUB_ACCESS_FN (subscript, 0) = indices_a->access_fns[full_seq.start_a + i];
+ SUB_ACCESS_FN (subscript, 1) = indices_b->access_fns[full_seq.start_b + i];
SUB_CONFLICTS_IN_A (subscript) = conflict_fn_not_known ();
SUB_CONFLICTS_IN_B (subscript) = conflict_fn_not_known ();
SUB_LAST_CONFLICT (subscript) = chrec_dont_know;
@@ -3393,6 +3419,40 @@ initialize_data_dependence_relation (struct data_reference *a,
return res;
}
+/* Initialize a data dependence relation between data accesses A and
+ B. NB_LOOPS is the number of loops surrounding the references: the
+ size of the classic distance/direction vectors. */
+
+struct data_dependence_relation *
+initialize_data_dependence_relation (struct data_reference *a,
+ struct data_reference *b,
+ vec<loop_p> loop_nest)
+{
+ data_dependence_relation *res = XCNEW (struct data_dependence_relation);
+ DDR_A (res) = a;
+ DDR_B (res) = b;
+ DDR_LOOP_NEST (res).create (0);
+ DDR_SUBSCRIPTS (res).create (0);
+ DDR_DIR_VECTS (res).create (0);
+ DDR_DIST_VECTS (res).create (0);
+
+ if (a == NULL || b == NULL)
+ {
+ DDR_ARE_DEPENDENT (res) = chrec_dont_know;
+ return res;
+ }
+
+ /* If the data references do not alias, then they are independent. */
+ if (!dr_may_alias_p (a, b, loop_nest.exists () ? loop_nest[0] : NULL))
+ {
+ DDR_ARE_DEPENDENT (res) = chrec_known;
+ return res;
+ }
+
+ return initialize_data_dependence_relation (res, loop_nest, false);
+}
+
+
/* Frees memory used by the conflict function F. */
static void
diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 685f33d85ae..74f579c9f3f 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -166,14 +166,19 @@ struct data_reference
and runs to completion. */
bool is_conditional_in_stmt;
+ /* Alias information for the data reference. */
+ struct dr_alias alias;
+
/* Behavior of the memory reference in the innermost loop. */
struct innermost_loop_behavior innermost;
/* Subscripts of this data reference. */
struct indices indices;
- /* Alias information for the data reference. */
- struct dr_alias alias;
+ /* Alternate subscripts initialized lazily and used by data-dependence
+ analysis only when the main indices of two DRs are not comparable.
+ Keep last to keep vec_info_shared::check_datarefs happy. */
+ struct indices alt_indices;
};
#define DR_STMT(DR) (DR)->stmt
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 3aa3e2a6783..20daa31187d 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -507,7 +507,8 @@ vec_info_shared::check_datarefs ()
return;
gcc_assert (datarefs.length () == datarefs_copy.length ());
for (unsigned i = 0; i < datarefs.length (); ++i)
- if (memcmp (&datarefs_copy[i], datarefs[i], sizeof (data_reference)) != 0)
+ if (memcmp (&datarefs_copy[i], datarefs[i],
+ offsetof (data_reference, alt_indices)) != 0)
gcc_unreachable ();
}
</cut>
After llvm commit 669ddd1e9b1226432b003dbba05b99f8e992285b
Author: Arthur Eubanks <aeubanks(a)google.com>
Turn on the new pass manager by default
the following benchmarks grew in size by more than 1%:
- 403.gcc grew in size by 2% from 2586180 to 2648252 bytes
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their latest release branch
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-release-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-llvm-669ddd1e9b1226432b003dbba05b99f8e992285b
cd investigate-llvm-669ddd1e9b1226432b003dbba05b99f8e992285b
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 669ddd1e9b1226432b003dbba05b99f8e992285b
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach b15cbaf5a03d0b32dbc32c37766e32ccf66e6c87
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 669ddd1e9b1226432b003dbba05b99f8e992285b
Author: Arthur Eubanks <aeubanks(a)google.com>
Date: Mon Jan 25 11:00:56 2021 -0800
Turn on the new pass manager by default
This turns on the new pass manager by default for the optimization pipeline in
Clang and ThinLTO in various LLD backends. This also makes uses of `opt
-instcombine` use the new pass manager (unless specifically opted out).
This does not affect the backend target-dependent codegen pipeline.
If this causes regressions, you can opt out of the new pass manager
either via the -DENABLE_EXPERIMENTAL_NEW_PASS_MANAGER=OFF CMake flag
while building LLVM, or via various compiler flags, e.g.
-flegacy-pass-manager for Clang or -Wl,--lto-legacy-pass-manager for
ELF LLD. Please file bugs for any regressions.
Major differences:
* The inliner works slightly differently
* -O1 does some amount of inlining
* LCSSA and LoopSimplify are run before all loop passes
* Loop unswitching is implemented slightly differently
* A new SpeculateAroundPHIs pass is added to the pipeline
https://lists.llvm.org/pipermail/llvm-dev/2021-January/148098.html
Reviewed By: asbirlea, ychen, MaskRay, echristo
Differential Revision: https://reviews.llvm.org/D95380
---
llvm/CMakeLists.txt | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/llvm/CMakeLists.txt b/llvm/CMakeLists.txt
index 1affc289e64b..f5298de9f7ca 100644
--- a/llvm/CMakeLists.txt
+++ b/llvm/CMakeLists.txt
@@ -688,8 +688,8 @@ else()
endif()
option(LLVM_ENABLE_PLUGINS "Enable plugin support" ${LLVM_ENABLE_PLUGINS_default})
-set(ENABLE_EXPERIMENTAL_NEW_PASS_MANAGER FALSE CACHE BOOL
- "Enable the experimental new pass manager by default.")
+set(ENABLE_EXPERIMENTAL_NEW_PASS_MANAGER TRUE CACHE BOOL
+ "Enable the new pass manager by default.")
include(HandleLLVMOptions)
</cut>
After gcc commit 3c57e692357c79ee7623dfc1586652aee2aefb8f
Author: Patrick Palka <ppalka(a)redhat.com>
libstdc++: Add floating-point std::to_chars implementation
the following hot functions grew in size by more than 10% (but their benchmarks grew in size by less than 1%):
- 447.dealII:libstdc++.so.6.0.29 grew in size by 12% from 1245370 to 1391240 bytes
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their latest release branch
- Target: arm-linux-gnueabihf
- Compiler flags: -Os -mthumb
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-release-arm-spec2k6-Os
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-gcc-3c57e692357c79ee7623dfc1586652aee2aefb8f
cd investigate-gcc-3c57e692357c79ee7623dfc1586652aee2aefb8f
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 3c57e692357c79ee7623dfc1586652aee2aefb8f
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 5033506993ef92589373270a8e8dbbf50e3ebef1
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 3c57e692357c79ee7623dfc1586652aee2aefb8f
Author: Patrick Palka <ppalka(a)redhat.com>
Date: Thu Dec 17 23:11:34 2020 -0500
libstdc++: Add floating-point std::to_chars implementation
This implements the floating-point std::to_chars overloads for float,
double and long double. We use the Ryu library to compute the shortest
round-trippable fixed and scientific forms for float, double and long
double. We also use Ryu for performing explicit-precision fixed and
scientific formatting for float and double. For explicit-precision
formatting for long double we fall back to using printf. Hexadecimal
formatting for float, double and long double is implemented from
scratch.
The supported long double binary formats are binary64, binary80 (x86
80-bit extended precision), binary128 and ibm128.
Much of the complexity of the implementation is in computing the exact
output length before handing it off to Ryu (which doesn't do bounds
checking). In some cases it's hard to compute the output length
beforehand, so in these cases we instead compute an upper bound on the
output length and use a sufficiently-sized intermediate buffer only if
necessary.
Another source of complexity is in the general-with-precision formatting
mode, where we need to do zero-trimming of the string returned by Ryu,
and where we also take care to avoid having to format the number through
Ryu a second time when the general formatting mode resolves to fixed
(which we determine by doing a scientific formatting first and
inspecting the scientific exponent). We avoid going through Ryu twice
by instead transforming the scientific form to the corresponding fixed
form via in-place string manipulation.
This implementation is non-conforming in a couple of ways:
1. For the shortest hexadecimal formatting, we currently follow the
Microsoft implementation's decision to be consistent with the
output of printf's '%a' specifier at the expense of sometimes not
printing the shortest representation. For example, the shortest hex
form for the number 1.08p+0 is 2.1p-1, but we output the former
instead of the latter, as does printf.
2. The Ryu routine generic_binary_to_decimal that we use for performing
shortest formatting for large floating point types is implemented
using the __int128 type, but some targets with a large long double
type lack __int128 (e.g. i686), so we can't perform shortest
formatting of long double on such targets through Ryu. As a
temporary stopgap this patch makes the long double to_chars overloads
just dispatch to the double overloads on these targets, which means
we lose precision in the output. (We could potentially fix this by
writing a specialized version of Ryu's generic_binary_to_decimal
routine that uses uint64_t instead of __int128.) [Though I wonder if
there's a better way to work around the lack of __int128 on i686
specifically?]
3. Our shortest formatting for __ibm128 doesn't guarantee the round-trip
property if the difference between the high- and low-order exponent
is large. This is because we treat __ibm128 as if it has a
contiguous 105-bit mantissa by merging the mantissas of the high-
and low-order parts (using code extracted from glibc), so we
potentially lose precision from the low-order part. This seems to be
consistent with how glibc printf formats __ibm128.
libstdc++-v3/ChangeLog:
* config/abi/pre/gnu.ver: Add new exports.
* include/std/charconv (to_chars): Declare the floating-point
overloads for float, double and long double.
* src/c++17/Makefile.am (sources): Add floating_to_chars.cc.
* src/c++17/Makefile.in: Regenerate.
* src/c++17/floating_to_chars.cc: New file.
(to_chars): Define for float, double and long double.
* testsuite/20_util/to_chars/long_double.cc: New test.
---
libstdc++-v3/config/abi/pre/gnu.ver | 7 +
libstdc++-v3/include/std/charconv | 24 +
libstdc++-v3/src/c++17/Makefile.am | 1 +
libstdc++-v3/src/c++17/Makefile.in | 3 +-
libstdc++-v3/src/c++17/floating_to_chars.cc | 1563 ++++++++++++++++++++
.../testsuite/20_util/to_chars/long_double.cc | 199 +++
6 files changed, 1796 insertions(+), 1 deletion(-)
diff --git a/libstdc++-v3/config/abi/pre/gnu.ver b/libstdc++-v3/config/abi/pre/gnu.ver
index 4b4bd8ab6da..05e0a512247 100644
--- a/libstdc++-v3/config/abi/pre/gnu.ver
+++ b/libstdc++-v3/config/abi/pre/gnu.ver
@@ -2393,6 +2393,13 @@ GLIBCXX_3.4.29 {
# std::once_flag::_M_finish(bool)
_ZNSt9once_flag9_M_finishEb;
+ # std::to_chars(char*, char*, [float|double|long double])
+ _ZSt8to_charsPcS_[defg];
+ # std::to_chars(char*, char*, [float|double|long double], chars_format)
+ _ZSt8to_charsPcS_[defg]St12chars_format;
+ # std::to_chars(char*, char*, [float|double|long double], chars_format, int)
+ _ZSt8to_charsPcS_[defg]St12chars_formati;
+
} GLIBCXX_3.4.28;
# Symbols in the support library (libsupc++) have their own tag.
diff --git a/libstdc++-v3/include/std/charconv b/libstdc++-v3/include/std/charconv
index dd1ebdf8322..b57b0a16db2 100644
--- a/libstdc++-v3/include/std/charconv
+++ b/libstdc++-v3/include/std/charconv
@@ -702,6 +702,30 @@ namespace __detail
chars_format __fmt = chars_format::general) noexcept;
#endif
+ // Floating-point std::to_chars
+
+ // Overloads for float.
+ to_chars_result to_chars(char* __first, char* __last, float __value) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, float __value,
+ chars_format __fmt) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, float __value,
+ chars_format __fmt, int __precision) noexcept;
+
+ // Overloads for double.
+ to_chars_result to_chars(char* __first, char* __last, double __value) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, double __value,
+ chars_format __fmt) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, double __value,
+ chars_format __fmt, int __precision) noexcept;
+
+ // Overloads for long double.
+ to_chars_result to_chars(char* __first, char* __last, long double __value)
+ noexcept;
+ to_chars_result to_chars(char* __first, char* __last, long double __value,
+ chars_format __fmt) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, long double __value,
+ chars_format __fmt, int __precision) noexcept;
+
_GLIBCXX_END_NAMESPACE_VERSION
} // namespace std
#endif // C++14
diff --git a/libstdc++-v3/src/c++17/Makefile.am b/libstdc++-v3/src/c++17/Makefile.am
index 37cdb53c076..2ec5ed621ca 100644
--- a/libstdc++-v3/src/c++17/Makefile.am
+++ b/libstdc++-v3/src/c++17/Makefile.am
@@ -51,6 +51,7 @@ endif
sources = \
floating_from_chars.cc \
+ floating_to_chars.cc \
fs_dir.cc \
fs_ops.cc \
fs_path.cc \
diff --git a/libstdc++-v3/src/c++17/Makefile.in b/libstdc++-v3/src/c++17/Makefile.in
index ccae721ab3f..9b36b7a916c 100644
--- a/libstdc++-v3/src/c++17/Makefile.in
+++ b/libstdc++-v3/src/c++17/Makefile.in
@@ -124,7 +124,7 @@ LTLIBRARIES = $(noinst_LTLIBRARIES)
libc__17convenience_la_LIBADD =
@ENABLE_DUAL_ABI_TRUE@am__objects_1 = cow-fs_dir.lo cow-fs_ops.lo \
@ENABLE_DUAL_ABI_TRUE@ cow-fs_path.lo
-am__objects_2 = floating_from_chars.lo fs_dir.lo fs_ops.lo fs_path.lo \
+am__objects_2 = floating_from_chars.lo floating_to_chars.lo fs_dir.lo fs_ops.lo fs_path.lo \
memory_resource.lo $(am__objects_1)
@ENABLE_DUAL_ABI_TRUE@am__objects_3 = cow-string-inst.lo
@ENABLE_EXTERN_TEMPLATE_TRUE@am__objects_4 = ostream-inst.lo \
@@ -440,6 +440,7 @@ headers =
sources = \
floating_from_chars.cc \
+ floating_to_chars.cc \
fs_dir.cc \
fs_ops.cc \
fs_path.cc \
diff --git a/libstdc++-v3/src/c++17/floating_to_chars.cc b/libstdc++-v3/src/c++17/floating_to_chars.cc
new file mode 100644
index 00000000000..dd83f5eea93
--- /dev/null
+++ b/libstdc++-v3/src/c++17/floating_to_chars.cc
@@ -0,0 +1,1563 @@
+// std::to_chars implementation for floating-point types -*- C++ -*-
+
+// Copyright (C) 2020 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library. This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+
+// Under Section 7 of GPL version 3, you are granted additional
+// permissions described in the GCC Runtime Library Exception, version
+// 3.1, as published by the Free Software Foundation.
+
+// You should have received a copy of the GNU General Public License and
+// a copy of the GCC Runtime Library Exception along with this program;
+// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
+// <http://www.gnu.org/licenses/>.
+
+// Activate __glibcxx_assert within this file to shake out any bugs.
+#define _GLIBCXX_ASSERTIONS 1
+
+#include <charconv>
+
+#include <bit>
+#include <cfenv>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <langinfo.h>
+#include <optional>
+#include <string_view>
+#include <type_traits>
+
+// Determine the binary format of 'long double'.
+
+// We support the binary64, float80 (i.e. x86 80-bit extended precision),
+// binary128, and ibm128 formats.
+#define LDK_UNSUPPORTED 0
+#define LDK_BINARY64 1
+#define LDK_FLOAT80 2
+#define LDK_BINARY128 3
+#define LDK_IBM128 4
+
+#if __LDBL_MANT_DIG__ == __DBL_MANT_DIG__
+# define LONG_DOUBLE_KIND LDK_BINARY64
+#elif defined(__SIZEOF_INT128__)
+// The Ryu routines need a 128-bit integer type in order to do shortest
+// formatting of types larger than 64-bit double, so without __int128 we can't
+// support any large long double format. This is the case for e.g. i386.
+# if __LDBL_MANT_DIG__ == 64
+# define LONG_DOUBLE_KIND LDK_FLOAT80
+# elif __LDBL_MANT_DIG__ == 113
+# define LONG_DOUBLE_KIND LDK_BINARY128
+# elif __LDBL_MANT_DIG__ == 106
+# define LONG_DOUBLE_KIND LDK_IBM128
+# endif
+#endif
+#if !defined(LONG_DOUBLE_KIND)
+# define LONG_DOUBLE_KIND LDK_UNSUPPORTED
+#endif
+
+namespace
+{
+ namespace ryu
+ {
+#include "ryu/common.h"
+#include "ryu/digit_table.h"
+#include "ryu/d2s_intrinsics.h"
+#include "ryu/d2s_full_table.h"
+#include "ryu/d2fixed_full_table.h"
+#include "ryu/f2s_intrinsics.h"
+#include "ryu/d2s.c"
+#include "ryu/d2fixed.c"
+#include "ryu/f2s.c"
+
+#ifdef __SIZEOF_INT128__
+ namespace generic128
+ {
+ // Put the generic Ryu bits in their own namespace to avoid name conflicts.
+# include "ryu/generic_128.h"
+# include "ryu/ryu_generic_128.h"
+# include "ryu/generic_128.c"
+ } // namespace generic128
+
+ using generic128::floating_decimal_128;
+ using generic128::generic_binary_to_decimal;
+
+ int
+ to_chars(const floating_decimal_128 v, char* const result)
+ { return generic128::generic_to_chars(v, result); }
+#endif
+ } // namespace ryu
+
+ // A traits class that contains pertinent information about the binary
+ // format of each of the floating-point types we support.
+ template<typename T>
+ struct floating_type_traits
+ { };
+
+ template<>
+ struct floating_type_traits<float>
+ {
+ // We (and Ryu) assume float has the IEEE binary32 format.
+ static_assert(__FLT_MANT_DIG__ == 24);
+ static constexpr int mantissa_bits = 23;
+ static constexpr int exponent_bits = 8;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = uint32_t;
+ using shortest_scientific_t = ryu::floating_decimal_32;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000011101011100110101100101101110000000000000000000000000 };
+ };
+
+ template<>
+ struct floating_type_traits<double>
+ {
+ // We (and Ryu) assume double has the IEEE binary64 format.
+ static_assert(__DBL_MANT_DIG__ == 53);
+ static constexpr int mantissa_bits = 52;
+ static constexpr int exponent_bits = 11;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = uint64_t;
+ using shortest_scientific_t = ryu::floating_decimal_64;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000011000110101110111000001100101110000111100,
+ 0b0111100011110101011000011110000000110110010101011000001110011111,
+ 0b0101101100000000011100100100111100110110110100010001010101110000,
+ 0b0011110010111000101111110101100011101100010001010000000101100111,
+ 0b0001010000011001011100100001010000010101101000001101000000000000 };
+ };
+
+#if LONG_DOUBLE_KIND == LDK_BINARY64
+ // When long double is equivalent to double, we just forward the long double
+ // overloads to the double overloads, so we don't need to define a a
+ // floating_type_traits<long double> specialization in this case.
+#elif LONG_DOUBLE_KIND == LDK_FLOAT80
+ template<>
+ struct floating_type_traits<long double>
+ {
+ static constexpr int mantissa_bits = 64;
+ static constexpr int exponent_bits = 15;
+ static constexpr bool has_implicit_leading_bit = false;
+ using mantissa_t = uint64_t;
+ using shortest_scientific_t = ryu::floating_decimal_128;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000000000110101011111110100010100110000011101,
+ 0b1001100101001111010011011111101000101111110001011001011101110000,
+ 0b0000101111111011110010001000001010111101011110111111010100011001,
+ 0b0011100000011111001101101011111001111100100010000101001111101001,
+ 0b0100100100000000100111010010101110011000110001101101110011001010,
+ 0b0111100111100010100000010011000010010110101111110101000011110100,
+ 0b1010100111100010011110000011011101101100010110000110101010101010,
+ 0b0000001111001111000000101100111011011000101000110011101100110010,
+ 0b0111000011100100101101010100001101111110101111001000010011111111,
+ 0b0010111000100110100100100010101100111010110001101010010111001000,
+ 0b0000100000010110000011001001000111000001111010100101101000001111,
+ 0b0010101011101000111100001011000010011101000101010010010000101111,
+ 0b1011111011101101110010101011010001111000101000101101011001100011,
+ 0b1010111011011011110111110011001010000010011001110100101101000101,
+ 0b0011000001110110011010010000011100100011001011001100001101010110,
+ 0b0100011111011000111111101000011110000010111110101001000000001001,
+ 0b1110000001110001001101101110011000100000001010000111100010111010,
+ 0b1110001001010011101000111000001000010100110000010110100011110000,
+ 0b0000011010110000110001111000011111000011001101001101001001000110,
+ 0b1010010111001000101001100101010110100100100010010010000101000010,
+ 0b1011001110000111100010100110000011100011111001110111001100000101,
+ 0b0110101001001000010110001000010001010101110101100001111100011001,
+ 0b1111100011110101011110011010101001010010100011000010110001101001,
+ 0b0100000100001000111101011100010011011111011001000000001100011000,
+ 0b1110111111000111100101110111110000000011001110011100011011011001,
+ 0b1100001100100000010001100011011000111011110000110011010101000011,
+ 0b1111111011100111011101001111111000010000001111010111110010000100,
+ 0b1110111001111110101111000101000000001010001110011010001000111010,
+ 0b1000010001011000101111111010110011111101110101101001111000111010,
+ 0b0100000111101001000111011001101000001010111011101001101111000100,
+ 0b0000011100110001000111011100111100110001101111111010110111100000,
+ 0b0000011101011100100110010011110101010100010011110010010111010000,
+ 0b0011011001100111110101111100001001101110101101001110110011110110,
+ 0b1011000101000001110100111001100100111100110011110000000001101000,
+ 0b1011100011110100001001110101010110111001000000001011101001011110,
+ 0b1111001010010010100000010110101010101011101000101000000000001100,
+ 0b1000001111100100111001110101100001010011111111000001000011110000,
+ 0b0001011101001000010000101101111000001110101100110011001100110111,
+ 0b1110011100000010101011011111001010111101111110100000011100000011,
+ 0b1001110110011100101010011110100010110001001110110000101011100110,
+ 0b1001101000100011100111010000011011100001000000110101100100001001,
+ 0b1010111000101000101101010111000010001100001010100011111100000100,
+ 0b0111101000100011000101101011111011100010001101110111001111001011,
+ 0b1110100111010110001110110110000000010110100011110000010001111100,
+ 0b1100010100011010001011001000111001010101011110100101011001000000,
+ 0b0000110001111001100110010110111010101101001101000000000010010101,
+ 0b0001110111101000001111101010110010010000111110111100000111110100,
+ 0b0111110111001001111000110001101101001010101110110101111110000100,
+ 0b0000111110111010101111100010111010011100010110011011011001000001,
+ 0b1010010100100100101110111111111000101100000010111111101101000110,
+ 0b1000100111111101100011001101000110001000000100010101010100001101,
+ 0b1100101010101000111100101100001000110001110010100000000010110101,
+ 0b1010000100111101100100101010010110100010000000110101101110000100,
+ 0b1011111011110001110000100100000000001010111010001101100000100100,
+ 0b0111101101100011001110011100000001000101101101111000100111011111,
+ 0b0100111010010011011001010011110100001100111010010101111111100011,
+ 0b0010001001011000111000001100110111110111110010100011000110110110,
+ 0b0101010110000000010000100000110100111011111101000100000111010010,
+ 0b0110000011011101000001010100110101101110011100110101000000001001,
+ 0b1101100110100000011000001111000100100100110001100110101010101100,
+ 0b0010100101010110010010001010101000011111111111001011001010001111,
+ 0b0111001010001111001100111001010101001000110101000011110000001000,
+ 0b0110010011001001001111110001010010001011010010001101110110110011,
+ 0b0110010100111011000100111000001001101011111001110010111110111111,
+ 0b0101110111001001101100110100101001110010101110011001101110001000,
+ 0b0100110101010111011010001100010111100011010011111001010100111000,
+ 0b0111000110110111011110100100010111000110000110110110110001111110,
+ 0b1000101101010100100100111110100011110110110010011001110011110101,
+ 0b1001101110101001010100111101101011000101000010110101101111110000,
+ 0b0100100101001011011001001011000010001101001010010001010110101000,
+ 0b0010100001001011100110101000010110000111000111000011100101011011,
+ 0b0110111000011001111101101011111010001000000010101000101010011110,
+ 0b1000110110100001111011000001111100001001000000010110010100100100,
+ 0b1001110100011111100111101011010000010101011100101000010010100110,
+ 0b0001010110101110100010101010001110110110100011101010001001111100,
+ 0b1010100101101100000010110011100110100010010000100100001110000100,
+ 0b0001000000010000001010000010100110000001110100111001110111101101,
+ 0b1100000000000000000000000000000000000000000000000000000000000000 };
+ };
+#elif LONG_DOUBLE_KIND == LDK_BINARY128
+ template<>
+ struct floating_type_traits<long double>
+ {
+ static constexpr int mantissa_bits = 112;
+ static constexpr int exponent_bits = 15;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = unsigned __int128;
+ using shortest_scientific_t = ryu::floating_decimal_128;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000000000000000000000000000000100000010000000,
+ 0b1011001111110100000100010101101110011100100110000110010110011000,
+ 0b1010100010001101111111000000001101010010100010010000111011110111,
+ 0b1011111001110001111000011111000010110111000111110100101010100101,
+ 0b0110100110011110011011000011000010011001110001001001010011100011,
+ 0b0000011111110010101111101011101010000110011111100111001110100111,
+ 0b0100010101010110000010111011110100000010011001001010001110111101,
+ 0b1101110111000010001101100000110100000111001001101011000101011011,
+ 0b0100111011101101010000001101011000101100101110010010110000101011,
+ 0b0100000110111000000110101000010011101000110100010110000011101101,
+ 0b1011001101001000100001010001100100001111011101010101110001010110,
+ 0b1000000001000000101001110010110010001111101101010101001100000110,
+ 0b0101110110100110000110000001001010111110001110010000111111010011,
+ 0b1010001111100111000100011100100100111100100101000001011001000111,
+ 0b1010011000011100110101100111001011100101111111100001110100000100,
+ 0b1100011100100010100000110001001010000000100000001001010111011101,
+ 0b0101110000100011001111101101000000100110000010010111010001111010,
+ 0b0100111100011010110111101000100110000111001001101100000001111100,
+ 0b1100100100111110101011000100000101011010110111000111110100110101,
+ 0b0110010000010111010100110011000000111010000010111011010110000100,
+ 0b0101001001010010110111010111000101011100000111100111000001110010,
+ 0b1101111111001011101010110001000111011010111101001011010110100100,
+ 0b0001000100110000011111101011001101110010110110010000000011100100,
+ 0b0001000000000101001001001000000000011000100011001110101001001110,
+ 0b0010010010001000111010011011100001000110011011011110110100111000,
+ 0b0000100110101100000111100010100100011100110111011100001111001100,
+ 0b1011111010001110001100000011110111111111100000001011111111101100,
+ 0b0000011100001111010101110000100110111100101101110111101001000001,
+ 0b1100010001110110111100001001001101101000011100000010110101001011,
+ 0b0100101001101011111001011110101101100011011111011100101010101111,
+ 0b0001101001111001110000101101101100001011010001011110011101000010,
+ 0b1111000000101001101111011010110011101110100001011011001011100010,
+ 0b0101001010111101101100001111100010010110001101001000001101100100,
+ 0b0101100101011110001100101011111000111001111001001001101101100001,
+ 0b1111001101010010100100011011000110110010001111000111010001001101,
+ 0b0001110010011000000001000110110111011000011100001000011001110111,
+ 0b0100001011011011011011110011101100100101111111101100101000001110,
+ 0b0101011110111101010111100111101111000101111111111110100011011010,
+ 0b1110101010001001110100000010110111010111111010111110100110010110,
+ 0b1010001111100001001100101000110100001100011100110010000011010111,
+ 0b1111111101101111000100111100000101011000001110011011101010111001,
+ 0b1111101100001110100101111101011001000100000101110000110010100011,
+ 0b1001010110110101101101000101010001010000101011011111010011010000,
+ 0b0111001110110011101001100111000001000100001010110000010000001101,
+ 0b0101111100111110100111011001111001111011011110010111010011101010,
+ 0b1110111000000001100100111001100100110001011011001110101111110111,
+ 0b0001010001001101010111101010011111000011110001101101011001111111,
+ 0b0101000011100011010010001101100001011101011010100110101100100010,
+ 0b0001000101011000100101111100110110000101101101111000110001001011,
+ 0b0101100101001011011000010101000000010100011100101101000010011111,
+ 0b1000010010001011101001011010100010111011110100110011011000100111,
+ 0b1000011011100001010111010111010011101100100010010010100100101001,
+ 0b1001001001010111110101000010111010000000101111010100001010010010,
+ 0b0011011110110010010101111011000001000000000011011111000011111011,
+ 0b1011000110100011001110000001000100000001011100010111010010011110,
+ 0b0111101110110101110111110000011000000100011100011000101101101110,
+ 0b1001100101111011011100011110101011001111100111101010101010110111,
+ 0b1100110010010001100011001111010000000100011101001111011101001111,
+ 0b1000111001111010100101000010000100000001001100101010001011001101,
+ 0b0011101011110000110010100101010100110010100001000010101011111101,
+ 0b1100000000000110000010101011000000011101000110011111100010111111,
+ 0b0010100110000011011100010110111100010110101100110011101110001101,
+ 0b0010111101010011111000111001111100110111111100100011110001101110,
+ 0b1001110111001001101001001001011000010100110001000000100011010110,
+ 0b0011110101100111011011111100001000011001010100111100100101111010,
+ 0b0010001101000011000010100101110000010101101000100110000100001010,
+ 0b0010000010100110010101100101110011101111000111111111001001100001,
+ 0b0100111111011011011011100111111011000010011101101111011111110110,
+ 0b1111111111010110101011101000100101110100001110001001101011100111,
+ 0b1011111101000101110000111100100010111010100001010000010010110010,
+ 0b1111010101001011101011101010000100110110001110111100100110111111,
+ 0b1011001101000001001101000010101010010110010001100001011100011010,
+ 0b0101001011011101010001110100010000010001111100100100100001001101,
+ 0b0010100000111001100011000101100101000001111100111001101000000010,
+ 0b1011001111010101011001000100100110100100110111110100000110111000,
+ 0b0101011111010011100011010010111101110010100001111111100010001001,
+ 0b0010111011101100100000000000001111111010011101100111100001001101,
+ 0b1101000000000000000000000000000000000000000000000000000000000000 };
+ };
+#elif LONG_DOUBLE_KIND == LDK_IBM128
+ template<>
+ struct floating_type_traits<long double>
+ {
+ static constexpr int mantissa_bits = 105;
+ static constexpr int exponent_bits = 11;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = unsigned __int128;
+ using shortest_scientific_t = ryu::floating_decimal_128;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000000000000000000000000000001000000100000000,
+ 0b0000000000000000000100000000000000000000001000000000000000000010,
+ 0b0000100000000000000000001001000000000000000001100100000000000000,
+ 0b0011000000000000000000000000000001110000010000000000000000000000,
+ 0b0000100000000000001000000000000000000000000000100000000000000000 };
+ };
+#endif
+
+ // An IEEE-style decomposition of a floating-point value of type T.
+ template<typename T>
+ struct ieee_t
+ {
+ typename floating_type_traits<T>::mantissa_t mantissa;
+ uint32_t biased_exponent;
+ bool sign;
+ };
+
+ // Decompose the floating-point value into its IEEE components.
+ template<typename T>
+ ieee_t<T>
+ get_ieee_repr(const T value)
+ {
+ constexpr int mantissa_bits = floating_type_traits<T>::mantissa_bits;
+ constexpr int exponent_bits = floating_type_traits<T>::exponent_bits;
+ constexpr int total_bits = mantissa_bits + exponent_bits + 1;
+
+ constexpr auto get_uint_t = [] {
+ if constexpr (total_bits <= 32)
+ return uint32_t{};
+ else if constexpr (total_bits <= 64)
+ return uint64_t{};
+#ifdef __SIZEOF_INT128__
+ else if constexpr (total_bits <= 128)
+ return (unsigned __int128){};
+#endif
+ };
+ using uint_t = decltype(get_uint_t());
+ uint_t value_bits = 0;
+ memcpy(&value_bits, &value, sizeof(value));
+
+ ieee_t<T> ieee_repr;
+ ieee_repr.mantissa = value_bits & ((uint_t{1} << mantissa_bits) - 1u);
+ ieee_repr.biased_exponent
+ = (value_bits >> mantissa_bits) & ((uint_t{1} << exponent_bits) - 1u);
+ ieee_repr.sign = (value_bits >> (mantissa_bits + exponent_bits)) & 1;
+ return ieee_repr;
+ }
+
+#if LONG_DOUBLE_KIND == LDK_IBM128
+ template<>
+ ieee_t<long double>
+ get_ieee_repr(const long double value)
+ {
+ // The layout of __ibm128 isn't compatible with the standard IEEE format.
+ // So we transform it into an IEEE-compatible format, suitable for
+ // consumption by the generic Ryu API, with an 11-bit exponent and 105-bit
+ // mantissa (plus an implicit leading bit). We use the exponent and sign
+ // of the high part, and we merge the mantissa of the high part with the
+ // mantissa (and the implicit leading bit) of the low part.
+ using uint_t = unsigned __int128;
+ uint_t value_bits = 0;
+ memcpy(&value_bits, &value, sizeof(value_bits));
+
+ const uint64_t value_hi = value_bits;
+ const uint64_t value_lo = value_bits >> 64;
+
+ uint64_t mantissa_hi = value_hi & ((1ull << 52) - 1);
+ unsigned exponent_hi = (value_hi >> 52) & ((1ull << 11) - 1);
+ const int sign_hi = (value_hi >> 63) & 1;
+
+ uint64_t mantissa_lo = value_lo & ((1ull << 52) - 1);
+ const unsigned exponent_lo = (value_lo >> 52) & ((1ull << 11) - 1);
+ const int sign_lo = (value_lo >> 63) & 1;
+
+ {
+ // The following code for adjusting the low-part mantissa to combine
+ // it with the high-part mantissa is taken from the glibc source file
+ // sysdeps/ieee754/ldbl-128ibm/printf_fphex.c.
+ mantissa_lo <<= 7;
+ if (exponent_lo != 0)
+ mantissa_lo |= (1ull << (52 + 7));
+ else
+ mantissa_lo <<= 1;
+
+ const int ediff = exponent_hi - exponent_lo - 53;
+ if (ediff > 63)
+ mantissa_lo = 0;
+ else if (ediff > 0)
+ mantissa_lo >>= ediff;
+ else if (ediff < 0)
+ mantissa_lo <<= -ediff;
+
+ if (sign_lo != sign_hi && mantissa_lo != 0)
+ {
+ mantissa_lo = (1ull << 60) - mantissa_lo;
+ if (mantissa_hi == 0)
+ {
+ mantissa_hi = 0xffffffffffffeLL | (mantissa_lo >> 59);
+ mantissa_lo = 0xfffffffffffffffLL & (mantissa_lo << 1);
+ exponent_hi--;
+ }
+ else
+ mantissa_hi--;
+ }
+ }
+
+ ieee_t<long double> ieee_repr;
+ ieee_repr.mantissa = ((uint_t{mantissa_hi} << 64)
+ | (uint_t{mantissa_lo} << 4)) >> 11;
+ ieee_repr.biased_exponent = exponent_hi;
+ ieee_repr.sign = sign_hi;
+ return ieee_repr;
+ }
+#endif
+
+ // Invoke Ryu to obtain the shortest scientific form for the given
+ // floating-point number.
+ template<typename T>
+ typename floating_type_traits<T>::shortest_scientific_t
+ floating_to_shortest_scientific(const T value)
+ {
+ if constexpr (std::is_same_v<T, float>)
+ return ryu::floating_to_fd32(value);
+ else if constexpr (std::is_same_v<T, double>)
+ return ryu::floating_to_fd64(value);
+#ifdef __SIZEOF_INT128__
+ else if constexpr (std::is_same_v<T, long double>)
+ {
+ constexpr int mantissa_bits
+ = floating_type_traits<T>::mantissa_bits;
+ constexpr int exponent_bits
+ = floating_type_traits<T>::exponent_bits;
+ constexpr bool has_implicit_leading_bit
+ = floating_type_traits<T>::has_implicit_leading_bit;
+
+ const auto [mantissa, exponent, sign] = get_ieee_repr(value);
+ return ryu::generic_binary_to_decimal(mantissa, exponent, sign,
+ mantissa_bits, exponent_bits,
+ !has_implicit_leading_bit);
+ }
+#endif
+ }
+
+ // This subroutine returns true if the shortest scientific form fd is a
+ // positive power of 10, and the floating-point number that has this shortest
+ // scientific form is smaller than this power of 10.
+ //
+ // For instance, the exactly-representable 64-bit number
+ // 99999999999999991611392.0 has the shortest scientific form 1e23, so its
+ // exact value is smaller than its shortest scientific form.
+ //
+ // For these powers of 10 the length of the fixed form is one digit less
+ // than what the scientific exponent suggests.
+ //
+ // This subroutine inspects a lookup table to detect when fd is such a
+ // "rounded up" power of 10.
+ template<typename T>
+ bool
+ is_rounded_up_pow10_p(const typename
+ floating_type_traits<T>::shortest_scientific_t fd)
+ {
+ if (fd.exponent < 0 || fd.mantissa != 1) [[likely]]
+ return false;
+
+ constexpr auto& pow10_adjustment_tab
+ = floating_type_traits<T>::pow10_adjustment_tab;
+ __glibcxx_assert(fd.exponent/64 < (int)std::size(pow10_adjustment_tab));
+ return (pow10_adjustment_tab[fd.exponent/64]
+ & (1ull << (63 - fd.exponent%64)));
+ }
+
+ int
+ get_mantissa_length(const ryu::floating_decimal_32 fd)
+ { return ryu::decimalLength9(fd.mantissa); }
+
+ int
+ get_mantissa_length(const ryu::floating_decimal_64 fd)
+ { return ryu::decimalLength17(fd.mantissa); }
+
+#ifdef __SIZEOF_INT128__
+ int
+ get_mantissa_length(const ryu::floating_decimal_128 fd)
+ { return ryu::generic128::decimalLength(fd.mantissa); }
+#endif
+} // anon namespace
+
+namespace std _GLIBCXX_VISIBILITY(default)
+{
+_GLIBCXX_BEGIN_NAMESPACE_VERSION
+
+// This subroutine of __floating_to_chars_* handles writing nan, inf and 0 in
+// all formatting modes.
+template<typename T>
+ static optional<to_chars_result>
+ __handle_special_value(char* first, char* const last, const T value,
+ const chars_format fmt, const int precision)
+ {
+ __glibcxx_assert(precision >= 0);
+
+ string_view str;
+ switch (__builtin_fpclassify(FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL,
+ FP_ZERO, value))
+ {
+ case FP_INFINITE:
+ str = "-inf";
+ break;
+
+ case FP_NAN:
+ str = "-nan";
+ break;
+
+ case FP_ZERO:
+ break;
+
+ default:
+ case FP_SUBNORMAL:
+ case FP_NORMAL: [[likely]]
+ return nullopt;
+ }
+
+ if (!str.empty())
+ {
+ // We're formatting +-inf or +-nan.
+ if (!__builtin_signbit(value))
+ str.remove_prefix(strlen("-"));
+
+ if (last - first < (int)str.length())
+ return {{last, errc::value_too_large}};
+
+ memcpy(first, &str[0], str.length());
+ first += str.length();
+ return {{first, errc{}}};
+ }
+
+ // We're formatting 0.
+ __glibcxx_assert(value == 0);
+ const auto orig_first = first;
+ const bool sign = __builtin_signbit(value);
+ int expected_output_length;
+ switch (fmt)
+ {
+ case chars_format::fixed:
+ case chars_format::scientific:
+ case chars_format::hex:
+ expected_output_length = sign + 1;
+ if (precision)
+ expected_output_length += strlen(".") + precision;
+ if (fmt == chars_format::scientific)
+ expected_output_length += strlen("e+00");
+ else if (fmt == chars_format::hex)
+ expected_output_length += strlen("p+0");
+ if (last - first < expected_output_length)
+ return {{last, errc::value_too_large}};
+
+ if (sign)
+ *first++ = '-';
+ *first++ = '0';
+ if (precision)
+ {
+ *first++ = '.';
+ memset(first, '0', precision);
+ first += precision;
+ }
+ if (fmt == chars_format::scientific)
+ {
+ memcpy(first, "e+00", 4);
+ first += 4;
+ }
+ else if (fmt == chars_format::hex)
+ {
+ memcpy(first, "p+0", 3);
+ first += 3;
+ }
+ break;
+
+ case chars_format::general:
+ default: // case chars_format{}:
+ expected_output_length = sign + 1;
+ if (last - first < expected_output_length)
+ return {{last, errc::value_too_large}};
+
+ if (sign)
+ *first++ = '-';
+ *first++ = '0';
+ break;
+ }
+ __glibcxx_assert(first - orig_first == expected_output_length);
+ return {{first, errc{}}};
+ }
+
+// This subroutine of the floating-point to_chars overloads performs
+// hexadecimal formatting.
+template<typename T>
+ static to_chars_result
+ __floating_to_chars_hex(char* first, char* const last, const T value,
+ const optional<int> precision)
+ {
+ if (precision.has_value() && precision.value() < 0) [[unlikely]]
+ // A negative precision argument is treated as if it were omitted.
+ return __floating_to_chars_hex(first, last, value, nullopt);
+
+ __glibcxx_requires_valid_range(first, last);
+
+ constexpr int mantissa_bits = floating_type_traits<T>::mantissa_bits;
+ constexpr bool has_implicit_leading_bit
+ = floating_type_traits<T>::has_implicit_leading_bit;
+ constexpr int exponent_bits = floating_type_traits<T>::exponent_bits;
+ constexpr int exponent_bias = (1u << (exponent_bits - 1)) - 1;
+ using mantissa_t = typename floating_type_traits<T>::mantissa_t;
+ constexpr int mantissa_t_width = sizeof(mantissa_t) * __CHAR_BIT__;
+
+ if (auto result = __handle_special_value(first, last, value,
+ chars_format::hex,
+ precision.value_or(0)))
+ return *result;
+
+ // Extract the sign, mantissa and exponent from the value.
+ const auto [ieee_mantissa, biased_exponent, sign] = get_ieee_repr(value);
+ const bool is_normal_number = (biased_exponent != 0);
+
+ // Calculate the unbiased exponent.
+ const int32_t unbiased_exponent = (is_normal_number
+ ? biased_exponent - exponent_bias
+ : 1 - exponent_bias);
+
+ // Shift the mantissa so that its bitwidth is a multiple of 4.
+ constexpr unsigned rounded_mantissa_bits = (mantissa_bits + 3) / 4 * 4;
+ static_assert(mantissa_t_width >= rounded_mantissa_bits);
+ mantissa_t effective_mantissa
+ = ieee_mantissa << (rounded_mantissa_bits - mantissa_bits);
+ if (is_normal_number)
+ {
+ if constexpr (has_implicit_leading_bit)
+ // Restore the mantissa's implicit leading bit.
+ effective_mantissa |= mantissa_t{1} << rounded_mantissa_bits;
+ else
+ // The explicit mantissa bit should already be set.
+ __glibcxx_assert(effective_mantissa & (mantissa_t{1} << (mantissa_bits
+ - 1u)));
+ }
+
+ // Compute the shortest precision needed to print this value exactly,
+ // disregarding trailing zeros.
+ constexpr int full_hex_precision = (has_implicit_leading_bit
+ ? (mantissa_bits + 3) / 4
+ // With an explicit leading bit, we
+ // use the four leading nibbles as the
+ // hexit before the decimal point.
+ : (mantissa_bits - 4 + 3) / 4);
+ const int trailing_zeros = __countr_zero(effective_mantissa) / 4;
+ const int shortest_full_precision = full_hex_precision - trailing_zeros;
+ __glibcxx_assert(shortest_full_precision >= 0);
+
+ int written_exponent = unbiased_exponent;
+ const int effective_precision = precision.value_or(shortest_full_precision);
+ if (effective_precision < shortest_full_precision)
+ {
+ // When limiting the precision, we need to determine how to round the
+ // least significant printed hexit. The following branchless
+ // bit-level-parallel technique computes whether to round up the
+ // mantissa bit at index N (according to round-to-nearest rules) when
+ // dropping N bits of precision, for each index N in the bit vector.
+ // This technique is borrowed from the MSVC implementation.
+ using bitvec = mantissa_t;
+ const bitvec round_bit = effective_mantissa << 1;
+ const bitvec has_tail_bits = round_bit - 1;
+ const bitvec lsb_bit = effective_mantissa;
+ const bitvec should_round = round_bit & (has_tail_bits | lsb_bit);
+
+ const int dropped_bits = 4*(full_hex_precision - effective_precision);
+ // Mask out the dropped nibbles.
+ effective_mantissa >>= dropped_bits;
+ effective_mantissa <<= dropped_bits;
+ if (should_round & (mantissa_t{1} << dropped_bits))
+ {
+ // Round up the least significant nibble.
+ effective_mantissa += mantissa_t{1} << dropped_bits;
+ // Check and adjust for overflow of the leading nibble. When the
+ // type has an implicit leading bit, then the leading nibble
+ // before rounding is either 0 or 1, so it can't overflow.
+ if constexpr (!has_implicit_leading_bit)
+ {
+ // The only supported floating-point type with explicit
+ // leading mantissa bit is LDK_FLOAT80, i.e. x86 80-bit
+ // extended precision, and so we hardcode the below overflow
+ // check+adjustment for this type.
+ static_assert(mantissa_t_width == 64
+ && rounded_mantissa_bits == 64);
+ if (effective_mantissa == 0)
+ {
+ // We rounded up the least significant nibble and the
+ // mantissa overflowed, e.g f.fcp+10 with precision=1
+ // became 10.0p+10. Absorb this extra hexit into the
+ // exponent to obtain 1.0p+14.
+ effective_mantissa
+ = mantissa_t{1} << (rounded_mantissa_bits - 4);
+ written_exponent += 4;
+ }
+ }
+ }
+ }
+
+ // Compute the leading hexit and mask it out from the mantissa.
+ char leading_hexit;
+ if constexpr (has_implicit_leading_bit)
+ {
+ const unsigned nibble = effective_mantissa >> rounded_mantissa_bits;
+ __glibcxx_assert(nibble <= 2);
+ leading_hexit = '0' + nibble;
+ effective_mantissa &= ~(mantissa_t{0b11} << rounded_mantissa_bits);
+ }
+ else
+ {
+ const unsigned nibble = effective_mantissa >> (rounded_mantissa_bits-4);
+ __glibcxx_assert(nibble < 16);
+ leading_hexit = "0123456789abcdef"[nibble];
+ effective_mantissa &= ~(mantissa_t{0b1111} << (rounded_mantissa_bits-4));
+ written_exponent -= 3;
+ }
+
+ // Now before we start writing the string, determine the total length of
+ // the output string and perform a single bounds check.
+ int expected_output_length = sign + 1;
+ if (effective_precision != 0)
+ expected_output_length += strlen(".") + effective_precision;
+ const int abs_written_exponent = abs(written_exponent);
+ expected_output_length += (abs_written_exponent >= 10000 ? strlen("p+ddddd")
+ : abs_written_exponent >= 1000 ? strlen("p+dddd")
+ : abs_written_exponent >= 100 ? strlen("p+ddd")
+ : abs_written_exponent >= 10 ? strlen("p+dd")
+ : strlen("p+d"));
+ if (last - first < expected_output_length)
+ return {last, errc::value_too_large};
+
+ const auto saved_first = first;
+ // Write the negative sign and the leading hexit.
+ if (sign)
+ *first++ = '-';
+ *first++ = leading_hexit;
+
+ if (effective_precision > 0)
+ {
+ *first++ = '.';
+ int written_hexits = 0;
+ // Extract and mask out the leading nibble after the decimal point,
+ // write its corresponding hexit, and repeat until the mantissa is
+ // empty.
+ int nibble_offset = rounded_mantissa_bits;
+ if constexpr (!has_implicit_leading_bit)
+ // We already printed the entire leading hexit.
+ nibble_offset -= 4;
+ while (effective_mantissa != 0)
+ {
+ nibble_offset -= 4;
+ const unsigned nibble = effective_mantissa >> nibble_offset;
+ __glibcxx_assert(nibble < 16);
+ *first++ = "0123456789abcdef"[nibble];
+ ++written_hexits;
+ effective_mantissa &= ~(mantissa_t{0b1111} << nibble_offset);
+ }
+ __glibcxx_assert(nibble_offset >= 0);
+ __glibcxx_assert(written_hexits <= effective_precision);
+ // Since the mantissa is now empty, every hexit hereafter must be '0'.
+ if (int remaining_hexits = effective_precision - written_hexits)
+ {
+ memset(first, '0', remaining_hexits);
+ first += remaining_hexits;
+ }
+ }
+
+ // Finally, write the exponent.
+ *first++ = 'p';
+ if (written_exponent >= 0)
+ *first++ = '+';
+ const to_chars_result result = to_chars(first, last, written_exponent);
+ __glibcxx_assert(result.ec == errc{}
+ && result.ptr == saved_first + expected_output_length);
+ return result;
+ }
+
+template<typename T>
+ static to_chars_result
+ __floating_to_chars_shortest(char* first, char* const last, const T value,
+ chars_format fmt)
+ {
+ if (fmt == chars_format::hex)
+ return __floating_to_chars_hex(first, last, value, nullopt);
+
+ __glibcxx_assert(fmt == chars_format::fixed
+ || fmt == chars_format::scientific
</cut>
[TCWG CI] Regression caused by binutils: PR28149, debug info with wrong file association:
commit 51298b330327a568358da069d9808f51c6cb1672
Author: Alan Modra <amodra(a)gmail.com>
PR28149, debug info with wrong file association
Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
6363
# First few build errors in logs:
from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
7116
# linux build successful:
all
# linux boot successful:
boot
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_kernel/gnu-master-aarch64-lts-defconfig
First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Reproduce builds:
<cut>
mkdir investigate-binutils-51298b330327a568358da069d9808f51c6cb1672
cd investigate-binutils-51298b330327a568358da069d9808f51c6cb1672
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /binutils/ ./ ./bisect/baseline/
cd binutils
# Reproduce first_bad build
git checkout --detach 51298b330327a568358da069d9808f51c6cb1672
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 5cdb4f14426a99ec8fcba843fa503efdc55fa078
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 51298b330327a568358da069d9808f51c6cb1672
Author: Alan Modra <amodra(a)gmail.com>
Date: Fri Sep 17 09:08:15 2021 +0930
PR28149, debug info with wrong file association
gcc-11 and gcc-12 pass -gdwarf-5 to gas, in order to prime gas for
DWARF 5 level debug info. Unfortunately it seems there are cases
where the compiler does not emit a .file or .loc dwarf debug directive
before any machine instructions. (Note that the .file directive
typically emitted as the first line of assembly output doesn't count as
a dwarf debug directive. The dwarf .file has a file number before the
file name string.)
This patch delays allocation of file numbers for gas generated line
debug info until the end of assembly, thus avoiding any clashes with
compiler generated file numbers. Two fixes for test case source are
necessary; A .loc can't use a file number that hasn't already been
specified with .file.
A followup patch will remove all the gas generated line info on
seeing a .file directive.
PR 28149
* dwarf2dbg.c (num_of_auto_assigned): Delete.
(current): Update initialisation.
(set_or_check_view): Replace all accesses to view with u.view.
(dwarf2_consume_line_info): Likewise.
(dwarf2_directive_loc): Likewise. Assert that we aren't generating
line info.
(dwarf2_gen_line_info_1): Don't call set_or_check_view on
gas generated line entries.
(dwarf2_gen_line_info): Set and track filenames for gas generated
line entries. Simplify generation of labels.
(get_directory_table_entry): Use filename_cmp when comparing dirs.
(do_allocate_filenum): New function.
(dwarf2_where): Set u.filename and filenum to -1 for gas generated
line entries.
(dwarf2_directive_filename): Remove num_of_auto_assigned handling.
(process_entries): Update view field access. Call
do_allocate_filenum.
* dwarf2dbg.h (struct dwarf2_line_info): Add filename field in
union aliasing view.
* testsuite/gas/i386/dwarf2-line-3.s: Add .file directive.
* testsuite/gas/i386/dwarf2-line-4.s: Likewise.
* testsuite/gas/i386/dwarf2-line-4.d: Update expected output.
* testsuite/gas/i386/dwarf4-line-1.d: Likewise.
* testsuite/gas/i386/dwarf5-line-1.d: Likewise.
* testsuite/gas/i386/dwarf5-line-2.d: Likewise.
---
gas/dwarf2dbg.c | 152 ++++++++++++++++++---------------
gas/dwarf2dbg.h | 7 +-
gas/testsuite/gas/i386/dwarf2-line-3.s | 1 +
gas/testsuite/gas/i386/dwarf2-line-4.d | 5 +-
gas/testsuite/gas/i386/dwarf2-line-4.s | 1 +
gas/testsuite/gas/i386/dwarf4-line-1.d | 4 +-
gas/testsuite/gas/i386/dwarf5-line-1.d | 4 +-
gas/testsuite/gas/i386/dwarf5-line-2.d | 3 +-
8 files changed, 105 insertions(+), 72 deletions(-)
diff --git a/gas/dwarf2dbg.c b/gas/dwarf2dbg.c
index 9e3437b8948..c6303ba94a6 100644
--- a/gas/dwarf2dbg.c
+++ b/gas/dwarf2dbg.c
@@ -207,7 +207,6 @@ struct file_entry
static struct file_entry *files;
static unsigned int files_in_use;
static unsigned int files_allocated;
-static unsigned int num_of_auto_assigned;
/* Table of directories used by .debug_line. */
static char ** dirs = NULL;
@@ -233,7 +232,7 @@ static struct dwarf2_line_info current =
{
1, 1, 0, 0,
DWARF2_LINE_DEFAULT_IS_STMT ? DWARF2_FLAG_IS_STMT : 0,
- 0, NULL
+ 0, { NULL }
};
/* This symbol is used to recognize view number forced resets in loc
@@ -342,7 +341,7 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
/* First, compute !(E->label > P->label), to tell whether or not
we're to reset the view number. If we can't resolve it to a
constant, keep it symbolic. */
- if (!p || (e->loc.view == force_reset_view && force_reset_view))
+ if (!p || (e->loc.u.view == force_reset_view && force_reset_view))
{
viewx.X_op = O_constant;
viewx.X_add_number = 0;
@@ -367,9 +366,9 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
}
}
- if (S_IS_DEFINED (e->loc.view) && symbol_constant_p (e->loc.view))
+ if (S_IS_DEFINED (e->loc.u.view) && symbol_constant_p (e->loc.u.view))
{
- expressionS *value = symbol_get_value_expression (e->loc.view);
+ expressionS *value = symbol_get_value_expression (e->loc.u.view);
/* We can't compare the view numbers at this point, because in
VIEWX we've only determined whether we're to reset it so
far. */
@@ -404,16 +403,16 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
{
expressionS incv;
- if (!p->loc.view)
+ if (!p->loc.u.view)
{
- p->loc.view = symbol_temp_make ();
- gas_assert (!S_IS_DEFINED (p->loc.view));
+ p->loc.u.view = symbol_temp_make ();
+ gas_assert (!S_IS_DEFINED (p->loc.u.view));
}
memset (&incv, 0, sizeof (incv));
incv.X_unsigned = 1;
incv.X_op = O_symbol;
- incv.X_add_symbol = p->loc.view;
+ incv.X_add_symbol = p->loc.u.view;
incv.X_add_number = 1;
if (viewx.X_op == O_constant)
@@ -430,16 +429,16 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
}
}
- if (!S_IS_DEFINED (e->loc.view))
+ if (!S_IS_DEFINED (e->loc.u.view))
{
- symbol_set_value_expression (e->loc.view, &viewx);
- S_SET_SEGMENT (e->loc.view, expr_section);
- symbol_set_frag (e->loc.view, &zero_address_frag);
+ symbol_set_value_expression (e->loc.u.view, &viewx);
+ S_SET_SEGMENT (e->loc.u.view, expr_section);
+ symbol_set_frag (e->loc.u.view, &zero_address_frag);
}
/* Define and attempt to simplify any earlier views needed to
compute E's. */
- if (h && p && p->loc.view && !S_IS_DEFINED (p->loc.view))
+ if (h && p && p->loc.u.view && !S_IS_DEFINED (p->loc.u.view))
{
struct line_entry *h2;
/* Reverse the list to avoid quadratic behavior going backwards
@@ -459,7 +458,9 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
break;
set_or_check_view (r, r->next, NULL);
}
- while (r->next && r->next->loc.view && !S_IS_DEFINED (r->next->loc.view)
+ while (r->next
+ && r->next->loc.u.view
+ && !S_IS_DEFINED (r->next->loc.u.view)
&& (r = r->next));
/* Unreverse the list, so that we can go forward again. */
@@ -475,14 +476,14 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
view of the previous subsegment. */
if (r == h)
continue;
- gas_assert (S_IS_DEFINED (r->loc.view));
- resolve_expression (symbol_get_value_expression (r->loc.view));
+ gas_assert (S_IS_DEFINED (r->loc.u.view));
+ resolve_expression (symbol_get_value_expression (r->loc.u.view));
}
while (r != p && (r = r->next));
/* Now that we've defined and computed all earlier views that might
be needed to compute E's, attempt to simplify it. */
- resolve_expression (symbol_get_value_expression (e->loc.view));
+ resolve_expression (symbol_get_value_expression (e->loc.u.view));
}
}
@@ -518,10 +519,8 @@ dwarf2_gen_line_info_1 (symbolS *label, struct dwarf2_line_info *loc)
/* Subseg heads are chained to previous subsegs in
dwarf2_finish. */
- if (loc->view && lss->head)
- set_or_check_view (e,
- (struct line_entry *)lss->ptail,
- lss->head);
+ if (loc->filenum != -1u && loc->u.view && lss->head)
+ set_or_check_view (e, (struct line_entry *) lss->ptail, lss->head);
*lss->ptail = e;
lss->ptail = &e->next;
@@ -532,9 +531,6 @@ dwarf2_gen_line_info_1 (symbolS *label, struct dwarf2_line_info *loc)
void
dwarf2_gen_line_info (addressT ofs, struct dwarf2_line_info *loc)
{
- static unsigned int line = -1;
- static unsigned int filenum = -1;
-
symbolS *sym;
/* Early out for as-yet incomplete location information. */
@@ -552,20 +548,35 @@ dwarf2_gen_line_info (addressT ofs, struct dwarf2_line_info *loc)
symbols apply to assembler code. It is necessary to emit
duplicate line symbols when a compiler asks for them, because GDB
uses them to determine the end of the prologue. */
- if (debug_type == DEBUG_DWARF2
- && line == loc->line && filenum == loc->filenum)
- return;
+ if (debug_type == DEBUG_DWARF2)
+ {
+ static unsigned int line = -1;
+ static const char *filename = NULL;
+
+ if (line == loc->line)
+ {
+ if (filename == loc->u.filename)
+ return;
+ if (filename_cmp (filename, loc->u.filename) == 0)
+ {
+ filename = loc->u.filename;
+ return;
+ }
+ }
- line = loc->line;
- filenum = loc->filenum;
+ line = loc->line;
+ filename = loc->u.filename;
+ }
if (linkrelax)
{
- char name[120];
+ static int label_num = 0;
+ char name[32];
/* Use a non-fake name for the line number location,
so that it can be referred to by relocations. */
- sprintf (name, ".Loc.%u.%u", line, filenum);
+ sprintf (name, ".Loc.%u", label_num);
+ label_num++;
sym = symbol_new (name, now_seg, frag_now, ofs);
}
else
@@ -624,13 +635,15 @@ get_directory_table_entry (const char *dirname,
{
const char * pwd = file0_dirname ? file0_dirname : getpwd ();
- if (dwarf_level >= 5 && strcmp (dirname, pwd) != 0)
+ if (dwarf_level >= 5 && filename_cmp (dirname, pwd) != 0)
{
- /* In DWARF-5 the 0 entry in the directory table is expected to be
- the same as the DW_AT_comp_dir (which is set to the current build
- directory). Since we are about to create a directory entry that
- is not the same, allocate the current directory first.
- FIXME: Alternatively we could generate an error message here. */
+ /* In DWARF-5 the 0 entry in the directory table is
+ expected to be the same as the DW_AT_comp_dir (which
+ is set to the current build directory). Since we are
+ about to create a directory entry that is not the
+ same, allocate the current directory first.
+ FIXME: Alternatively we could generate an error
+ message here. */
(void) get_directory_table_entry (pwd, NULL, strlen (pwd),
true);
d = 1;
@@ -745,14 +758,30 @@ allocate_filenum (const char * pathname)
if (!assign_file_to_slot (i, file, dir))
return -1;
- num_of_auto_assigned++;
-
last_used = i;
last_used_dir_len = dir_len;
return i;
}
+/* Run through the list of line entries starting at E, allocating
+ file entries for gas generated debug. */
+
+static void
+do_allocate_filenum (struct line_entry *e)
+{
+ do
+ {
+ if (e->loc.filenum == -1u)
+ {
+ e->loc.filenum = allocate_filenum (e->loc.u.filename);
+ e->loc.u.view = NULL;
+ }
+ e = e->next;
+ }
+ while (e);
+}
+
/* Allocate slot NUM in the .debug_line file table to FILENAME.
If DIRNAME is not NULL or there is a directory component to FILENAME
then this will be stored in the directory table, if not already present.
@@ -929,17 +958,12 @@ dwarf2_where (struct dwarf2_line_info *line)
{
if (debug_type == DEBUG_DWARF2)
{
- const char *filename;
-
- memset (line, 0, sizeof (*line));
- filename = as_where (&line->line);
- line->filenum = allocate_filenum (filename);
- /* FIXME: We should check the return value from allocate_filenum. */
+ line->u.filename = as_where (&line->line);
+ line->filenum = -1u;
line->column = 0;
line->flags = DWARF2_FLAG_IS_STMT;
line->isa = current.isa;
line->discriminator = current.discriminator;
- line->view = NULL;
}
else
*line = current;
@@ -1018,7 +1042,7 @@ dwarf2_consume_line_info (void)
| DWARF2_FLAG_PROLOGUE_END
| DWARF2_FLAG_EPILOGUE_BEGIN);
current.discriminator = 0;
- current.view = NULL;
+ current.u.view = NULL;
}
/* Called for each (preferably code) label. If dwarf2_loc_mark_labels
@@ -1060,7 +1084,6 @@ dwarf2_directive_filename (void)
char *filename;
const char * dirname = NULL;
int filename_len;
- unsigned int i;
/* Continue to accept a bare string and pass it off. */
SKIP_WHITESPACE ();
@@ -1132,18 +1155,6 @@ dwarf2_directive_filename (void)
return NULL;
}
- if (num_of_auto_assigned)
- {
- /* Clear slots auto-assigned before the first .file <NUMBER>
- directive was seen. */
- if (files_in_use != (num_of_auto_assigned + 1))
- abort ();
- for (i = 1; i < files_in_use; i++)
- files[i].filename = NULL;
- files_in_use = 0;
- num_of_auto_assigned = 0;
- }
-
if (! allocate_filename_to_slot (dirname, filename, (unsigned int) num,
with_md5))
return NULL;
@@ -1191,6 +1202,11 @@ dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED)
return;
}
+ /* debug_type will be turned off by dwarf2_directive_filename, and
+ if we don't have a dwarf style .file then files_in_use will be
+ zero and the above error will trigger. */
+ gas_assert (debug_type == DEBUG_NONE);
+
current.filenum = filenum;
current.line = line;
current.discriminator = 0;
@@ -1333,7 +1349,7 @@ dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED)
S_SET_VALUE (sym, 0);
symbol_set_frag (sym, &zero_address_frag);
}
- current.view = sym;
+ current.u.view = sym;
}
else
{
@@ -1347,10 +1363,9 @@ dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED)
demand_empty_rest_of_line ();
dwarf2_any_loc_directive_seen = dwarf2_loc_directive_seen = true;
- debug_type = DEBUG_NONE;
/* If we were given a view id, emit the row right away. */
- if (current.view)
+ if (current.u.view)
dwarf2_emit_insn (0);
}
@@ -1984,7 +1999,7 @@ process_entries (segT seg, struct line_entry *e)
frag_ofs = S_GET_VALUE (lab);
if (last_frag == NULL
- || (e->loc.view == force_reset_view && force_reset_view
+ || (e->loc.u.view == force_reset_view && force_reset_view
/* If we're going to reset the view, but we know we're
advancing the PC, we don't have to force with
set_address. We know we do when we're at the same
@@ -2850,16 +2865,19 @@ dwarf2_finish (void)
struct line_subseg *lss = s->head;
struct line_entry **ptail = lss->ptail;
+ if (lss->head && SEG_NORMAL (s->seg))
+ do_allocate_filenum (lss->head);
+
/* Reset the initial view of the first subsection of the
section. */
- if (lss->head && lss->head->loc.view)
+ if (lss->head && lss->head->loc.u.view)
set_or_check_view (lss->head, NULL, NULL);
while ((lss = lss->next) != NULL)
{
/* Link the first view of subsequent subsections to the
previous view. */
- if (lss->head && lss->head->loc.view)
+ if (lss->head && lss->head->loc.u.view)
set_or_check_view (lss->head,
!s->head ? NULL : (struct line_entry *)ptail,
s->head ? s->head->head : NULL);
diff --git a/gas/dwarf2dbg.h b/gas/dwarf2dbg.h
index 14d770c40dd..700d9dec5cb 100644
--- a/gas/dwarf2dbg.h
+++ b/gas/dwarf2dbg.h
@@ -36,7 +36,12 @@ struct dwarf2_line_info
unsigned int isa;
unsigned int flags;
unsigned int discriminator;
- symbolS *view;
+ /* filenum == -1u chooses filename, otherwise view. */
+ union
+ {
+ symbolS *view;
+ const char *filename;
+ } u;
};
/* Implements the .file FILENO "FILENAME" directive. FILENO can be 0
diff --git a/gas/testsuite/gas/i386/dwarf2-line-3.s b/gas/testsuite/gas/i386/dwarf2-line-3.s
index 2085ef93940..e933719fbc3 100644
--- a/gas/testsuite/gas/i386/dwarf2-line-3.s
+++ b/gas/testsuite/gas/i386/dwarf2-line-3.s
@@ -7,6 +7,7 @@
main:
.cfi_startproc
nop
+ .file 1 "dwarf2-test.c"
.loc 1 1
ret
.cfi_endproc
diff --git a/gas/testsuite/gas/i386/dwarf2-line-4.d b/gas/testsuite/gas/i386/dwarf2-line-4.d
index c0c85f4639f..a01fd0540f3 100644
--- a/gas/testsuite/gas/i386/dwarf2-line-4.d
+++ b/gas/testsuite/gas/i386/dwarf2-line-4.d
@@ -33,11 +33,14 @@ Raw dump of debug contents of section \.z?debug_line:
The File Name Table \(offset 0x.*\):
Entry Dir Time Size Name
- 1 1 0 0 dwarf2-line-4.s
+ 1 0 0 0 dwarf2-test.c
+ 2 1 0 0 dwarf2-line-4.s
Line Number Statements:
+ \[0x.*\] Set File Name to entry 2 in the File Name Table
\[0x.*\] Extended opcode 2: set Address to 0x0
\[0x.*\] Special opcode 13: advance Address by 0 to 0x0 and Line by 8 to 9
+ \[0x.*\] Set File Name to entry 1 in the File Name Table
\[0x.*\] Advance Line by -8 to 1
\[0x.*\] Special opcode 19: advance Address by 1 to 0x1 and Line by 0 to 1
\[0x.*\] Advance PC by 1 to 0x2
diff --git a/gas/testsuite/gas/i386/dwarf2-line-4.s b/gas/testsuite/gas/i386/dwarf2-line-4.s
index 89bb62d9db7..7348f4be62c 100644
--- a/gas/testsuite/gas/i386/dwarf2-line-4.s
+++ b/gas/testsuite/gas/i386/dwarf2-line-4.s
@@ -7,6 +7,7 @@
main:
.cfi_startproc
nop
+ .file 1 "dwarf2-test.c"
.loc 1 1
ret
.cfi_endproc
diff --git a/gas/testsuite/gas/i386/dwarf4-line-1.d b/gas/testsuite/gas/i386/dwarf4-line-1.d
index 4f8321e9bfd..8199efbb0c2 100644
--- a/gas/testsuite/gas/i386/dwarf4-line-1.d
+++ b/gas/testsuite/gas/i386/dwarf4-line-1.d
@@ -36,12 +36,14 @@ Raw dump of debug contents of section \.z?debug_line:
Entry Dir Time Size Name
1 0 0 0 foo.c
2 0 0 0 foo.h
+ 3 1 0 0 dwarf4-line-1.s
Line Number Statements:
+ \[0x.*\] Set File Name to entry 2 in the File Name Table
\[0x.*\] Extended opcode 2: set Address to 0x0
\[0x.*\] Advance Line by 81 to 82
\[0x.*\] Copy
- \[0x.*\] Set File Name to entry 2 in the File Name Table
+ \[0x.*\] Set File Name to entry 3 in the File Name Table
\[0x.*\] Advance Line by -73 to 9
\[0x.*\] Special opcode 19: advance Address by 1 to 0x1 and Line by 0 to 9
\[0x.*\] Advance PC by 3 to 0x4
diff --git a/gas/testsuite/gas/i386/dwarf5-line-1.d b/gas/testsuite/gas/i386/dwarf5-line-1.d
index f57fc47d269..2c2cf5696c4 100644
--- a/gas/testsuite/gas/i386/dwarf5-line-1.d
+++ b/gas/testsuite/gas/i386/dwarf5-line-1.d
@@ -36,12 +36,14 @@ Raw dump of debug contents of section \.z?debug_line:
0 \(indirect line string, offset: 0x.*\): .*/gas/testsuite
1 \(indirect line string, offset: 0x.*\): .*/gas/testsuite/gas/i386
- The File Name Table \(offset 0x.*, lines 2, columns 3\):
+ The File Name Table \(offset 0x.*, lines 3, columns 3\):
Entry Dir MD5 Name
0 0 0xbbd69fc03ce253b2dbaab2522dd519ae \(indirect line string, offset: 0x.*\): core.c
1 0 0x0 \(indirect line string, offset: 0x.*\): types.h
+ 2 1 0x0 \(indirect line string, offset: 0x.*\): dwarf5-line-1.s
Line Number Statements:
+ \[0x.*\] Set File Name to entry 2 in the File Name Table
\[0x.*\] Extended opcode 2: set Address to 0x0
\[0x.*\] Special opcode 8: advance Address by 0 to 0x0 and Line by 3 to 4
\[0x.*\] Advance PC by 1 to 0x1
diff --git a/gas/testsuite/gas/i386/dwarf5-line-2.d b/gas/testsuite/gas/i386/dwarf5-line-2.d
index 2f96df510d0..85f98c8ab9c 100644
--- a/gas/testsuite/gas/i386/dwarf5-line-2.d
+++ b/gas/testsuite/gas/i386/dwarf5-line-2.d
@@ -36,9 +36,10 @@ Raw dump of debug contents of section \.z?debug_line:
0 \(indirect line string, offset: 0x.*\): .*/gas/testsuite
1 \(indirect line string, offset: 0x.*\): .*/gas/testsuite/gas/i386
- The File Name Table \(offset 0x.*, lines 1, columns 3\):
+ The File Name Table \(offset 0x.*, lines 2, columns 3\):
Entry Dir MD5 Name
0 0 0xbbd69fc03ce253b2dbaab2522dd519ae \(indirect line string, offset: 0x.*\): core.c
+ 1 1 0x0 \(indirect line string, offset: .*\): dwarf5-line-2.s
Line Number Statements:
\[0x.*\] Extended opcode 2: set Address to 0x0
</cut>
Progress (short week, 3 days)
* UM-2 [QEMU upstream maintainership]
+ more code review, notably the Apple Silicon hvf support, which is
nearly ready to go in
* QEMU-406 [QEMU support for MVE (M-profile Vector Extension; Helium)]
+ Sent out v2 of the "optimized code gen for MVE" patchset;
this now covers all the insns that have an easy optimized version.
+ Fixed a bug where we weren't correctly setting up FPSCR.LTPSIZE
when using QEMU's user-mode-only emulator
+ Wrote some code to add support for the (not yet finalized) gdbstub
XML that tells GDB that the guest CPU has MVE. This causes a GDB
with the MVE handling to crash, so one or the other of us has
got something wrong :-)
KVM Forum was this week, as a 2-day virtual conference. I felt the
programme was comparatively a bit small this year, but there were some
interesting talks. Also a BoF session on whether/how we should
consider adding Rust code to QEMU: I am pushing for (a) a clearer
medium-to-long-term vision of where we would be going and why we'd be
doing this and (b) more design-sketch type work of "what would XYZ in
rust look like", which would hopefully both (a) make the benefit/lack
thereof a bit more clear and (b) demonstrate that there are enough
people enthusiastic enough about the prospect to make it a success...
-- PMM
After llvm commit 1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
Author: Amy Kwan <amy.kwan1(a)ibm.com>
[libc++][NFC] Mark values in gdb pretty print comparison functions as live to prevent values being optimized out.
the following hot functions grew in size by more than 10% (but their benchmarks grew in size by less than 1%):
- 447.dealII,[.] contract<3> grew in size by 164%
Benchmark:
Toolchain: Clang + Glibc + LLVM Linker
Version: all components were built from their latest release branch
Target: aarch64-linux-gnu
Compiler flags: -Oz
Hardware: APM Mustang 8x X-Gene1
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-release-aarch64-spec2k6-Oz
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-llvm-1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
cd investigate-llvm-1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach c8905f1bb304f1cfe297312ae0dda9946cb27594
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
Author: Amy Kwan <amy.kwan1(a)ibm.com>
Date: Fri Sep 3 14:53:57 2021 -0400
[libc++][NFC] Mark values in gdb pretty print comparison functions as live to prevent values being optimized out.
It appears when testing LLVM 13 on Power, we run into failures with the
`libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp` test case optimizing
values out.
Despite some the functions in the test already being marked with optnone,
adding the `MarkAsLive()` calls inside of the pretty printer comparison functions
resolves the issues of the values being optimized out.
This patch aims to address https://llvm.org/PR51675.
Differential Revision: https://reviews.llvm.org/D109204
(cherry picked from commit 217c6d643124be312f4a99b203118744edb9d54c)
---
libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp b/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp
index 2d8e9620089a..7c8d307d19fb 100644
--- a/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp
+++ b/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp
@@ -92,24 +92,28 @@ void MarkAsLive(Type &&) {}
template <typename TypeToPrint> void ComparePrettyPrintToChars(
TypeToPrint value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
template <typename TypeToPrint> void ComparePrettyPrintToRegex(
TypeToPrint value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
void CompareExpressionPrettyPrintToChars(
std::string value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
void CompareExpressionPrettyPrintToRegex(
std::string value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
</cut>
After gcc commit c416c52bcdb120db5e8c53a51bd78c4360daf79b
Author: Nathan Sidwell <nathan(a)acm.org>
c++ ICE with nested requirement as default tpl parm[PR94827]
the following benchmarks slowed down by more than 2%:
- 456.hmmer slowed down by 4%
Benchmark:
Toolchain: GCC + Glibc + GNU Linker
Version: all components were built from their latest release branch
Target: aarch64-linux-gnu
Compiler flags: -O3 -flto
Hardware: NVidia TX1 4x Cortex-A57
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tx1/gnu-release-aarch64-spec2k6-O3_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Reproduce builds:
<cut>
mkdir investigate-gcc-c416c52bcdb120db5e8c53a51bd78c4360daf79b
cd investigate-gcc-c416c52bcdb120db5e8c53a51bd78c4360daf79b
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach c416c52bcdb120db5e8c53a51bd78c4360daf79b
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach b1983f4582bbe060b7da83578acb9ed653681fc8
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit c416c52bcdb120db5e8c53a51bd78c4360daf79b
Author: Nathan Sidwell <nathan(a)acm.org>
Date: Thu Apr 30 08:23:16 2020 -0700
c++ ICE with nested requirement as default tpl parm[PR94827]
Template headers are not incrementally updated as we parse its parameters.
We maintain a dummy level until the closing > when we replace the dummy with
a real parameter set. requires processing was expecting a properly populated
arg_vec in current_template_parms, and then creates a self-mapping of parameters
from that. But we don't need to do that, just teach map_arguments to look at
TREE_VALUE when args is NULL.
* constraint.cc (map_arguments): If ARGS is null, it's a
self-mapping of parms.
(finish_nested_requirement): Do not pass argified
current_template_parms to normalization.
(tsubst_nested_requirement): Don't assert no template parms.
---
gcc/cp/ChangeLog | 10 ++++++++++
gcc/cp/constraint.cc | 27 ++++++++++++++++-----------
gcc/testsuite/g++.dg/concepts/pr94827.C | 15 +++++++++++++++
3 files changed, 41 insertions(+), 11 deletions(-)
diff --git a/gcc/cp/ChangeLog b/gcc/cp/ChangeLog
index 1fa0e123cb1..3c57945cecf 100644
--- a/gcc/cp/ChangeLog
+++ b/gcc/cp/ChangeLog
@@ -1,3 +1,13 @@
+2020-04-30 Jason Merrill <jason(a)redhat.com>
+ Nathan Sidwell <nathan(a)acm.org>
+
+ PR c++/94827
+ * constraint.cc (map_arguments): If ARGS is null, it's a
+ self-mapping of parms.
+ (finish_nested_requirement): Do not pass argified
+ current_template_parms to normalization.
+ (tsubst_nested_requirement): Don't assert no template parms.
+
2020-04-30 Iain Sandoe <iain(a)sandoe.co.uk>
PR c++/94886
diff --git a/gcc/cp/constraint.cc b/gcc/cp/constraint.cc
index 866b0f51b05..85513fecf43 100644
--- a/gcc/cp/constraint.cc
+++ b/gcc/cp/constraint.cc
@@ -546,12 +546,16 @@ static tree
map_arguments (tree parms, tree args)
{
for (tree p = parms; p; p = TREE_CHAIN (p))
- {
- int level;
- int index;
- template_parm_level_and_index (TREE_VALUE (p), &level, &index);
- TREE_PURPOSE (p) = TMPL_ARG (args, level, index);
- }
+ if (args)
+ {
+ int level;
+ int index;
+ template_parm_level_and_index (TREE_VALUE (p), &level, &index);
+ TREE_PURPOSE (p) = TMPL_ARG (args, level, index);
+ }
+ else
+ TREE_PURPOSE (p) = TREE_VALUE (p);
+
return parms;
}
@@ -2005,8 +2009,6 @@ tsubst_compound_requirement (tree t, tree args, subst_info info)
static tree
tsubst_nested_requirement (tree t, tree args, subst_info info)
{
- gcc_assert (!uses_template_parms (args));
-
/* Ensure that we're in an evaluation context prior to satisfaction. */
tree norm = TREE_VALUE (TREE_TYPE (t));
tree result = satisfy_constraint (norm, args, info);
@@ -2953,12 +2955,15 @@ finish_compound_requirement (location_t loc, tree expr, tree type, bool noexcept
tree
finish_nested_requirement (location_t loc, tree expr)
{
+ /* Currently open template headers have dummy arg vectors, so don't
+ pass into normalization. */
+ tree norm = normalize_constraint_expression (expr, NULL_TREE, false);
+ tree args = current_template_parms
+ ? template_parms_to_args (current_template_parms) : NULL_TREE;
+
/* Save the normalized constraint and complete set of normalization
arguments with the requirement. We keep the complete set of arguments
around for re-normalization during diagnostics. */
- tree args = current_template_parms
- ? template_parms_to_args (current_template_parms) : NULL_TREE;
- tree norm = normalize_constraint_expression (expr, args, false);
tree info = build_tree_list (args, norm);
/* Build the constraint, saving its normalization as its type. */
diff --git a/gcc/testsuite/g++.dg/concepts/pr94827.C b/gcc/testsuite/g++.dg/concepts/pr94827.C
new file mode 100644
index 00000000000..f14ec2551a1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/concepts/pr94827.C
@@ -0,0 +1,15 @@
+// PR 94287 ICE looking inside open template-parm level
+// { dg-do run { target c++17 } }
+// { dg-options -fconcepts }
+
+template <typename T,
+ bool X = requires { requires (sizeof(T)==1); } >
+ int foo(T) { return X; }
+
+int main() {
+ if (!foo('4'))
+ return 1;
+ if (foo (4))
+ return 2;
+ return 0;
+}
</cut>
After llvm commit f17d60d620283b5d53286056ceeaeb8c27b6530a
Author: Bjorn Pettersson <bjorn.a.pettersson(a)ericsson.com>
Inform pass manager when child loops are deleted
Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-release-aarch64-spec2k6-O2
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-llvm-f17d60d620283b5d53286056ceeaeb8c27b6530a
cd investigate-llvm-f17d60d620283b5d53286056ceeaeb8c27b6530a
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach f17d60d620283b5d53286056ceeaeb8c27b6530a
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach f56129fe78d5c849971017976c71333b6b1a27c6
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit f17d60d620283b5d53286056ceeaeb8c27b6530a
Author: Bjorn Pettersson <bjorn.a.pettersson(a)ericsson.com>
Date: Fri Sep 3 20:50:33 2021 +0200
Inform pass manager when child loops are deleted
As part of the nontrivial unswitching we could end up removing child
loops. This patch add a notification to the pass manager when
that happens (using the markLoopAsDeleted callback).
Without this there could be stale LoopAccessAnalysis results cached
in the analysis manager. Those analysis results are cached based on
a Loop* as key. Since the BumpPtrAllocator used to allocate
Loop objects could be resetted between different runs of for
example the loop-distribute pass (running on different functions),
a new Loop object could be created using the same Loop pointer.
And then when requiring the LoopAccessAnalysis for the loop we
got the stale (corrupt) result from the destroyed loop.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D109257
(fixes PR51754)
(cherry-picked from commit 0f0344dd1e3b53387bb396070916e67f4c426da6)
---
llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp | 43 +++++++++----
.../nontrivial-unswitch-markloopasdeleted.ll | 71 ++++++++++++++++++++++
2 files changed, 102 insertions(+), 12 deletions(-)
diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index b9cccc2af309..b1c105258027 100644
--- a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -1587,10 +1587,12 @@ deleteDeadClonedBlocks(Loop &L, ArrayRef<BasicBlock *> ExitBlocks,
BB->eraseFromParent();
}
-static void deleteDeadBlocksFromLoop(Loop &L,
- SmallVectorImpl<BasicBlock *> &ExitBlocks,
- DominatorTree &DT, LoopInfo &LI,
- MemorySSAUpdater *MSSAU) {
+static void
+deleteDeadBlocksFromLoop(Loop &L,
+ SmallVectorImpl<BasicBlock *> &ExitBlocks,
+ DominatorTree &DT, LoopInfo &LI,
+ MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
// Find all the dead blocks tied to this loop, and remove them from their
// successors.
SmallSetVector<BasicBlock *, 8> DeadBlockSet;
@@ -1640,6 +1642,7 @@ static void deleteDeadBlocksFromLoop(Loop &L,
}) &&
"If the child loop header is dead all blocks in the child loop must "
"be dead as well!");
+ DestroyLoopCB(*ChildL, ChildL->getName());
LI.destroy(ChildL);
return true;
});
@@ -1980,6 +1983,8 @@ static bool rebuildLoopAfterUnswitch(Loop &L, ArrayRef<BasicBlock *> ExitBlocks,
ParentL->removeChildLoop(llvm::find(*ParentL, &L));
else
LI.removeLoop(llvm::find(LI, &L));
+ // markLoopAsDeleted for L should be triggered by the caller (it is typically
+ // done by using the UnswitchCB callback).
LI.destroy(&L);
return false;
}
@@ -2019,7 +2024,8 @@ static void unswitchNontrivialInvariants(
SmallVectorImpl<BasicBlock *> &ExitBlocks, IVConditionInfo &PartialIVInfo,
DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
function_ref<void(bool, bool, ArrayRef<Loop *>)> UnswitchCB,
- ScalarEvolution *SE, MemorySSAUpdater *MSSAU) {
+ ScalarEvolution *SE, MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
auto *ParentBB = TI.getParent();
BranchInst *BI = dyn_cast<BranchInst>(&TI);
SwitchInst *SI = BI ? nullptr : cast<SwitchInst>(&TI);
@@ -2319,7 +2325,7 @@ static void unswitchNontrivialInvariants(
// Now that our cloned loops have been built, we can update the original loop.
// First we delete the dead blocks from it and then we rebuild the loop
// structure taking these deletions into account.
- deleteDeadBlocksFromLoop(L, ExitBlocks, DT, LI, MSSAU);
+ deleteDeadBlocksFromLoop(L, ExitBlocks, DT, LI, MSSAU, DestroyLoopCB);
if (MSSAU && VerifyMemorySSA)
MSSAU->getMemorySSA()->verifyMemorySSA();
@@ -2670,7 +2676,8 @@ static bool unswitchBestCondition(
Loop &L, DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
AAResults &AA, TargetTransformInfo &TTI,
function_ref<void(bool, bool, ArrayRef<Loop *>)> UnswitchCB,
- ScalarEvolution *SE, MemorySSAUpdater *MSSAU) {
+ ScalarEvolution *SE, MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
// Collect all invariant conditions within this loop (as opposed to an inner
// loop which would be handled when visiting that inner loop).
SmallVector<std::pair<Instruction *, TinyPtrVector<Value *>>, 4>
@@ -2958,7 +2965,7 @@ static bool unswitchBestCondition(
<< "\n");
unswitchNontrivialInvariants(L, *BestUnswitchTI, BestUnswitchInvariants,
ExitBlocks, PartialIVInfo, DT, LI, AC,
- UnswitchCB, SE, MSSAU);
+ UnswitchCB, SE, MSSAU, DestroyLoopCB);
return true;
}
@@ -2988,7 +2995,8 @@ unswitchLoop(Loop &L, DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
AAResults &AA, TargetTransformInfo &TTI, bool Trivial,
bool NonTrivial,
function_ref<void(bool, bool, ArrayRef<Loop *>)> UnswitchCB,
- ScalarEvolution *SE, MemorySSAUpdater *MSSAU) {
+ ScalarEvolution *SE, MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
assert(L.isRecursivelyLCSSAForm(DT, LI) &&
"Loops must be in LCSSA form before unswitching.");
@@ -3036,7 +3044,8 @@ unswitchLoop(Loop &L, DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
// Try to unswitch the best invariant condition. We prefer this full unswitch to
// a partial unswitch when possible below the threshold.
- if (unswitchBestCondition(L, DT, LI, AC, AA, TTI, UnswitchCB, SE, MSSAU))
+ if (unswitchBestCondition(L, DT, LI, AC, AA, TTI, UnswitchCB, SE, MSSAU,
+ DestroyLoopCB))
return true;
// No other opportunities to unswitch.
@@ -3083,6 +3092,10 @@ PreservedAnalyses SimpleLoopUnswitchPass::run(Loop &L, LoopAnalysisManager &AM,
U.markLoopAsDeleted(L, LoopName);
};
+ auto DestroyLoopCB = [&U](Loop &L, StringRef Name) {
+ U.markLoopAsDeleted(L, Name);
+ };
+
Optional<MemorySSAUpdater> MSSAU;
if (AR.MSSA) {
MSSAU = MemorySSAUpdater(AR.MSSA);
@@ -3091,7 +3104,8 @@ PreservedAnalyses SimpleLoopUnswitchPass::run(Loop &L, LoopAnalysisManager &AM,
}
if (!unswitchLoop(L, AR.DT, AR.LI, AR.AC, AR.AA, AR.TTI, Trivial, NonTrivial,
UnswitchCB, &AR.SE,
- MSSAU.hasValue() ? MSSAU.getPointer() : nullptr))
+ MSSAU.hasValue() ? MSSAU.getPointer() : nullptr,
+ DestroyLoopCB))
return PreservedAnalyses::all();
if (AR.MSSA && VerifyMemorySSA)
@@ -3179,12 +3193,17 @@ bool SimpleLoopUnswitchLegacyPass::runOnLoop(Loop *L, LPPassManager &LPM) {
LPM.markLoopAsDeleted(*L);
};
+ auto DestroyLoopCB = [&LPM](Loop &L, StringRef /* Name */) {
+ LPM.markLoopAsDeleted(L);
+ };
+
if (MSSA && VerifyMemorySSA)
MSSA->verifyMemorySSA();
bool Changed =
unswitchLoop(*L, DT, LI, AC, AA, TTI, true, NonTrivial, UnswitchCB, SE,
- MSSAU.hasValue() ? MSSAU.getPointer() : nullptr);
+ MSSAU.hasValue() ? MSSAU.getPointer() : nullptr,
+ DestroyLoopCB);
if (MSSA && VerifyMemorySSA)
MSSA->verifyMemorySSA();
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-markloopasdeleted.ll b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-markloopasdeleted.ll
new file mode 100644
index 000000000000..455a38535576
--- /dev/null
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-markloopasdeleted.ll
@@ -0,0 +1,71 @@
+; RUN: opt < %s -enable-loop-distribute -passes='loop-distribute,loop-mssa(simple-loop-unswitch<nontrivial>),loop-distribute' -o /dev/null -S -debug-pass-manager=verbose 2>&1 | FileCheck %s
+
+
+; Running loop-distribute will result in LoopAccessAnalysis being required and
+; cached in the LoopAnalysisManagerFunctionProxy.
+;
+; CHECK: Running analysis: LoopAccessAnalysis on Loop at depth 2 containing: %loop_a_inner<header><latch><exiting>
+
+
+; Then simple-loop-unswitch is removing/replacing some loops (resulting in
+; Loop objects used as key in the analyses cache is destroyed). So here we
+; want to see that any analysis results cached on the destroyed loop is
+; cleared. A special case here is that loop_a_inner is destroyed when
+; unswitching the parent loop.
+;
+; The bug solved and verified by this test case was related to the
+; SimpleLoopUnswitch not marking the Loop as removed, so we missed clearing
+; the analysis caches.
+;
+; CHECK: Running pass: SimpleLoopUnswitchPass on Loop at depth 1 containing: %loop_begin<header>,%loop_b,%loop_b_inner,%loop_b_inner_exit,%loop_a,%loop_a_inner,%loop_a_inner_exit,%latch<latch><exiting>
+; CHECK-NEXT: Clearing all analysis results for: loop_a_inner
+
+
+; When running loop-distribute the second time we can see that loop_a_inner
+; isn't analysed because the loop no longer exists (instead we find a new loop,
+; loop_a_inner.us). This kind of verifies that it was correct to remove the
+; loop_a_inner related analysis above.
+;
+; CHECK: Running analysis: LoopAccessAnalysis on Loop at depth 2 containing: %loop_a_inner.us<header><latch><exiting>
+
+
+define i32 @test6(i1* %ptr, i1 %cond1, i32* %a.ptr, i32* %b.ptr) {
+entry:
+ br label %loop_begin
+
+loop_begin:
+ %v = load i1, i1* %ptr
+ br i1 %cond1, label %loop_a, label %loop_b
+
+loop_a:
+ br label %loop_a_inner
+
+loop_a_inner:
+ %va = load i1, i1* %ptr
+ %a = load i32, i32* %a.ptr
+ br i1 %va, label %loop_a_inner, label %loop_a_inner_exit
+
+loop_a_inner_exit:
+ %a.lcssa = phi i32 [ %a, %loop_a_inner ]
+ br label %latch
+
+loop_b:
+ br label %loop_b_inner
+
+loop_b_inner:
+ %vb = load i1, i1* %ptr
+ %b = load i32, i32* %b.ptr
+ br i1 %vb, label %loop_b_inner, label %loop_b_inner_exit
+
+loop_b_inner_exit:
+ %b.lcssa = phi i32 [ %b, %loop_b_inner ]
+ br label %latch
+
+latch:
+ %ab.phi = phi i32 [ %a.lcssa, %loop_a_inner_exit ], [ %b.lcssa, %loop_b_inner_exit ]
+ br i1 %v, label %loop_begin, label %loop_exit
+
+loop_exit:
+ %ab.lcssa = phi i32 [ %ab.phi, %latch ]
+ ret i32 %ab.lcssa
+}
</cut>
Identified regression caused by *gcc:76b75018b3d053a890ebe155e47814de14b3c9fb*:
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
c++: implement C++17 hardware interference size
Results regressed to (for first_bad == 76b75018b3d053a890ebe155e47814de14b3c9fb)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe stage1:
2
# build_abe linux:
3
# build_abe glibc:
4
# First few build errors in logs:
from (for last_good == 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe stage1:
2
# build_abe linux:
3
# build_abe glibc:
4
# build_abe stage2:
5
# build_abe gdb:
6
# build_abe qemu:
7
This commit has regressed these CI configurations:
- tcwg_gnu_cross_build/master-aarch64
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti…
Even more details: https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti…
Reproduce builds:
<cut>
mkdir investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
cd investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_gnu-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 76b75018b3d053a890ebe155e47814de14b3c9fb
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
Date: Thu Jul 15 15:30:17 2021 -0400
c++: implement C++17 hardware interference size
The last missing piece of the C++17 standard library is the hardware
intereference size constants. Much of the delay in implementing these has
been due to uncertainty about what the right values are, and even whether
there is a single constant value that is suitable; the destructive
interference size is intended to be used in structure layout, so program
ABIs will depend on it.
In principle, both of these values should be the same as the target's L1
cache line size. When compiling for a generic target that is intended to
support a range of target CPUs with different cache line sizes, the
constructive size should probably be the minimum size, and the destructive
size the maximum, unless you are constrained by ABI compatibility with
previous code.
From discussion on gcc-patches, I've come to the conclusion that the
solution to the difficulty of choosing stable values is to give up on it,
and instead encourage only uses where ABI stability is unimportant: in
particular, uses where the ABI is shared at most between translation units
built at the same time with the same flags.
To that end, I've added a warning for any use of the constant value of
std::hardware_destructive_interference_size in a header or module export.
Appropriate uses within a project can disable the warning.
A previous iteration of this patch included an -finterference-tune flag to
make the value vary with -mtune; this iteration makes that the default
behavior, which should be appropriate for all reasonable uses of the
variable. The previous default of "stable-ish" seems to me likely to have
been more of an attractive nuisance; since we can't promise actual
stability, we should instead make proper uses more convenient.
JF Bastien's implementation proposal is summarized at
https://github.com/itanium-cxx-abi/cxx-abi/issues/74
I implement this by adding new --params for the two sizes. Targets can
override these values in targetm.target_option.override() to support a range
of values for the generic target; otherwise, both will default to the L1
cache line size.
64 bytes still seems correct for all x86.
I'm not sure why he proposed 64/64 for generic 32-bit ARM, since the Cortex
A9 has a 32-byte cache line, so I'd think 32/64 would make more sense.
He proposed 64/128 for generic AArch64, but since the A64FX now has a 256B
cache line, I've changed that to 64/256.
Other arch maintainers are invited to set ranges for their generic targets
if that seems better than using the default cache line size for both values.
With the above choice to reject stability as a goal, getting these values
"right" is now just a matter of what we want the default optimization to be,
and we can feel free to adjust them as CPUs with different cache lines
become more and less common.
gcc/ChangeLog:
* params.opt: Add destructive-interference-size and
constructive-interference-size.
* doc/invoke.texi: Document them.
* config/aarch64/aarch64.c (aarch64_override_options_internal):
Set them.
* config/arm/arm.c (arm_option_override): Set them.
* config/i386/i386-options.c (ix86_option_override_internal):
Set them.
gcc/c-family/ChangeLog:
* c.opt: Add -Winterference-size.
* c-cppbuiltin.c (cpp_atomic_builtins): Add __GCC_DESTRUCTIVE_SIZE
and __GCC_CONSTRUCTIVE_SIZE.
gcc/cp/ChangeLog:
* constexpr.c (maybe_warn_about_constant_value):
Complain about std::hardware_destructive_interference_size.
(cxx_eval_constant_expression): Call it.
* decl.c (cxx_init_decl_processing): Check
--param *-interference-size values.
libstdc++-v3/ChangeLog:
* include/std/version: Define __cpp_lib_hardware_interference_size.
* libsupc++/new: Define hardware interference size variables.
gcc/testsuite/ChangeLog:
* g++.dg/warn/Winterference.H: New file.
* g++.dg/warn/Winterference.C: New test.
* g++.target/aarch64/interference.C: New test.
* g++.target/arm/interference.C: New test.
* g++.target/i386/interference.C: New test.
---
gcc/c-family/c-cppbuiltin.c | 14 ++++++
gcc/c-family/c.opt | 5 ++
gcc/config/aarch64/aarch64.c | 22 +++++++++
gcc/config/arm/arm.c | 22 +++++++++
gcc/config/i386/i386-options.c | 6 +++
gcc/cp/constexpr.c | 33 +++++++++++++
gcc/cp/decl.c | 32 ++++++++++++
gcc/doc/invoke.texi | 65 +++++++++++++++++++++++++
gcc/params.opt | 16 ++++++
gcc/testsuite/g++.dg/warn/Winterference-2.C | 14 ++++++
gcc/testsuite/g++.dg/warn/Winterference.C | 6 +++
gcc/testsuite/g++.dg/warn/Winterference.H | 7 +++
gcc/testsuite/g++.target/aarch64/interference.C | 9 ++++
gcc/testsuite/g++.target/arm/interference.C | 9 ++++
gcc/testsuite/g++.target/i386/interference.C | 8 +++
libstdc++-v3/include/std/version | 3 ++
libstdc++-v3/libsupc++/new | 10 +++-
17 files changed, 279 insertions(+), 2 deletions(-)
diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 48cbefd8bf8..ce88e707127 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -741,6 +741,20 @@ cpp_atomic_builtins (cpp_reader *pfile)
builtin_define_with_int_value ("__GCC_ATOMIC_TEST_AND_SET_TRUEVAL",
targetm.atomic_test_and_set_trueval);
+ /* Macros for C++17 hardware interference size constants. Either both or
+ neither should be set. */
+ gcc_assert (!param_destruct_interfere_size
+ == !param_construct_interfere_size);
+ if (param_destruct_interfere_size)
+ {
+ /* FIXME The way of communicating these values to the library should be
+ part of the C++ ABI, whether macro or builtin. */
+ builtin_define_with_int_value ("__GCC_DESTRUCTIVE_SIZE",
+ param_destruct_interfere_size);
+ builtin_define_with_int_value ("__GCC_CONSTRUCTIVE_SIZE",
+ param_construct_interfere_size);
+ }
+
/* ptr_type_node can't be used here since ptr_mode is only set when
toplev calls backend_init which is not done with -E or pch. */
psize = POINTER_SIZE_UNITS;
diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index c5fe90003f2..9c151d19870 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -722,6 +722,11 @@ Winit-list-lifetime
C++ ObjC++ Var(warn_init_list) Warning Init(1)
Warn about uses of std::initializer_list that can result in dangling pointers.
+Winterference-size
+C++ ObjC++ Var(warn_interference_size) Warning Init(1)
+Warn about nonsensical values of --param destructive-interference-size or
+constructive-interference-size.
+
Wimplicit
C ObjC Var(warn_implicit) Warning LangEnabledBy(C ObjC,Wall)
Warn about implicit declarations.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 30d9a0b7a3d..36519ccc5a5 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16540,6 +16540,28 @@ aarch64_override_options_internal (struct gcc_options *opts)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l1_cache_line_size,
aarch64_tune_params.prefetch->l1_cache_line_size);
+
+ if (aarch64_tune_params.prefetch->l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic AArch64 target, cover the current range of cache line
+ sizes. */
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ 256);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ 64);
+ }
+
if (aarch64_tune_params.prefetch->l2_cache_size >= 0)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l2_cache_size,
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1e628253d0..6c6e77fab66 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3669,6 +3669,28 @@ arm_option_override (void)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_line_size,
current_tune->prefetch.l1_cache_line_size);
+ if (current_tune->prefetch.l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic ARM target, JF Bastien proposed using 64 for both. */
+ /* ??? Cortex A9 has a 32-byte cache line, so why not 32 for
+ constructive? */
+ /* More recent Cortex chips have a 64-byte cache line, but are marked
+ ARM_PREFETCH_NOT_BENEFICIAL, so they get these defaults. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+ }
+
if (current_tune->prefetch.l1_cache_size >= 0)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_size,
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 2cb87cedec0..c0006b3674b 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2579,6 +2579,12 @@ ix86_option_override_internal (bool main_args_p,
SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
ix86_tune_cost->l2_cache_size);
+ /* 64B is the accepted value for these for all x86. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+
/* Enable sw prefetching at -O3 for CPUS that prefetching is helpful. */
if (opts->x_flag_prefetch_loop_arrays < 0
&& HAVE_prefetch
diff --git a/gcc/cp/constexpr.c b/gcc/cp/constexpr.c
index 7772fe62d95..0c2498aee22 100644
--- a/gcc/cp/constexpr.c
+++ b/gcc/cp/constexpr.c
@@ -6075,6 +6075,37 @@ inline_asm_in_constexpr_error (location_t loc)
"%<constexpr%> function in C++20");
}
+/* We're getting the constant value of DECL in a manifestly constant-evaluated
+ context; maybe complain about that. */
+
+static void
+maybe_warn_about_constant_value (location_t loc, tree decl)
+{
+ static bool explained = false;
+ if (cxx_dialect >= cxx17
+ && warn_interference_size
+ && !global_options_set.x_param_destruct_interfere_size
+ && DECL_CONTEXT (decl) == std_node
+ && id_equal (DECL_NAME (decl), "hardware_destructive_interference_size")
+ && (LOCATION_FILE (input_location) != main_input_filename
+ || module_exporting_p ())
+ && warning_at (loc, OPT_Winterference_size, "use of %qD", decl)
+ && !explained)
+ {
+ explained = true;
+ inform (loc, "its value can vary between compiler versions or "
+ "with different %<-mtune%> or %<-mcpu%> flags");
+ inform (loc, "if this use is part of a public ABI, change it to "
+ "instead use a constant variable you define");
+ inform (loc, "the default value for the current CPU tuning "
+ "is %d bytes", param_destruct_interfere_size);
+ inform (loc, "you can stabilize this value with %<--param "
+ "hardware_destructive_interference_size=%d%>, or disable "
+ "this warning with %<-Wno-interference-size%>",
+ param_destruct_interfere_size);
+ }
+}
+
/* Attempt to reduce the expression T to a constant value.
On failure, issue diagnostic and return error_mark_node. */
/* FIXME unify with c_fully_fold */
@@ -6219,6 +6250,8 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, tree t,
r = *p;
break;
}
+ if (ctx->manifestly_const_eval)
+ maybe_warn_about_constant_value (loc, t);
if (COMPLETE_TYPE_P (TREE_TYPE (t))
&& is_really_empty_class (TREE_TYPE (t), /*ignore_vptr*/false))
{
diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index bce62ad202a..c2065027369 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -4752,6 +4752,38 @@ cxx_init_decl_processing (void)
/* Show we use EH for cleanups. */
if (flag_exceptions)
using_eh_for_cleanups ();
+
+ /* Check that the hardware interference sizes are at least
+ alignof(max_align_t), as required by the standard. */
+ const int max_align = max_align_t_align () / BITS_PER_UNIT;
+ if (param_destruct_interfere_size)
+ {
+ if (param_destruct_interfere_size < max_align)
+ error ("%<--param destructive-interference-size=%d%> is less than "
+ "%d", param_destruct_interfere_size, max_align);
+ else if (param_destruct_interfere_size < param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param destructive-interference-size=%d%> "
+ "is less than %<--param l1-cache-line-size=%d%>",
+ param_destruct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_destruct_interfere_size = param_l1_cache_line_size;
+ /* else leave it unset. */
+
+ if (param_construct_interfere_size)
+ {
+ if (param_construct_interfere_size < max_align)
+ error ("%<--param constructive-interference-size=%d%> is less than "
+ "%d", param_construct_interfere_size, max_align);
+ else if (param_construct_interfere_size > param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param constructive-interference-size=%d%> "
+ "is greater than %<--param l1-cache-line-size=%d%>",
+ param_construct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_construct_interfere_size = param_l1_cache_line_size;
}
/* Enter an abi node in global-module context. returns a cookie to
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23cc68f92b5..78cfc100ac2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9018,6 +9018,43 @@ that has already been done in the current function. Therefore,
seemingly insignificant changes in the source program can cause the
warnings produced by @option{-Winline} to appear or disappear.
+@item -Winterference-size
+@opindex Winterference-size
+Warn about use of C++17 @code{std::hardware_destructive_interference_size}
+without specifying its value with @option{--param destructive-interference-size}.
+Also warn about questionable values for that option.
+
+This variable is intended to be used for controlling class layout, to
+avoid false sharing in concurrent code:
+
+@smallexample
+struct independent_fields @{
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
+@};
+@end smallexample
+
+Here @samp{one} and @samp{two} are intended to be far enough apart
+that stores to one won't require accesses to the other to reload the
+cache line.
+
+By default, @option{--param destructive-interference-size} and
+@option{--param constructive-interference-size} are set based on the
+current @option{-mtune} option, typically to the L1 cache line size
+for the particular target CPU, sometimes to a range if tuning for a
+generic target. So all translation units that depend on ABI
+compatibility for the use of these variables must be compiled with
+the same @option{-mtune} (or @option{-mcpu}).
+
+If ABI stability is important, such as if the use is in a header for a
+library, you should probably not use the hardware interference size
+variables at all. Alternatively, you can force a particular value
+with @option{--param}.
+
+If you are confident that your use of the variable does not affect ABI
+outside a single build of your project, you can turn off the warning
+with @option{-Wno-interference-size}.
+
@item -Wint-in-bool-context
@opindex Wint-in-bool-context
@opindex Wno-int-in-bool-context
@@ -13938,6 +13975,34 @@ prefetch hints can be issued for any constant stride.
This setting is only useful for strides that are known and constant.
+@item destructive-interference-size
+@item constructive-interference-size
+The values for the C++17 variables
+@code{std::hardware_destructive_interference_size} and
+@code{std::hardware_constructive_interference_size}. The destructive
+interference size is the minimum recommended offset between two
+independent concurrently-accessed objects; the constructive
+interference size is the maximum recommended size of contiguous memory
+accessed together. Typically both will be the size of an L1 cache
+line for the target, in bytes. For a generic target covering a range of L1
+cache line sizes, typically the constructive interference size will be
+the small end of the range and the destructive size will be the large
+end.
+
+The destructive interference size is intended to be used for layout,
+and thus has ABI impact. The default value is not expected to be
+stable, and on some targets varies with @option{-mtune}, so use of
+this variable in a context where ABI stability is important, such as
+the public interface of a library, is strongly discouraged; if it is
+used in that context, users can stabilize the value using this
+option.
+
+The constructive interference size is less sensitive, as it is
+typically only used in a @samp{static_assert} to make sure that a type
+fits within a cache line.
+
+See also @option{-Winterference-size}.
+
@item loop-interchange-max-num-stmts
The maximum number of stmts in a loop to be interchanged.
diff --git a/gcc/params.opt b/gcc/params.opt
index 3a701e22c46..658ca028851 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -361,6 +361,22 @@ The maximum code size growth ratio when expanding into a jump table (in percent)
Common Joined UInteger Var(param_l1_cache_line_size) Init(32) Param Optimization
The size of L1 cache line.
+-param=destructive-interference-size=
+Common Joined UInteger Var(param_destruct_interfere_size) Init(0) Param Optimization
+The minimum recommended offset between two concurrently-accessed objects to
+avoid additional performance degradation due to contention introduced by the
+implementation. Typically the L1 cache line size, but can be larger to
+accommodate a variety of target processors with different cache line sizes.
+C++17 code might use this value in structure layout, but is strongly
+discouraged from doing so in public ABIs.
+
+-param=constructive-interference-size=
+Common Joined UInteger Var(param_construct_interfere_size) Init(0) Param Optimization
+The maximum recommended size of contiguous memory occupied by two objects
+accessed with temporal locality by concurrent threads. Typically the L1 cache
+line size, but can be smaller to accommodate a variety of target processors with
+different cache line sizes.
+
-param=l1-cache-size=
Common Joined UInteger Var(param_l1_cache_size) Init(64) Param Optimization
The size of L1 cache.
diff --git a/gcc/testsuite/g++.dg/warn/Winterference-2.C b/gcc/testsuite/g++.dg/warn/Winterference-2.C
new file mode 100644
index 00000000000..2af75c63f83
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference-2.C
@@ -0,0 +1,14 @@
+// { dg-do compile { target c++20 } }
+// { dg-additional-options -fmodules-ts }
+
+module ;
+
+#include <new>
+
+export module foo;
+
+export {
+ struct A {
+ alignas(std::hardware_destructive_interference_size) int x; // { dg-warning Winterference-size }
+ };
+}
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.C b/gcc/testsuite/g++.dg/warn/Winterference.C
new file mode 100644
index 00000000000..57c001bc032
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.C
@@ -0,0 +1,6 @@
+// Test that we warn about use of std::hardware_destructive_interference_size
+// in a header.
+// { dg-do compile { target c++17 } }
+
+// { dg-warning Winterference-size "" { target *-*-* } 0 }
+#include "Winterference.H"
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.H b/gcc/testsuite/g++.dg/warn/Winterference.H
new file mode 100644
index 00000000000..36f0ad5f6d1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.H
@@ -0,0 +1,7 @@
+#include <new>
+
+struct A
+{
+ alignas(std::hardware_destructive_interference_size) int i;
+ alignas(std::hardware_destructive_interference_size) int j;
+};
diff --git a/gcc/testsuite/g++.target/aarch64/interference.C b/gcc/testsuite/g++.target/aarch64/interference.C
new file mode 100644
index 00000000000..0fc01655223
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Most AArch64 CPUs have an L1 cache line size of 64, but some recent ones use
+// 128 or even 256.
+static_assert(std::hardware_destructive_interference_size == 256);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/arm/interference.C b/gcc/testsuite/g++.target/arm/interference.C
new file mode 100644
index 00000000000..34fe8a52bff
--- /dev/null
+++ b/gcc/testsuite/g++.target/arm/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Recent ARM CPUs have a cache line size of 64. Older ones have
+// a size of 32, but I guess they're old enough that we don't care?
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/i386/interference.C b/gcc/testsuite/g++.target/i386/interference.C
new file mode 100644
index 00000000000..c7b910e3ada
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/interference.C
@@ -0,0 +1,8 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// It is generally agreed that these are the right values for all x86.
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/libstdc++-v3/include/std/version b/libstdc++-v3/include/std/version
index f950bf0f0db..f41004b5911 100644
--- a/libstdc++-v3/include/std/version
+++ b/libstdc++-v3/include/std/version
@@ -140,6 +140,9 @@
#define __cpp_lib_filesystem 201703
#define __cpp_lib_gcd 201606
#define __cpp_lib_gcd_lcm 201606
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+#endif
#define __cpp_lib_hypot 201603
#define __cpp_lib_invoke 201411L
#define __cpp_lib_lcm 201606
diff --git a/libstdc++-v3/libsupc++/new b/libstdc++-v3/libsupc++/new
index 3349b13fd1b..7bc67a6cb02 100644
--- a/libstdc++-v3/libsupc++/new
+++ b/libstdc++-v3/libsupc++/new
@@ -183,9 +183,9 @@ inline void operator delete[](void*, void*) _GLIBCXX_USE_NOEXCEPT { }
} // extern "C++"
#if __cplusplus >= 201703L
-#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
namespace std
{
+#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
#define __cpp_lib_launder 201606
/// Pointer optimization barrier [ptr.launder]
template<typename _Tp>
@@ -205,8 +205,14 @@ namespace std
void launder(const void*) = delete;
void launder(volatile void*) = delete;
void launder(const volatile void*) = delete;
-}
#endif // _GLIBCXX_HAVE_BUILTIN_LAUNDER
+
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+ inline constexpr size_t hardware_destructive_interference_size = __GCC_DESTRUCTIVE_SIZE;
+ inline constexpr size_t hardware_constructive_interference_size = __GCC_CONSTRUCTIVE_SIZE;
+#endif // __GCC_DESTRUCTIVE_SIZE
+}
#endif // C++17
#if __cplusplus > 201703L
</cut>
Identified regression caused by *gcc:76b75018b3d053a890ebe155e47814de14b3c9fb*:
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
c++: implement C++17 hardware interference size
Results regressed to (for first_bad == 76b75018b3d053a890ebe155e47814de14b3c9fb)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# First few build errors in logs:
from (for last_good == 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe bootstrap:
2
This commit has regressed these CI configurations:
- tcwg_gcc_bootstrap/master-aarch64-bootstrap
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra…
Even more details: https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra…
Reproduce builds:
<cut>
mkdir investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
cd investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_gnu-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 76b75018b3d053a890ebe155e47814de14b3c9fb
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
Date: Thu Jul 15 15:30:17 2021 -0400
c++: implement C++17 hardware interference size
The last missing piece of the C++17 standard library is the hardware
intereference size constants. Much of the delay in implementing these has
been due to uncertainty about what the right values are, and even whether
there is a single constant value that is suitable; the destructive
interference size is intended to be used in structure layout, so program
ABIs will depend on it.
In principle, both of these values should be the same as the target's L1
cache line size. When compiling for a generic target that is intended to
support a range of target CPUs with different cache line sizes, the
constructive size should probably be the minimum size, and the destructive
size the maximum, unless you are constrained by ABI compatibility with
previous code.
From discussion on gcc-patches, I've come to the conclusion that the
solution to the difficulty of choosing stable values is to give up on it,
and instead encourage only uses where ABI stability is unimportant: in
particular, uses where the ABI is shared at most between translation units
built at the same time with the same flags.
To that end, I've added a warning for any use of the constant value of
std::hardware_destructive_interference_size in a header or module export.
Appropriate uses within a project can disable the warning.
A previous iteration of this patch included an -finterference-tune flag to
make the value vary with -mtune; this iteration makes that the default
behavior, which should be appropriate for all reasonable uses of the
variable. The previous default of "stable-ish" seems to me likely to have
been more of an attractive nuisance; since we can't promise actual
stability, we should instead make proper uses more convenient.
JF Bastien's implementation proposal is summarized at
https://github.com/itanium-cxx-abi/cxx-abi/issues/74
I implement this by adding new --params for the two sizes. Targets can
override these values in targetm.target_option.override() to support a range
of values for the generic target; otherwise, both will default to the L1
cache line size.
64 bytes still seems correct for all x86.
I'm not sure why he proposed 64/64 for generic 32-bit ARM, since the Cortex
A9 has a 32-byte cache line, so I'd think 32/64 would make more sense.
He proposed 64/128 for generic AArch64, but since the A64FX now has a 256B
cache line, I've changed that to 64/256.
Other arch maintainers are invited to set ranges for their generic targets
if that seems better than using the default cache line size for both values.
With the above choice to reject stability as a goal, getting these values
"right" is now just a matter of what we want the default optimization to be,
and we can feel free to adjust them as CPUs with different cache lines
become more and less common.
gcc/ChangeLog:
* params.opt: Add destructive-interference-size and
constructive-interference-size.
* doc/invoke.texi: Document them.
* config/aarch64/aarch64.c (aarch64_override_options_internal):
Set them.
* config/arm/arm.c (arm_option_override): Set them.
* config/i386/i386-options.c (ix86_option_override_internal):
Set them.
gcc/c-family/ChangeLog:
* c.opt: Add -Winterference-size.
* c-cppbuiltin.c (cpp_atomic_builtins): Add __GCC_DESTRUCTIVE_SIZE
and __GCC_CONSTRUCTIVE_SIZE.
gcc/cp/ChangeLog:
* constexpr.c (maybe_warn_about_constant_value):
Complain about std::hardware_destructive_interference_size.
(cxx_eval_constant_expression): Call it.
* decl.c (cxx_init_decl_processing): Check
--param *-interference-size values.
libstdc++-v3/ChangeLog:
* include/std/version: Define __cpp_lib_hardware_interference_size.
* libsupc++/new: Define hardware interference size variables.
gcc/testsuite/ChangeLog:
* g++.dg/warn/Winterference.H: New file.
* g++.dg/warn/Winterference.C: New test.
* g++.target/aarch64/interference.C: New test.
* g++.target/arm/interference.C: New test.
* g++.target/i386/interference.C: New test.
---
gcc/c-family/c-cppbuiltin.c | 14 ++++++
gcc/c-family/c.opt | 5 ++
gcc/config/aarch64/aarch64.c | 22 +++++++++
gcc/config/arm/arm.c | 22 +++++++++
gcc/config/i386/i386-options.c | 6 +++
gcc/cp/constexpr.c | 33 +++++++++++++
gcc/cp/decl.c | 32 ++++++++++++
gcc/doc/invoke.texi | 65 +++++++++++++++++++++++++
gcc/params.opt | 16 ++++++
gcc/testsuite/g++.dg/warn/Winterference-2.C | 14 ++++++
gcc/testsuite/g++.dg/warn/Winterference.C | 6 +++
gcc/testsuite/g++.dg/warn/Winterference.H | 7 +++
gcc/testsuite/g++.target/aarch64/interference.C | 9 ++++
gcc/testsuite/g++.target/arm/interference.C | 9 ++++
gcc/testsuite/g++.target/i386/interference.C | 8 +++
libstdc++-v3/include/std/version | 3 ++
libstdc++-v3/libsupc++/new | 10 +++-
17 files changed, 279 insertions(+), 2 deletions(-)
diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 48cbefd8bf8..ce88e707127 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -741,6 +741,20 @@ cpp_atomic_builtins (cpp_reader *pfile)
builtin_define_with_int_value ("__GCC_ATOMIC_TEST_AND_SET_TRUEVAL",
targetm.atomic_test_and_set_trueval);
+ /* Macros for C++17 hardware interference size constants. Either both or
+ neither should be set. */
+ gcc_assert (!param_destruct_interfere_size
+ == !param_construct_interfere_size);
+ if (param_destruct_interfere_size)
+ {
+ /* FIXME The way of communicating these values to the library should be
+ part of the C++ ABI, whether macro or builtin. */
+ builtin_define_with_int_value ("__GCC_DESTRUCTIVE_SIZE",
+ param_destruct_interfere_size);
+ builtin_define_with_int_value ("__GCC_CONSTRUCTIVE_SIZE",
+ param_construct_interfere_size);
+ }
+
/* ptr_type_node can't be used here since ptr_mode is only set when
toplev calls backend_init which is not done with -E or pch. */
psize = POINTER_SIZE_UNITS;
diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index c5fe90003f2..9c151d19870 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -722,6 +722,11 @@ Winit-list-lifetime
C++ ObjC++ Var(warn_init_list) Warning Init(1)
Warn about uses of std::initializer_list that can result in dangling pointers.
+Winterference-size
+C++ ObjC++ Var(warn_interference_size) Warning Init(1)
+Warn about nonsensical values of --param destructive-interference-size or
+constructive-interference-size.
+
Wimplicit
C ObjC Var(warn_implicit) Warning LangEnabledBy(C ObjC,Wall)
Warn about implicit declarations.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 30d9a0b7a3d..36519ccc5a5 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16540,6 +16540,28 @@ aarch64_override_options_internal (struct gcc_options *opts)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l1_cache_line_size,
aarch64_tune_params.prefetch->l1_cache_line_size);
+
+ if (aarch64_tune_params.prefetch->l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic AArch64 target, cover the current range of cache line
+ sizes. */
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ 256);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ 64);
+ }
+
if (aarch64_tune_params.prefetch->l2_cache_size >= 0)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l2_cache_size,
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1e628253d0..6c6e77fab66 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3669,6 +3669,28 @@ arm_option_override (void)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_line_size,
current_tune->prefetch.l1_cache_line_size);
+ if (current_tune->prefetch.l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic ARM target, JF Bastien proposed using 64 for both. */
+ /* ??? Cortex A9 has a 32-byte cache line, so why not 32 for
+ constructive? */
+ /* More recent Cortex chips have a 64-byte cache line, but are marked
+ ARM_PREFETCH_NOT_BENEFICIAL, so they get these defaults. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+ }
+
if (current_tune->prefetch.l1_cache_size >= 0)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_size,
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 2cb87cedec0..c0006b3674b 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2579,6 +2579,12 @@ ix86_option_override_internal (bool main_args_p,
SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
ix86_tune_cost->l2_cache_size);
+ /* 64B is the accepted value for these for all x86. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+
/* Enable sw prefetching at -O3 for CPUS that prefetching is helpful. */
if (opts->x_flag_prefetch_loop_arrays < 0
&& HAVE_prefetch
diff --git a/gcc/cp/constexpr.c b/gcc/cp/constexpr.c
index 7772fe62d95..0c2498aee22 100644
--- a/gcc/cp/constexpr.c
+++ b/gcc/cp/constexpr.c
@@ -6075,6 +6075,37 @@ inline_asm_in_constexpr_error (location_t loc)
"%<constexpr%> function in C++20");
}
+/* We're getting the constant value of DECL in a manifestly constant-evaluated
+ context; maybe complain about that. */
+
+static void
+maybe_warn_about_constant_value (location_t loc, tree decl)
+{
+ static bool explained = false;
+ if (cxx_dialect >= cxx17
+ && warn_interference_size
+ && !global_options_set.x_param_destruct_interfere_size
+ && DECL_CONTEXT (decl) == std_node
+ && id_equal (DECL_NAME (decl), "hardware_destructive_interference_size")
+ && (LOCATION_FILE (input_location) != main_input_filename
+ || module_exporting_p ())
+ && warning_at (loc, OPT_Winterference_size, "use of %qD", decl)
+ && !explained)
+ {
+ explained = true;
+ inform (loc, "its value can vary between compiler versions or "
+ "with different %<-mtune%> or %<-mcpu%> flags");
+ inform (loc, "if this use is part of a public ABI, change it to "
+ "instead use a constant variable you define");
+ inform (loc, "the default value for the current CPU tuning "
+ "is %d bytes", param_destruct_interfere_size);
+ inform (loc, "you can stabilize this value with %<--param "
+ "hardware_destructive_interference_size=%d%>, or disable "
+ "this warning with %<-Wno-interference-size%>",
+ param_destruct_interfere_size);
+ }
+}
+
/* Attempt to reduce the expression T to a constant value.
On failure, issue diagnostic and return error_mark_node. */
/* FIXME unify with c_fully_fold */
@@ -6219,6 +6250,8 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, tree t,
r = *p;
break;
}
+ if (ctx->manifestly_const_eval)
+ maybe_warn_about_constant_value (loc, t);
if (COMPLETE_TYPE_P (TREE_TYPE (t))
&& is_really_empty_class (TREE_TYPE (t), /*ignore_vptr*/false))
{
diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index bce62ad202a..c2065027369 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -4752,6 +4752,38 @@ cxx_init_decl_processing (void)
/* Show we use EH for cleanups. */
if (flag_exceptions)
using_eh_for_cleanups ();
+
+ /* Check that the hardware interference sizes are at least
+ alignof(max_align_t), as required by the standard. */
+ const int max_align = max_align_t_align () / BITS_PER_UNIT;
+ if (param_destruct_interfere_size)
+ {
+ if (param_destruct_interfere_size < max_align)
+ error ("%<--param destructive-interference-size=%d%> is less than "
+ "%d", param_destruct_interfere_size, max_align);
+ else if (param_destruct_interfere_size < param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param destructive-interference-size=%d%> "
+ "is less than %<--param l1-cache-line-size=%d%>",
+ param_destruct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_destruct_interfere_size = param_l1_cache_line_size;
+ /* else leave it unset. */
+
+ if (param_construct_interfere_size)
+ {
+ if (param_construct_interfere_size < max_align)
+ error ("%<--param constructive-interference-size=%d%> is less than "
+ "%d", param_construct_interfere_size, max_align);
+ else if (param_construct_interfere_size > param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param constructive-interference-size=%d%> "
+ "is greater than %<--param l1-cache-line-size=%d%>",
+ param_construct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_construct_interfere_size = param_l1_cache_line_size;
}
/* Enter an abi node in global-module context. returns a cookie to
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23cc68f92b5..78cfc100ac2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9018,6 +9018,43 @@ that has already been done in the current function. Therefore,
seemingly insignificant changes in the source program can cause the
warnings produced by @option{-Winline} to appear or disappear.
+@item -Winterference-size
+@opindex Winterference-size
+Warn about use of C++17 @code{std::hardware_destructive_interference_size}
+without specifying its value with @option{--param destructive-interference-size}.
+Also warn about questionable values for that option.
+
+This variable is intended to be used for controlling class layout, to
+avoid false sharing in concurrent code:
+
+@smallexample
+struct independent_fields @{
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
+@};
+@end smallexample
+
+Here @samp{one} and @samp{two} are intended to be far enough apart
+that stores to one won't require accesses to the other to reload the
+cache line.
+
+By default, @option{--param destructive-interference-size} and
+@option{--param constructive-interference-size} are set based on the
+current @option{-mtune} option, typically to the L1 cache line size
+for the particular target CPU, sometimes to a range if tuning for a
+generic target. So all translation units that depend on ABI
+compatibility for the use of these variables must be compiled with
+the same @option{-mtune} (or @option{-mcpu}).
+
+If ABI stability is important, such as if the use is in a header for a
+library, you should probably not use the hardware interference size
+variables at all. Alternatively, you can force a particular value
+with @option{--param}.
+
+If you are confident that your use of the variable does not affect ABI
+outside a single build of your project, you can turn off the warning
+with @option{-Wno-interference-size}.
+
@item -Wint-in-bool-context
@opindex Wint-in-bool-context
@opindex Wno-int-in-bool-context
@@ -13938,6 +13975,34 @@ prefetch hints can be issued for any constant stride.
This setting is only useful for strides that are known and constant.
+@item destructive-interference-size
+@item constructive-interference-size
+The values for the C++17 variables
+@code{std::hardware_destructive_interference_size} and
+@code{std::hardware_constructive_interference_size}. The destructive
+interference size is the minimum recommended offset between two
+independent concurrently-accessed objects; the constructive
+interference size is the maximum recommended size of contiguous memory
+accessed together. Typically both will be the size of an L1 cache
+line for the target, in bytes. For a generic target covering a range of L1
+cache line sizes, typically the constructive interference size will be
+the small end of the range and the destructive size will be the large
+end.
+
+The destructive interference size is intended to be used for layout,
+and thus has ABI impact. The default value is not expected to be
+stable, and on some targets varies with @option{-mtune}, so use of
+this variable in a context where ABI stability is important, such as
+the public interface of a library, is strongly discouraged; if it is
+used in that context, users can stabilize the value using this
+option.
+
+The constructive interference size is less sensitive, as it is
+typically only used in a @samp{static_assert} to make sure that a type
+fits within a cache line.
+
+See also @option{-Winterference-size}.
+
@item loop-interchange-max-num-stmts
The maximum number of stmts in a loop to be interchanged.
diff --git a/gcc/params.opt b/gcc/params.opt
index 3a701e22c46..658ca028851 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -361,6 +361,22 @@ The maximum code size growth ratio when expanding into a jump table (in percent)
Common Joined UInteger Var(param_l1_cache_line_size) Init(32) Param Optimization
The size of L1 cache line.
+-param=destructive-interference-size=
+Common Joined UInteger Var(param_destruct_interfere_size) Init(0) Param Optimization
+The minimum recommended offset between two concurrently-accessed objects to
+avoid additional performance degradation due to contention introduced by the
+implementation. Typically the L1 cache line size, but can be larger to
+accommodate a variety of target processors with different cache line sizes.
+C++17 code might use this value in structure layout, but is strongly
+discouraged from doing so in public ABIs.
+
+-param=constructive-interference-size=
+Common Joined UInteger Var(param_construct_interfere_size) Init(0) Param Optimization
+The maximum recommended size of contiguous memory occupied by two objects
+accessed with temporal locality by concurrent threads. Typically the L1 cache
+line size, but can be smaller to accommodate a variety of target processors with
+different cache line sizes.
+
-param=l1-cache-size=
Common Joined UInteger Var(param_l1_cache_size) Init(64) Param Optimization
The size of L1 cache.
diff --git a/gcc/testsuite/g++.dg/warn/Winterference-2.C b/gcc/testsuite/g++.dg/warn/Winterference-2.C
new file mode 100644
index 00000000000..2af75c63f83
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference-2.C
@@ -0,0 +1,14 @@
+// { dg-do compile { target c++20 } }
+// { dg-additional-options -fmodules-ts }
+
+module ;
+
+#include <new>
+
+export module foo;
+
+export {
+ struct A {
+ alignas(std::hardware_destructive_interference_size) int x; // { dg-warning Winterference-size }
+ };
+}
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.C b/gcc/testsuite/g++.dg/warn/Winterference.C
new file mode 100644
index 00000000000..57c001bc032
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.C
@@ -0,0 +1,6 @@
+// Test that we warn about use of std::hardware_destructive_interference_size
+// in a header.
+// { dg-do compile { target c++17 } }
+
+// { dg-warning Winterference-size "" { target *-*-* } 0 }
+#include "Winterference.H"
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.H b/gcc/testsuite/g++.dg/warn/Winterference.H
new file mode 100644
index 00000000000..36f0ad5f6d1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.H
@@ -0,0 +1,7 @@
+#include <new>
+
+struct A
+{
+ alignas(std::hardware_destructive_interference_size) int i;
+ alignas(std::hardware_destructive_interference_size) int j;
+};
diff --git a/gcc/testsuite/g++.target/aarch64/interference.C b/gcc/testsuite/g++.target/aarch64/interference.C
new file mode 100644
index 00000000000..0fc01655223
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Most AArch64 CPUs have an L1 cache line size of 64, but some recent ones use
+// 128 or even 256.
+static_assert(std::hardware_destructive_interference_size == 256);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/arm/interference.C b/gcc/testsuite/g++.target/arm/interference.C
new file mode 100644
index 00000000000..34fe8a52bff
--- /dev/null
+++ b/gcc/testsuite/g++.target/arm/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Recent ARM CPUs have a cache line size of 64. Older ones have
+// a size of 32, but I guess they're old enough that we don't care?
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/i386/interference.C b/gcc/testsuite/g++.target/i386/interference.C
new file mode 100644
index 00000000000..c7b910e3ada
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/interference.C
@@ -0,0 +1,8 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// It is generally agreed that these are the right values for all x86.
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/libstdc++-v3/include/std/version b/libstdc++-v3/include/std/version
index f950bf0f0db..f41004b5911 100644
--- a/libstdc++-v3/include/std/version
+++ b/libstdc++-v3/include/std/version
@@ -140,6 +140,9 @@
#define __cpp_lib_filesystem 201703
#define __cpp_lib_gcd 201606
#define __cpp_lib_gcd_lcm 201606
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+#endif
#define __cpp_lib_hypot 201603
#define __cpp_lib_invoke 201411L
#define __cpp_lib_lcm 201606
diff --git a/libstdc++-v3/libsupc++/new b/libstdc++-v3/libsupc++/new
index 3349b13fd1b..7bc67a6cb02 100644
--- a/libstdc++-v3/libsupc++/new
+++ b/libstdc++-v3/libsupc++/new
@@ -183,9 +183,9 @@ inline void operator delete[](void*, void*) _GLIBCXX_USE_NOEXCEPT { }
} // extern "C++"
#if __cplusplus >= 201703L
-#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
namespace std
{
+#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
#define __cpp_lib_launder 201606
/// Pointer optimization barrier [ptr.launder]
template<typename _Tp>
@@ -205,8 +205,14 @@ namespace std
void launder(const void*) = delete;
void launder(volatile void*) = delete;
void launder(const volatile void*) = delete;
-}
#endif // _GLIBCXX_HAVE_BUILTIN_LAUNDER
+
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+ inline constexpr size_t hardware_destructive_interference_size = __GCC_DESTRUCTIVE_SIZE;
+ inline constexpr size_t hardware_constructive_interference_size = __GCC_CONSTRUCTIVE_SIZE;
+#endif // __GCC_DESTRUCTIVE_SIZE
+}
#endif // C++17
#if __cplusplus > 201703L
</cut>
Identified regression caused by *gcc:76b75018b3d053a890ebe155e47814de14b3c9fb*:
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
c++: implement C++17 hardware interference size
Results regressed to (for first_bad == 76b75018b3d053a890ebe155e47814de14b3c9fb)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# First few build errors in logs:
from (for last_good == 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe gcc:
2
# build_abe linux:
4
# build_abe glibc:
5
# build_abe gdb:
6
This commit has regressed these CI configurations:
- tcwg_gnu_native_build/master-arm
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac…
Even more details: https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac…
Reproduce builds:
<cut>
mkdir investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
cd investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_gnu-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 76b75018b3d053a890ebe155e47814de14b3c9fb
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
Date: Thu Jul 15 15:30:17 2021 -0400
c++: implement C++17 hardware interference size
The last missing piece of the C++17 standard library is the hardware
intereference size constants. Much of the delay in implementing these has
been due to uncertainty about what the right values are, and even whether
there is a single constant value that is suitable; the destructive
interference size is intended to be used in structure layout, so program
ABIs will depend on it.
In principle, both of these values should be the same as the target's L1
cache line size. When compiling for a generic target that is intended to
support a range of target CPUs with different cache line sizes, the
constructive size should probably be the minimum size, and the destructive
size the maximum, unless you are constrained by ABI compatibility with
previous code.
From discussion on gcc-patches, I've come to the conclusion that the
solution to the difficulty of choosing stable values is to give up on it,
and instead encourage only uses where ABI stability is unimportant: in
particular, uses where the ABI is shared at most between translation units
built at the same time with the same flags.
To that end, I've added a warning for any use of the constant value of
std::hardware_destructive_interference_size in a header or module export.
Appropriate uses within a project can disable the warning.
A previous iteration of this patch included an -finterference-tune flag to
make the value vary with -mtune; this iteration makes that the default
behavior, which should be appropriate for all reasonable uses of the
variable. The previous default of "stable-ish" seems to me likely to have
been more of an attractive nuisance; since we can't promise actual
stability, we should instead make proper uses more convenient.
JF Bastien's implementation proposal is summarized at
https://github.com/itanium-cxx-abi/cxx-abi/issues/74
I implement this by adding new --params for the two sizes. Targets can
override these values in targetm.target_option.override() to support a range
of values for the generic target; otherwise, both will default to the L1
cache line size.
64 bytes still seems correct for all x86.
I'm not sure why he proposed 64/64 for generic 32-bit ARM, since the Cortex
A9 has a 32-byte cache line, so I'd think 32/64 would make more sense.
He proposed 64/128 for generic AArch64, but since the A64FX now has a 256B
cache line, I've changed that to 64/256.
Other arch maintainers are invited to set ranges for their generic targets
if that seems better than using the default cache line size for both values.
With the above choice to reject stability as a goal, getting these values
"right" is now just a matter of what we want the default optimization to be,
and we can feel free to adjust them as CPUs with different cache lines
become more and less common.
gcc/ChangeLog:
* params.opt: Add destructive-interference-size and
constructive-interference-size.
* doc/invoke.texi: Document them.
* config/aarch64/aarch64.c (aarch64_override_options_internal):
Set them.
* config/arm/arm.c (arm_option_override): Set them.
* config/i386/i386-options.c (ix86_option_override_internal):
Set them.
gcc/c-family/ChangeLog:
* c.opt: Add -Winterference-size.
* c-cppbuiltin.c (cpp_atomic_builtins): Add __GCC_DESTRUCTIVE_SIZE
and __GCC_CONSTRUCTIVE_SIZE.
gcc/cp/ChangeLog:
* constexpr.c (maybe_warn_about_constant_value):
Complain about std::hardware_destructive_interference_size.
(cxx_eval_constant_expression): Call it.
* decl.c (cxx_init_decl_processing): Check
--param *-interference-size values.
libstdc++-v3/ChangeLog:
* include/std/version: Define __cpp_lib_hardware_interference_size.
* libsupc++/new: Define hardware interference size variables.
gcc/testsuite/ChangeLog:
* g++.dg/warn/Winterference.H: New file.
* g++.dg/warn/Winterference.C: New test.
* g++.target/aarch64/interference.C: New test.
* g++.target/arm/interference.C: New test.
* g++.target/i386/interference.C: New test.
---
gcc/c-family/c-cppbuiltin.c | 14 ++++++
gcc/c-family/c.opt | 5 ++
gcc/config/aarch64/aarch64.c | 22 +++++++++
gcc/config/arm/arm.c | 22 +++++++++
gcc/config/i386/i386-options.c | 6 +++
gcc/cp/constexpr.c | 33 +++++++++++++
gcc/cp/decl.c | 32 ++++++++++++
gcc/doc/invoke.texi | 65 +++++++++++++++++++++++++
gcc/params.opt | 16 ++++++
gcc/testsuite/g++.dg/warn/Winterference-2.C | 14 ++++++
gcc/testsuite/g++.dg/warn/Winterference.C | 6 +++
gcc/testsuite/g++.dg/warn/Winterference.H | 7 +++
gcc/testsuite/g++.target/aarch64/interference.C | 9 ++++
gcc/testsuite/g++.target/arm/interference.C | 9 ++++
gcc/testsuite/g++.target/i386/interference.C | 8 +++
libstdc++-v3/include/std/version | 3 ++
libstdc++-v3/libsupc++/new | 10 +++-
17 files changed, 279 insertions(+), 2 deletions(-)
diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 48cbefd8bf8..ce88e707127 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -741,6 +741,20 @@ cpp_atomic_builtins (cpp_reader *pfile)
builtin_define_with_int_value ("__GCC_ATOMIC_TEST_AND_SET_TRUEVAL",
targetm.atomic_test_and_set_trueval);
+ /* Macros for C++17 hardware interference size constants. Either both or
+ neither should be set. */
+ gcc_assert (!param_destruct_interfere_size
+ == !param_construct_interfere_size);
+ if (param_destruct_interfere_size)
+ {
+ /* FIXME The way of communicating these values to the library should be
+ part of the C++ ABI, whether macro or builtin. */
+ builtin_define_with_int_value ("__GCC_DESTRUCTIVE_SIZE",
+ param_destruct_interfere_size);
+ builtin_define_with_int_value ("__GCC_CONSTRUCTIVE_SIZE",
+ param_construct_interfere_size);
+ }
+
/* ptr_type_node can't be used here since ptr_mode is only set when
toplev calls backend_init which is not done with -E or pch. */
psize = POINTER_SIZE_UNITS;
diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index c5fe90003f2..9c151d19870 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -722,6 +722,11 @@ Winit-list-lifetime
C++ ObjC++ Var(warn_init_list) Warning Init(1)
Warn about uses of std::initializer_list that can result in dangling pointers.
+Winterference-size
+C++ ObjC++ Var(warn_interference_size) Warning Init(1)
+Warn about nonsensical values of --param destructive-interference-size or
+constructive-interference-size.
+
Wimplicit
C ObjC Var(warn_implicit) Warning LangEnabledBy(C ObjC,Wall)
Warn about implicit declarations.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 30d9a0b7a3d..36519ccc5a5 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16540,6 +16540,28 @@ aarch64_override_options_internal (struct gcc_options *opts)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l1_cache_line_size,
aarch64_tune_params.prefetch->l1_cache_line_size);
+
+ if (aarch64_tune_params.prefetch->l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic AArch64 target, cover the current range of cache line
+ sizes. */
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ 256);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ 64);
+ }
+
if (aarch64_tune_params.prefetch->l2_cache_size >= 0)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l2_cache_size,
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1e628253d0..6c6e77fab66 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3669,6 +3669,28 @@ arm_option_override (void)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_line_size,
current_tune->prefetch.l1_cache_line_size);
+ if (current_tune->prefetch.l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic ARM target, JF Bastien proposed using 64 for both. */
+ /* ??? Cortex A9 has a 32-byte cache line, so why not 32 for
+ constructive? */
+ /* More recent Cortex chips have a 64-byte cache line, but are marked
+ ARM_PREFETCH_NOT_BENEFICIAL, so they get these defaults. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+ }
+
if (current_tune->prefetch.l1_cache_size >= 0)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_size,
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 2cb87cedec0..c0006b3674b 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2579,6 +2579,12 @@ ix86_option_override_internal (bool main_args_p,
SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
ix86_tune_cost->l2_cache_size);
+ /* 64B is the accepted value for these for all x86. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+
/* Enable sw prefetching at -O3 for CPUS that prefetching is helpful. */
if (opts->x_flag_prefetch_loop_arrays < 0
&& HAVE_prefetch
diff --git a/gcc/cp/constexpr.c b/gcc/cp/constexpr.c
index 7772fe62d95..0c2498aee22 100644
--- a/gcc/cp/constexpr.c
+++ b/gcc/cp/constexpr.c
@@ -6075,6 +6075,37 @@ inline_asm_in_constexpr_error (location_t loc)
"%<constexpr%> function in C++20");
}
+/* We're getting the constant value of DECL in a manifestly constant-evaluated
+ context; maybe complain about that. */
+
+static void
+maybe_warn_about_constant_value (location_t loc, tree decl)
+{
+ static bool explained = false;
+ if (cxx_dialect >= cxx17
+ && warn_interference_size
+ && !global_options_set.x_param_destruct_interfere_size
+ && DECL_CONTEXT (decl) == std_node
+ && id_equal (DECL_NAME (decl), "hardware_destructive_interference_size")
+ && (LOCATION_FILE (input_location) != main_input_filename
+ || module_exporting_p ())
+ && warning_at (loc, OPT_Winterference_size, "use of %qD", decl)
+ && !explained)
+ {
+ explained = true;
+ inform (loc, "its value can vary between compiler versions or "
+ "with different %<-mtune%> or %<-mcpu%> flags");
+ inform (loc, "if this use is part of a public ABI, change it to "
+ "instead use a constant variable you define");
+ inform (loc, "the default value for the current CPU tuning "
+ "is %d bytes", param_destruct_interfere_size);
+ inform (loc, "you can stabilize this value with %<--param "
+ "hardware_destructive_interference_size=%d%>, or disable "
+ "this warning with %<-Wno-interference-size%>",
+ param_destruct_interfere_size);
+ }
+}
+
/* Attempt to reduce the expression T to a constant value.
On failure, issue diagnostic and return error_mark_node. */
/* FIXME unify with c_fully_fold */
@@ -6219,6 +6250,8 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, tree t,
r = *p;
break;
}
+ if (ctx->manifestly_const_eval)
+ maybe_warn_about_constant_value (loc, t);
if (COMPLETE_TYPE_P (TREE_TYPE (t))
&& is_really_empty_class (TREE_TYPE (t), /*ignore_vptr*/false))
{
diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index bce62ad202a..c2065027369 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -4752,6 +4752,38 @@ cxx_init_decl_processing (void)
/* Show we use EH for cleanups. */
if (flag_exceptions)
using_eh_for_cleanups ();
+
+ /* Check that the hardware interference sizes are at least
+ alignof(max_align_t), as required by the standard. */
+ const int max_align = max_align_t_align () / BITS_PER_UNIT;
+ if (param_destruct_interfere_size)
+ {
+ if (param_destruct_interfere_size < max_align)
+ error ("%<--param destructive-interference-size=%d%> is less than "
+ "%d", param_destruct_interfere_size, max_align);
+ else if (param_destruct_interfere_size < param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param destructive-interference-size=%d%> "
+ "is less than %<--param l1-cache-line-size=%d%>",
+ param_destruct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_destruct_interfere_size = param_l1_cache_line_size;
+ /* else leave it unset. */
+
+ if (param_construct_interfere_size)
+ {
+ if (param_construct_interfere_size < max_align)
+ error ("%<--param constructive-interference-size=%d%> is less than "
+ "%d", param_construct_interfere_size, max_align);
+ else if (param_construct_interfere_size > param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param constructive-interference-size=%d%> "
+ "is greater than %<--param l1-cache-line-size=%d%>",
+ param_construct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_construct_interfere_size = param_l1_cache_line_size;
}
/* Enter an abi node in global-module context. returns a cookie to
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23cc68f92b5..78cfc100ac2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9018,6 +9018,43 @@ that has already been done in the current function. Therefore,
seemingly insignificant changes in the source program can cause the
warnings produced by @option{-Winline} to appear or disappear.
+@item -Winterference-size
+@opindex Winterference-size
+Warn about use of C++17 @code{std::hardware_destructive_interference_size}
+without specifying its value with @option{--param destructive-interference-size}.
+Also warn about questionable values for that option.
+
+This variable is intended to be used for controlling class layout, to
+avoid false sharing in concurrent code:
+
+@smallexample
+struct independent_fields @{
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
+@};
+@end smallexample
+
+Here @samp{one} and @samp{two} are intended to be far enough apart
+that stores to one won't require accesses to the other to reload the
+cache line.
+
+By default, @option{--param destructive-interference-size} and
+@option{--param constructive-interference-size} are set based on the
+current @option{-mtune} option, typically to the L1 cache line size
+for the particular target CPU, sometimes to a range if tuning for a
+generic target. So all translation units that depend on ABI
+compatibility for the use of these variables must be compiled with
+the same @option{-mtune} (or @option{-mcpu}).
+
+If ABI stability is important, such as if the use is in a header for a
+library, you should probably not use the hardware interference size
+variables at all. Alternatively, you can force a particular value
+with @option{--param}.
+
+If you are confident that your use of the variable does not affect ABI
+outside a single build of your project, you can turn off the warning
+with @option{-Wno-interference-size}.
+
@item -Wint-in-bool-context
@opindex Wint-in-bool-context
@opindex Wno-int-in-bool-context
@@ -13938,6 +13975,34 @@ prefetch hints can be issued for any constant stride.
This setting is only useful for strides that are known and constant.
+@item destructive-interference-size
+@item constructive-interference-size
+The values for the C++17 variables
+@code{std::hardware_destructive_interference_size} and
+@code{std::hardware_constructive_interference_size}. The destructive
+interference size is the minimum recommended offset between two
+independent concurrently-accessed objects; the constructive
+interference size is the maximum recommended size of contiguous memory
+accessed together. Typically both will be the size of an L1 cache
+line for the target, in bytes. For a generic target covering a range of L1
+cache line sizes, typically the constructive interference size will be
+the small end of the range and the destructive size will be the large
+end.
+
+The destructive interference size is intended to be used for layout,
+and thus has ABI impact. The default value is not expected to be
+stable, and on some targets varies with @option{-mtune}, so use of
+this variable in a context where ABI stability is important, such as
+the public interface of a library, is strongly discouraged; if it is
+used in that context, users can stabilize the value using this
+option.
+
+The constructive interference size is less sensitive, as it is
+typically only used in a @samp{static_assert} to make sure that a type
+fits within a cache line.
+
+See also @option{-Winterference-size}.
+
@item loop-interchange-max-num-stmts
The maximum number of stmts in a loop to be interchanged.
diff --git a/gcc/params.opt b/gcc/params.opt
index 3a701e22c46..658ca028851 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -361,6 +361,22 @@ The maximum code size growth ratio when expanding into a jump table (in percent)
Common Joined UInteger Var(param_l1_cache_line_size) Init(32) Param Optimization
The size of L1 cache line.
+-param=destructive-interference-size=
+Common Joined UInteger Var(param_destruct_interfere_size) Init(0) Param Optimization
+The minimum recommended offset between two concurrently-accessed objects to
+avoid additional performance degradation due to contention introduced by the
+implementation. Typically the L1 cache line size, but can be larger to
+accommodate a variety of target processors with different cache line sizes.
+C++17 code might use this value in structure layout, but is strongly
+discouraged from doing so in public ABIs.
+
+-param=constructive-interference-size=
+Common Joined UInteger Var(param_construct_interfere_size) Init(0) Param Optimization
+The maximum recommended size of contiguous memory occupied by two objects
+accessed with temporal locality by concurrent threads. Typically the L1 cache
+line size, but can be smaller to accommodate a variety of target processors with
+different cache line sizes.
+
-param=l1-cache-size=
Common Joined UInteger Var(param_l1_cache_size) Init(64) Param Optimization
The size of L1 cache.
diff --git a/gcc/testsuite/g++.dg/warn/Winterference-2.C b/gcc/testsuite/g++.dg/warn/Winterference-2.C
new file mode 100644
index 00000000000..2af75c63f83
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference-2.C
@@ -0,0 +1,14 @@
+// { dg-do compile { target c++20 } }
+// { dg-additional-options -fmodules-ts }
+
+module ;
+
+#include <new>
+
+export module foo;
+
+export {
+ struct A {
+ alignas(std::hardware_destructive_interference_size) int x; // { dg-warning Winterference-size }
+ };
+}
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.C b/gcc/testsuite/g++.dg/warn/Winterference.C
new file mode 100644
index 00000000000..57c001bc032
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.C
@@ -0,0 +1,6 @@
+// Test that we warn about use of std::hardware_destructive_interference_size
+// in a header.
+// { dg-do compile { target c++17 } }
+
+// { dg-warning Winterference-size "" { target *-*-* } 0 }
+#include "Winterference.H"
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.H b/gcc/testsuite/g++.dg/warn/Winterference.H
new file mode 100644
index 00000000000..36f0ad5f6d1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.H
@@ -0,0 +1,7 @@
+#include <new>
+
+struct A
+{
+ alignas(std::hardware_destructive_interference_size) int i;
+ alignas(std::hardware_destructive_interference_size) int j;
+};
diff --git a/gcc/testsuite/g++.target/aarch64/interference.C b/gcc/testsuite/g++.target/aarch64/interference.C
new file mode 100644
index 00000000000..0fc01655223
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Most AArch64 CPUs have an L1 cache line size of 64, but some recent ones use
+// 128 or even 256.
+static_assert(std::hardware_destructive_interference_size == 256);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/arm/interference.C b/gcc/testsuite/g++.target/arm/interference.C
new file mode 100644
index 00000000000..34fe8a52bff
--- /dev/null
+++ b/gcc/testsuite/g++.target/arm/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Recent ARM CPUs have a cache line size of 64. Older ones have
+// a size of 32, but I guess they're old enough that we don't care?
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/i386/interference.C b/gcc/testsuite/g++.target/i386/interference.C
new file mode 100644
index 00000000000..c7b910e3ada
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/interference.C
@@ -0,0 +1,8 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// It is generally agreed that these are the right values for all x86.
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/libstdc++-v3/include/std/version b/libstdc++-v3/include/std/version
index f950bf0f0db..f41004b5911 100644
--- a/libstdc++-v3/include/std/version
+++ b/libstdc++-v3/include/std/version
@@ -140,6 +140,9 @@
#define __cpp_lib_filesystem 201703
#define __cpp_lib_gcd 201606
#define __cpp_lib_gcd_lcm 201606
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+#endif
#define __cpp_lib_hypot 201603
#define __cpp_lib_invoke 201411L
#define __cpp_lib_lcm 201606
diff --git a/libstdc++-v3/libsupc++/new b/libstdc++-v3/libsupc++/new
index 3349b13fd1b..7bc67a6cb02 100644
--- a/libstdc++-v3/libsupc++/new
+++ b/libstdc++-v3/libsupc++/new
@@ -183,9 +183,9 @@ inline void operator delete[](void*, void*) _GLIBCXX_USE_NOEXCEPT { }
} // extern "C++"
#if __cplusplus >= 201703L
-#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
namespace std
{
+#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
#define __cpp_lib_launder 201606
/// Pointer optimization barrier [ptr.launder]
template<typename _Tp>
@@ -205,8 +205,14 @@ namespace std
void launder(const void*) = delete;
void launder(volatile void*) = delete;
void launder(const volatile void*) = delete;
-}
#endif // _GLIBCXX_HAVE_BUILTIN_LAUNDER
+
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+ inline constexpr size_t hardware_destructive_interference_size = __GCC_DESTRUCTIVE_SIZE;
+ inline constexpr size_t hardware_constructive_interference_size = __GCC_CONSTRUCTIVE_SIZE;
+#endif // __GCC_DESTRUCTIVE_SIZE
+}
#endif // C++17
#if __cplusplus > 201703L
</cut>