- linaro-toolchain - lists.linaro.org

Interest in reproducing gcc-linaro-4.9-2016.02 arm-linux-gnueabihf target under darwin and linux aarch64 host

by jhgorse＠gmail.com

Hello,

I see this release gcc-linaro-4.9-2016.02 for x86_64_arm-linux-gnueabihf: https://releases.linaro.org/components/toolchain/binaries/4.9-2016.02/arm-l… and would like to reproduce the toolchain for aarch64 hosts. I see that it was built with ABE, though I have generally been unsuccessful in getting ABE to work on aarch64 for this. I was looking for some build or CI breadcrumbs or documentation. What can you recommend?

The motivation here is to support legacy development/testing from modern aarch64 hardware.

Cheers,
Joe Gorse

4 years, 1 month


[Activity] Week #10

by Thiago Jung Bauermann

Hello,

# GDB support for ARMv9 Scalable Matrix Extension (SME)

- Synced with Luis Machado to learn what the current status is. Read discussions in the linux-arm-kernel mailing list which he pointed to.
- Read Arm architecture documentation about Neon, SVE, SVE2 and SME to familiarise myself with these features.
- Basic setup / onboarding
  - Joined some internal and external mailing lists, IRC and Slack channels.
  - Read some company policy documents.
  - Researched models and got a quote for a work laptop.
  - Set up aarch64 cross-compilation environment on my laptop.
  - Set up emulated aarch64 machine with Fedora on my laptop.
  - Attempted setting up emulated aarch64 machine with Ubuntu on my laptop, but ran into problems with the Ubuntu Server installer.

--
Thiago

4 years, 1 month


[ACTIVITY] week ending Mar. 6 2022

by Alex Bennée

Project Stratos
===============

- spent some time talking through design approaches for xen vhost-master with Viresh

Linux RPMB Sub-system and virtio-driver ([STR-40])

- continued working on [Linux driver]
- discovered a bug in vhost-user config handling in QEMU as well

[STR-40] <https://linaro.atlassian.net/browse/STR-40>

[Linux driver] <http://git.linaro.org/people/alex.bennee/linux.git/shortlog/refs/heads/rpmb…>

QEMU Upstream Work ([UM-2])
===========================

- posted [PULL 00/18] testing and semihosting updates
  Message-Id: <20220301094715.550871-1-alex.bennee(a)linaro.org>

Other
=====

- started work on presentation for LTD

Completed Reviews [5/5]
=======================

[PATCH] gdbstub.c: add support for info proc mappings
  Message-Id: <20220221030910.3203063-1-dominik.b.czarnota(a)gmail.com>

[PATCH] tests/Makefile.include: Let "make clean" remove the TCG tests, too
  Message-Id: <20220301085900.1443232-1-thuth(a)redhat.com>

[PATCH 0/3] gdbstub: add support for switchable endianness
  Message-Id: <20210823142004.17935-1-changbin.du(a)gmail.com>

[PATCH 0/6] More record/replay acceptance tests
  Message-Id: <162332427732.194926.7555369160312506539.stgit@pasha-ThinkPad-X280>

[PATCH v6 00/43] CXl 2.0 emulation Support
  Message-Id: <20220211120747.3074-1-Jonathan.Cameron(a)huawei.com>

Absences
========

Current Review Queue
====================

TODO [PATCH v4 00/18] target/arm: Implement LVA, LPA, LPA2 features
  Message-Id: <20220301215958.157011-1-richard.henderson(a)linaro.org>

TODO [RFC PATCH 00/27] Virtio sound card implementation
  Message-Id: <20210429120445.694420-1-chouhan.shreyansh2702(a)gmail.com>

TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
  Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>

--
Alex Bennée

4 years, 2 months


[ACTIVITY] report week ending 4 Mar

by Peter Maydell

Progress

* UM-2 [QEMU upstream maintainership]
  + Looked at and sent patches to fix a minor decode error for Neon VLD1/VST1 that RTH found
  + softfreeze is next Tuesday -- sent out last big Arm pullreq before freeze, though there will probably need to be another smaller one
  + code review, respinning previously sent patches, looking at bug reports, all to get things in before freeze
* QEMU-420 [GICv4 emulation]
  + All the GICv4.0 stuff is now code-complete, but testing and loose ends (like plumbing it into the virt board) will take a while still.

-- PMM

4 years, 2 months


[weekly][linaro] report week ending 25 Feb

by Peter Maydell

Progress:

* UM-2 [QEMU upstream maintainership]
  + Respins of a few patchsets that needed v2
  + Looked at a few bugs since softfreeze for 7.0 is near
  + Amazingly my to-review queue is now almost empty
* QEMU-420 [GICv4 emulation]
  + Implemented more of the redistributor code -- the last missing big piece is its handling of VMOVI, though there are also probably some loose ends to tidy up
  + Note that this isn't going to be in time for 7.0, so will likely go on the back-burner a bit in favour of release-critical items

thanks
-- PMM

4 years, 2 months


[ACTIVITY] week ending Feb. 27 2022

by Alex Bennée

Project Stratos
===============

- spent more time troubleshooting Xen builds with Viresh

Linux RPMB Sub-system and virtio-driver ([STR-40])

- continued working on [Linux driver]
- discovered a bug in vhost-user config handling in QEMU as well

[STR-40] <https://linaro.atlassian.net/browse/STR-40>

[Linux driver] <http://git.linaro.org/people/alex.bennee/linux.git/shortlog/refs/heads/rpmb…>

QEMU Upstream Work ([UM-2])
===========================

- follow-up on Analysis of slow distro boots in check-avocado (BootLinuxAarch64.test_virt_tcg*)
  Message-Id: <874k4xbqvp.fsf(a)linaro.org>
- posted [PATCH v2 00/18] testing and semihosting pre-PR
  Message-Id: <20220225172021.3493923-1-alex.bennee(a)linaro.org>

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Current Review Queue
====================

TODO [RFC PATCH 00/27] Virtio sound card implementation
  Message-Id: <20210429120445.694420-1-chouhan.shreyansh2702(a)gmail.com>

TODO [PATCH v6 00/43] CXl 2.0 emulation Support
  Message-Id: <20220211120747.3074-1-Jonathan.Cameron(a)huawei.com>

TODO [PATCH v2 00/15] target/arm: Implement LVA, LPA, LPA2 features
  Message-Id: <20220210040423.95120-1-richard.henderson(a)linaro.org>

--
Alex Bennée

4 years, 2 months


[ACTIVITY] week ending Feb. 20 2022

by Alex Bennée

Project Stratos
===============

- spent more time troubleshooting Xen builds with Viresh

Linux RPMB Sub-system and virtio-driver ([STR-40])

- started working on v2 of the Linux driver

QEMU Upstream Work ([UM-2])
===========================

- posted Analysis of slow distro boots in check-avocado (BootLinuxAarch64.test_virt_tcg*)
  Message-Id: <874k4xbqvp.fsf(a)linaro.org>

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Current Review Queue
====================

TODO [RFC PATCH 00/27] Virtio sound card implementation
  Message-Id: <20210429120445.694420-1-chouhan.shreyansh2702(a)gmail.com>

TODO [PATCH v6 00/43] CXl 2.0 emulation Support
  Message-Id: <20220211120747.3074-1-Jonathan.Cameron(a)huawei.com>

TODO [PATCH v2 00/15] target/arm: Implement LVA, LPA, LPA2 features
  Message-Id: <20220210040423.95120-1-richard.henderson(a)linaro.org>

--
Alex Bennée

4 years, 2 months


[ACTIVITY] report week ending 18 Feb

by Peter Maydell

Progress (a report covering two half-weeks)

* UM-2 [QEMU upstream maintainership]
  - lots of code review
  - fixed another bug in the armv7m clock framework code
  - refactoring patchset to trim some fat from a header that gets included by every C file in the build
* QEMU-420 [GICv4 emulation]
  - CPU interface parts of GICv4 work are code-complete
  - started on the redistributor work

-- PMM

4 years, 2 months


[ACTIVITY] week ending Feb. 13 2022

by Alex Bennée

Project Stratos
===============

- posted Metadata and signalling channels for Zephyr virtio-backends on Xen
  Message-Id: <87h79bgd1m.fsf(a)linaro.org>
- spent some time troubleshooting Xen builds with Viresh

vhost-device maintainer effort ([UM-196])

- posted [a pull request in rust-vmm/community]

[a pull request in rust-vmm/community] <elfeed:github.com#tag:github.com,2008:PullRequestEvent/20180885703>

QEMU Upstream Work ([UM-2])
===========================

- posted [RFC PATCH] tcg/optimize: only read val after const check
  Message-Id: <20220209112142.3367525-1-alex.bennee(a)linaro.org>
- posted [PULL 00/28] testing and plugin updates
  Message-Id: <20220209141529.3418384-1-alex.bennee(a)linaro.org>
- triage for [qemu-x86_64 uses host libraries instead of emulated system libraries]
- triage for [linux-user: substantial memory leak when threads are created and destroyed]
- posted [RFC PATCH] linux-user: trap internal SIGABRT's
  Message-Id: <20220209112207.3368139-1-alex.bennee(a)linaro.org>
- posted [PATCH v5 0/2] semihosting/next (SYS_HEAPINFO)
  Message-Id: <20220210113021.3799514-2-alex.bennee(a)linaro.org>
- posted [PATCH v1 00/11] testing/next (docker, lcitool, ci, tcg)
  Message-Id: <20220211160309.335014-1-alex.bennee(a)linaro.org>

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

[qemu-x86_64 uses host libraries instead of emulated system libraries] <elfeed:gitlab.com#https://gitlab.com/qemu-project/qemu/-/issues/857>

[linux-user: substantial memory leak when threads are created and destroyed] <elfeed:gitlab.com#https://gitlab.com/qemu-project/qemu/-/issues/866>

Upstream MTTCG tests ([QEMU-52])

- still waiting final review of [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
  Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>

Completed Reviews [0/0]
=======================

Absences
========

Current Review Queue
====================

TODO [PATCH v6 00/43] CXl 2.0 emulation Support
  Message-Id: <20220211120747.3074-1-Jonathan.Cameron(a)huawei.com>

TODO [PATCH v2 00/15] target/arm: Implement LVA, LPA, LPA2 features
  Message-Id: <20220210040423.95120-1-richard.henderson(a)linaro.org>

TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
  Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>

--
Alex Bennée

4 years, 3 months


Re: [TCWG CI] 401.bzip2 grew in size by 9% after llvm: [LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`

by Maxim Kuvyrkov

Hi Roman,

Your below patch increased code-size of 401.bzip2 by 9% on 32-bit ARM when compiled with -Os. That’s quite a lot, would you please investigate whether this regression can be avoided?

Please let me know if this doesn’t reproduce for you and I’ll try to help.

Thank you,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 9 Feb 2022, at 17:10, ci_notify(a)linaro.org wrote:
>
> After llvm commit 77a0da926c9ea86afa9baf28158d79c7678fc6b9
> Author: Roman Lebedev <lebedev.ri(a)gmail.com>
>
>     [LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
>
> the following benchmarks grew in size by more than 1%:
> - 401.bzip2 grew in size by 9% from 37909 to 41405 bytes
> - 401.bzip2:[.] BZ2_decompress grew in size by 42% from 7664 to 10864 bytes
> - 429.mcf grew in size by 2% from 7732 to 7908 bytes
>
> Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
>
> For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
> - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
> - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
> - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
>
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: Clang + Glibc + LLVM Linker
> - Version: all components were built from their tip of trunk
> - Target: arm-linux-gnueabihf
> - Compiler flags: -Os -mthumb
> - Hardware: APM Mustang 8x X-Gene1
>
> This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
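The percentages in the CI report above can be re-derived from the byte counts it quotes. The following is a minimal sketch in Python; the function name `growth_percent` and the `sizes` table are illustrative and not part of the TCWG CI scripts, and only the byte counts come from the report.

```python
# Re-derive the size-growth percentages from the byte counts quoted in the
# CI report. Helper and variable names are hypothetical.

def growth_percent(before: int, after: int) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# (before, after) sizes in bytes, copied from the report.
sizes = {
    "401.bzip2": (37909, 41405),
    "401.bzip2:[.] BZ2_decompress": (7664, 10864),
    "429.mcf": (7732, 7908),
}

for name, (before, after) in sizes.items():
    pct = growth_percent(before, after)
    print(f"{name}: {before} -> {after} bytes ({pct:.1f}% growth)")
```

Rounded to whole percent, the three results match the 9%, 42% and 2% figures stated in the report.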
>
> THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
>
> This commit has regressed these CI configurations:
> - tcwg_bmk_llvm_apm/llvm-master-aarch64-spec2k6-Os_LTO
> - tcwg_bmk_llvm_apm/llvm-master-arm-spec2k6-Os
> - tcwg_bmk_llvm_apm/llvm-master-arm-spec2k6-Os_LTO
>
> First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
> Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
> Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
> Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
>
> Reproduce builds:
> <cut>
> mkdir investigate-llvm-77a0da926c9ea86afa9baf28158d79c7678fc6b9
> cd investigate-llvm-77a0da926c9ea86afa9baf28158d79c7678fc6b9
>
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
>
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
> chmod +x artifacts/test.sh
>
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
>
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
>
> cd llvm
>
> # Reproduce first_bad build
> git checkout --detach 77a0da926c9ea86afa9baf28158d79c7678fc6b9
> ../artifacts/test.sh
>
> # Reproduce last_good build
> git checkout --detach f59787084e09aeb787cb3be3103b2419ccd14163
> ../artifacts/test.sh
>
> cd ..
> </cut>
>
> Full commit (up to 1000 lines):
> <cut>
> commit 77a0da926c9ea86afa9baf28158d79c7678fc6b9
> Author: Roman Lebedev <lebedev.ri(a)gmail.com>
> Date: Mon Feb 7 16:03:40 2022 +0300
>
>     [LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
>
>     D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model.
>     What it essentially does is prevents scalarized vectorization of masked memory operations:
>     ```
>     // TODO: Cost model for emulated masked load/store is completely
>     // broken. This hack guides the cost model to use an artificially
>     // high enough value to practically disable vectorization with such
>     // operations, except where previously deployed legality hack allowed
>     // using very low cost values. This is to avoid regressions coming simply
>     // from moving "masked load/store" check from legality to cost model.
>     // Masked Load/Gather emulation was previously never allowed.
>     // Limited number of Masked Store/Scatter emulation was allowed.
>     ```
>
>     While i don't really understand about what specifically `is completely broken`
>     was talking about, i believe that at least on X86 with AVX2-or-later,
>     this is no longer true. (or at least, i would like to know what is still broken).
>     So i would like to follow suit after D111460, and like wise disable that hack for AVX2+.
>
>     But since this was added for X86 specifically, let's just instead completely remove this hack.
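To make the effect of the removed hack concrete, here is a rough Python model of the decision logic described in the commit message and visible in the quoted diff. It is a sketch under stated assumptions, not LLVM code: the function names, the `kind` strings and the `hack_enabled` switch are hypothetical stand-ins, while the constant 3000000 and the default store threshold of 1 are taken from the quoted diff.

```python
# Python model of the removed cost-model hack. All names are hypothetical;
# only the two constants below come from the quoted diff.

HACK_COST = 3_000_000          # "artificially high" cost assigned by the removed code
STORES_TO_PREDICATE = 1        # default of the removed vectorize-num-stores-pred option


def use_emulated_mask_memref_hack(kind: str, num_pred_stores: int,
                                  stores_to_predicate: int = STORES_TO_PREDICATE) -> bool:
    """Mirror of the removed predicate: always true for predicated loads,
    true for predicated stores only once the loop has more predicated stores
    than the allowed number."""
    return kind == "load" or (kind == "store"
                              and num_pred_stores > stores_to_predicate)


def scalarization_cost(kind: str, modelled_cost: int, num_pred_stores: int,
                       hack_enabled: bool) -> int:
    """Return the cost the vectorizer would see for one predicated memory op.

    With the hack enabled, qualifying instructions get HACK_COST, which in
    practice disables vectorization; without it, the modelled cost is used.
    """
    if hack_enabled and use_emulated_mask_memref_hack(kind, num_pred_stores):
        return HACK_COST
    return modelled_cost


if __name__ == "__main__":
    # Illustrative numbers: a predicated load whose modelled cost is 2.
    print(scalarization_cost("load", 2, 0, hack_enabled=True))   # hack applies
    print(scalarization_cost("load", 2, 0, hack_enabled=False))  # hack removed
```

Once the artificial cost is gone, loops with predicated loads can be judged profitable to vectorize, and vectorized code with scalarized, predicated memory accesses tends to be larger. That is one plausible reading of the -Os size growth reported above, though the report itself does not confirm the mechanism.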
>
>     Reviewed By: RKSimon
>
>     Differential Revision: https://reviews.llvm.org/D114779
> ---
>  llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 34 +-
>  .../X86/masked-gather-i32-with-i8-index.ll | 40 +-
>  .../X86/masked-gather-i64-with-i8-index.ll | 40 +-
>  .../CostModel/X86/masked-interleaved-load-i16.ll | 36 +-
>  .../CostModel/X86/masked-interleaved-store-i16.ll | 24 +-
>  .../test/Analysis/CostModel/X86/masked-load-i16.ll | 46 +-
>  .../test/Analysis/CostModel/X86/masked-load-i32.ll | 16 +-
>  .../test/Analysis/CostModel/X86/masked-load-i64.ll | 16 +-
>  llvm/test/Analysis/CostModel/X86/masked-load-i8.ll | 46 +-
>  .../AArch64/tail-fold-uniform-memops.ll | 159 ++-
>  .../Transforms/LoopVectorize/X86/gather_scatter.ll | 1176 ++++++++++++++++----
>  .../X86/x86-interleaved-accesses-masked-group.ll | 1041 ++++++++---------
>  .../Transforms/LoopVectorize/if-pred-stores.ll | 6 +-
>  .../Transforms/LoopVectorize/memdep-fold-tail.ll | 6 +-
>  llvm/test/Transforms/LoopVectorize/optsize.ll | 837 +++++++++++---
>  llvm/test/Transforms/LoopVectorize/tripcount.ll | 673 ++++++++++-
>  .../LoopVectorize/vplan-sink-scalars-and-merge.ll | 4 +-
>  17 files changed, 3064 insertions(+), 1136 deletions(-)
>
> diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
> index bfe08d42c883..ccce2c2a7b15 100644
> --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
> +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
> @@ -307,11 +307,6 @@ static cl::opt<bool> InterleaveSmallLoopScalarReduction(
>      cl::desc("Enable interleaving for loops with small iteration counts that "
>               "contain scalar reductions to expose ILP."));
>
> -/// The number of stores in a loop that are allowed to need predication.
> -static cl::opt<unsigned> NumberOfStoresToPredicate(
> -    "vectorize-num-stores-pred", cl::init(1), cl::Hidden,
> -    cl::desc("Max number of stores to be predicated behind an if."));
> -
>  static cl::opt<bool> EnableIndVarRegisterHeur(
>      "enable-ind-var-reg-heur", cl::init(true), cl::Hidden,
>      cl::desc("Count the induction variable only once when interleaving"));
> @@ -1797,10 +1792,6 @@ private:
>    /// as a vector operation.
>    bool isConsecutiveLoadOrStore(Instruction *I);
>
> -  /// Returns true if an artificially high cost for emulated masked memrefs
> -  /// should be used.
> -  bool useEmulatedMaskMemRefHack(Instruction *I, ElementCount VF);
> -
>    /// Map of scalar integer values to the smallest bitwidth they can be legally
>    /// represented as. The vector equivalents of these values should be truncated
>    /// to this type.
> @@ -6437,22 +6428,6 @@ LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<ElementCount> VFs) {
>    return RUs;
>  }
>
> -bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I,
> -                                                           ElementCount VF) {
> -  // TODO: Cost model for emulated masked load/store is completely
> -  // broken. This hack guides the cost model to use an artificially
> -  // high enough value to practically disable vectorization with such
> -  // operations, except where previously deployed legality hack allowed
> -  // using very low cost values. This is to avoid regressions coming simply
> -  // from moving "masked load/store" check from legality to cost model.
> -  // Masked Load/Gather emulation was previously never allowed.
> -  // Limited number of Masked Store/Scatter emulation was allowed.
> -  assert(isPredicatedInst(I, VF) && "Expecting a scalar emulated instruction");
> -  return isa<LoadInst>(I) ||
> -         (isa<StoreInst>(I) &&
> -          NumPredStores > NumberOfStoresToPredicate);
> -}
> -
>  void LoopVectorizationCostModel::collectInstsToScalarize(ElementCount VF) {
>    // If we aren't vectorizing the loop, or if we've already collected the
>    // instructions to scalarize, there's nothing to do. Collection may already
> @@ -6478,9 +6453,7 @@ void LoopVectorizationCostModel::collectInstsToScalarize(ElementCount VF) {
>          ScalarCostsTy ScalarCosts;
>          // Do not apply discount if scalable, because that would lead to
>          // invalid scalarization costs.
> -        // Do not apply discount logic if hacked cost is needed
> -        // for emulated masked memrefs.
> -        if (!VF.isScalable() && !useEmulatedMaskMemRefHack(&I, VF) &&
> +        if (!VF.isScalable() &&
>              computePredInstDiscount(&I, ScalarCosts, VF) >= 0)
>            ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end());
>          // Remember that BB will remain after vectorization.
> @@ -6736,11 +6709,6 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
>          Vec_i1Ty, APInt::getAllOnes(VF.getKnownMinValue()),
>          /*Insert=*/false, /*Extract=*/true);
>      Cost += TTI.getCFInstrCost(Instruction::Br, TTI::TCK_RecipThroughput);
> -
> -    if (useEmulatedMaskMemRefHack(I, VF))
> -      // Artificially setting to a high enough value to practically disable
> -      // vectorization with such operations.
> - Cost = 3000000; > } > > return Cost; > diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll > index 62412a5d1af0..c52755b7d65c 100644 > --- a/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll > +++ b/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll > @@ -17,30 +17,30 @@ target triple = "x86_64-unknown-linux-gnu" > ; CHECK: LV: Checking a loop in "test" > ; > ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE2: LV: Found an estimated cost of 11 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE2: LV: Found an estimated cost of 22 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; > ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction: 
%valB.loaded = load i32, i32* %inB, align 4 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE42: LV: Found an estimated cost of 11 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; SSE42: LV: Found an estimated cost of 22 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; > ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX1: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX1: LV: Found an estimated cost of 9 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX1: LV: Found an estimated cost of 18 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX1: LV: Found an estimated cost of 36 for VF 32 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; > ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For 
instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX2-SLOWGATHER: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX2-SLOWGATHER: LV: Found an estimated cost of 9 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX2-SLOWGATHER: LV: Found an estimated cost of 18 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX2-SLOWGATHER: LV: Found an estimated cost of 36 for VF 32 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; > ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > @@ -50,8 +50,8 @@ target triple = "x86_64-unknown-linux-gnu" > ; AVX2-FASTGATHER: LV: Found an estimated cost of 48 for VF 32 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; > ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > -; AVX512: LV: Found an estimated cost of 10 for VF 2 For instruction: %valB.loaded = load 
i32, i32* %inB, align 4 > -; AVX512: LV: Found an estimated cost of 22 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX512: LV: Found an estimated cost of 5 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > +; AVX512: LV: Found an estimated cost of 11 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; AVX512: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; AVX512: LV: Found an estimated cost of 18 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > ; AVX512: LV: Found an estimated cost of 36 for VF 32 For instruction: %valB.loaded = load i32, i32* %inB, align 4 > diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll > index b8eba8b0327b..b38026c824b5 100644 > --- a/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll > +++ b/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll > @@ -17,30 +17,30 @@ target triple = "x86_64-unknown-linux-gnu" > ; CHECK: LV: Checking a loop in "test" > ; > ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, 
align 8 > +; SSE2: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > +; SSE2: LV: Found an estimated cost of 20 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > ; > ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > +; SSE42: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > +; SSE42: LV: Found an estimated cost of 20 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > ; > ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8 > -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load 
i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 20 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 40 for VF 32 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 20 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 40 for VF 32 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> @@ -50,8 +50,8 @@ target triple = "x86_64-unknown-linux-gnu"
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 48 for VF 32 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; AVX512: LV: Found an estimated cost of 10 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; AVX512: LV: Found an estimated cost of 24 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX512: LV: Found an estimated cost of 5 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; AVX512: LV: Found an estimated cost of 12 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ; AVX512: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ; AVX512: LV: Found an estimated cost of 20 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ; AVX512: LV: Found an estimated cost of 40 for VF 32 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll b/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll
> index d6bfdf9d3848..184e23a0128b 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll
> @@ -89,30 +89,30 @@ for.end:
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
>
> ; ENABLED_MASKED_STRIDED: LV: Checking a loop in "test2"
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 2 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 2 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 11 for VF 4 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 4 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 11 for VF 8 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 8 For instruction: %i4 = load i16, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For instruction: %i2 = load i16, i16* %arrayidx2, align 2
> @@ -164,17 +164,17 @@ for.end:
> ; DISABLED_MASKED_STRIDED: LV: Checking a loop in "test"
> ;
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
>
> ; ENABLED_MASKED_STRIDED: LV: Checking a loop in "test"
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 7 for VF 2 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 9 for VF 4 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 9 for VF 8 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 14 for VF 16 For instruction: %i4 = load i16, i16* %arrayidx6, align 2
>
> define void @test(i16* noalias nocapture %points, i16* noalias nocapture readonly %x, i16* noalias nocapture readnone %y) {
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll b/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll
> index 5f67026737fc..224dd75a4dc5 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll
> @@ -89,17 +89,17 @@ for.end:
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %0, i16* %arrayidx2, align 2
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 5 for VF 2 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 11 for VF 4 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 23 for VF 8 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 50 for VF 16 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 20 for VF 16 For instruction: store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 20 for VF 16 For instruction: store i16 %2, i16* %arrayidx7, align 2
>
> ; ENABLED_MASKED_STRIDED: LV: Checking a loop in "test2"
> ;
> @@ -107,16 +107,16 @@ for.end:
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 2 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 10 for VF 2 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 4 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 14 for VF 4 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 8 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 14 for VF 8 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 16 For instruction: store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 27 for VF 16 For instruction: store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 20 for VF 16 For instruction: store i16 %2, i16* %arrayidx7, align 2
>
> define void @test2(i16* noalias nocapture %points, i32 %numPoints, i16* noalias nocapture readonly %x, i16* noalias nocapture readonly %y) {
> entry:
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll
> index c8c3078f1625..2722a52c3d96 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll
> @@ -16,37 +16,37 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 16 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 16 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 17 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 34 for VF 32 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 17 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 34 for VF 32 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 17 for VF 16 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 34 for VF 32 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> ; AVX512: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i16, i16* %inB, align 2
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll
> index f74c9f044d0b..16c00cfc03b5 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll
> @@ -16,16 +16,16 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 11 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 22 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 11 for VF 8 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 22 for VF 16 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> ; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %valB.loaded = load i32, i32* %inB, align 4
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll
> index c5a7825348e9..1baeff242304 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll
> @@ -16,16 +16,16 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 20 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 10 for VF 8 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 20 for VF 16 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> ; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i64, i64* %inB, align 8
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll
> index fc540da58700..99d0f28a03f8 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll
> @@ -16,37 +16,37 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 11 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 23 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 11 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 23 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 16 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 33 for VF 32 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 16 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 33 for VF 32 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 32 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 4 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 8 for VF 8 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 16 for VF 16 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 33 for VF 32 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> ; AVX512: LV: Found an estimated cost of 2 for VF 2 For instruction: %valB.loaded = load i8, i8* %inB, align 1
> diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll b/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
> index bf0aba1931d1..8ce310962b48 100644
> --- a/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
> +++ b/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
> @@ -1,3 +1,4 @@
> +; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
> ; RUN: opt -loop-vectorize -scalable-vectorization=off -force-vector-width=4 -prefer-predicate-over-epilogue=predicate-dont-vectorize -S < %s | FileCheck %s
>
> ; NOTE: These tests aren't really target-specific, but it's convenient to target AArch64
> @@ -9,21 +10,43 @@ target triple = "aarch64-linux-gnu"
> ; we don't artificially create new predicated blocks for the load.
> define void @uniform_load(i32* noalias %dst, i32* noalias readonly %src, i64 %n) #0 {
> ; CHECK-LABEL: @uniform_load(
> +; CHECK-NEXT: entry:
> +; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
> +; CHECK: vector.ph:
> +; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N:%.*]], 3
> +; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4
> +; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
> +; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
> ; CHECK: vector.body:
> -; CHECK-NEXT: [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[IDX_NEXT:%.*]], %vector.body ]
> -; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[IDX]], 0
> -; CHECK-NEXT: [[LOOP_PRED:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP3]], i64 %n)
> -; CHECK-NEXT: [[LOAD_VAL:%.*]] = load i32, i32* %src, align 4
> -; CHECK-NOT: load i32, i32* %src, align 4
> -; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 [[LOAD_VAL]], i32 0
> -; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> zeroinitializer
> -; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* %dst, i64 [[TMP3]]
> -; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP6]], i32 0
> -; CHECK-NEXT: [[STORE_PTR:%.*]] = bitcast i32* [[TMP7]] to <4 x i32>*
> -; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP5]], <4 x i32>* [[STORE_PTR]], i32 4, <4 x i1> [[LOOP_PRED]])
> -; CHECK-NEXT: [[IDX_NEXT]] = add i64 [[IDX]], 4
> -; CHECK-NEXT: [[CMP:%.*]] = icmp eq i64 [[IDX_NEXT]], %n.vec
> -; CHECK-NEXT: br i1 [[CMP]], label %middle.block, label %vector.body
> +; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
> +; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
> +; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP0]], i64 [[N]])
> +; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[SRC:%.*]], align 4
> +; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP1]], i32 0
> +; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
> +; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[DST:%.*]], i64 [[TMP0]]
> +; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP2]], i32 0
> +; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to <4 x i32>*
> +; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[BROADCAST_SPLAT]], <4 x i32>* [[TMP4]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]])
> +; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
> +; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
> +; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
> +; CHECK: middle.block:
> +; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
> +; CHECK: scalar.ph:
> +; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
> +; CHECK-NEXT: br label [[FOR_BODY:%.*]]
> +; CHECK: for.body:
> +; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
> +; CHECK-NEXT: [[VAL:%.*]] = load i32, i32* [[SRC]], align 4
> +; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 [[INDVARS_IV]]
> +; CHECK-NEXT: store i32 [[VAL]], i32* [[ARRAYIDX]], align 4
> +; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
> +; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
> +; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
> +; CHECK: for.end:
> +; CHECK-NEXT: ret void
> +;
>
> entry:
> br label %for.body
> @@ -47,18 +70,108 @@ for.end: ; preds = %for.body, %entry
> ; and the original condition.
> define void @cond_uniform_load(i32* nocapture %dst, i32* nocapture readonly %src, i32* nocapture readonly %cond, i64 %n) #0 {
> ; CHECK-LABEL: @cond_uniform_load(
> +; CHECK-NEXT: entry:
> +; CHECK-NEXT: [[DST1:%.*]] = bitcast i32* [[DST:%.*]] to i8*
> +; CHECK-NEXT: [[COND3:%.*]] = bitcast i32* [[COND:%.*]] to i8*
> +; CHECK-NEXT: [[SRC6:%.*]] = bitcast i32* [[SRC:%.*]] to i8*
> +; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
> +; CHECK: vector.memcheck:
> +; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i32, i32* [[DST]], i64 [[N:%.*]]
> +; CHECK-NEXT: [[SCEVGEP2:%.*]] = bitcast i32* [[SCEVGEP]] to i8*
> +; CHECK-NEXT: [[SCEVGEP4:%.*]] = getelementptr i32, i32* [[COND]], i64 [[N]]
> +; CHECK-NEXT: [[SCEVGEP45:%.*]] = bitcast i32* [[SCEVGEP4]] to i8*
> +; CHECK-NEXT: [[SCEVGEP7:%.*]] = getelementptr i32, i32* [[SRC]], i64 1
> +; CHECK-NEXT: [[SCEVGEP78:%.*]] = bitcast i32* [[SCEVGEP7]] to i8*
> +; CHECK-NEXT: [[BOUND0:%.*]] = icmp ult i8* [[DST1]], [[SCEVGEP45]]
> +; CHECK-NEXT: [[BOUND1:%.*]] = icmp ult i8* [[COND3]], [[SCEVGEP2]]
> +; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]],
[[BOUND1]] > +; CHECK-NEXT: [[BOUND09:%.*]] = icmp ult i8* [[DST1]], [[SCEVGEP78]] > +; CHECK-NEXT: [[BOUND110:%.*]] = icmp ult i8* [[SRC6]], [[SCEVGEP2]] > +; CHECK-NEXT: [[FOUND_CONFLICT11:%.*]] = and i1 [[BOUND09]], [[BOUND110]] > +; CHECK-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[FOUND_CONFLICT]], [[FOUND_CONFLICT11]] > +; CHECK-NEXT: br i1 [[CONFLICT_RDX]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]] > ; CHECK: vector.ph: > -; CHECK: [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* %src, i32 0 > -; CHECK-NEXT: [[SRC_SPLAT:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> poison, <4 x i32> zeroinitializer > +; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], 3 > +; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4 > +; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]] > +; CHECK-NEXT: br label [[VECTOR_BODY:%.*]] > ; CHECK: vector.body: > -; CHECK-NEXT: [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[IDX_NEXT:%.*]], %vector.body ] > -; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[IDX]], 0 > -; CHECK-NEXT: [[LOOP_PRED:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP3]], i64 %n) > -; CHECK: [[COND_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{%.*}}, i32 4, <4 x i1> [[LOOP_PRED]], <4 x i32> poison) > -; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <4 x i32> [[COND_LOAD]], zeroinitializer > +; CHECK-NEXT: [[INDEX12:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT19:%.*]], [[PRED_LOAD_CONTINUE18:%.*]] ] > +; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX12]], 0 > +; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP0]], i64 [[N]]) > +; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, i32* [[COND]], i64 [[TMP0]] > +; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i32 0 > +; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>* > +; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP3]], 
i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison), !alias.scope !4 > +; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <4 x i32> [[WIDE_MASKED_LOAD]], zeroinitializer > ; CHECK-NEXT: [[TMP5:%.*]] = xor <4 x i1> [[TMP4]], <i1 true, i1 true, i1 true, i1 true> > -; CHECK-NEXT: [[MASK:%.*]] = select <4 x i1> [[LOOP_PRED]], <4 x i1> [[TMP5]], <4 x i1> zeroinitializer > -; CHECK-NEXT: call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[SRC_SPLAT]], i32 4, <4 x i1> [[MASK]], <4 x i32> undef) > +; CHECK-NEXT: [[TMP6:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i1> [[TMP5]], <4 x i1> zeroinitializer > +; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i1> [[TMP6]], i32 0 > +; CHECK-NEXT: br i1 [[TMP7]], label [[PRED_LOAD_IF:%.*]], label [[PRED_LOAD_CONTINUE:%.*]] > +; CHECK: pred.load.if: > +; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[SRC]], align 4, !alias.scope !7 > +; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> poison, i32 [[TMP8]], i32 0 > +; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE]] > +; CHECK: pred.load.continue: > +; CHECK-NEXT: [[TMP10:%.*]] = phi <4 x i32> [ poison, [[VECTOR_BODY]] ], [ [[TMP9]], [[PRED_LOAD_IF]] ] > +; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x i1> [[TMP6]], i32 1 > +; CHECK-NEXT: br i1 [[TMP11]], label [[PRED_LOAD_IF13:%.*]], label [[PRED_LOAD_CONTINUE14:%.*]] > +; CHECK: pred.load.if13: > +; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[SRC]], align 4, !alias.scope !7 > +; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP12]], i32 1 > +; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE14]] > +; CHECK: pred.load.continue14: > +; CHECK-NEXT: [[TMP14:%.*]] = phi <4 x i32> [ [[TMP10]], [[PRED_LOAD_CONTINUE]] ], [ [[TMP13]], [[PRED_LOAD_IF13]] ] > +; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i1> [[TMP6]], i32 2 > +; CHECK-NEXT: br i1 [[TMP15]], label [[PRED_LOAD_IF15:%.*]], label [[PRED_LOAD_CONTINUE16:%.*]] > +; CHECK: pred.load.if15: > +; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[SRC]], align 4, 
!alias.scope !7 > +; CHECK-NEXT: [[TMP17:%.*]] = insertelement <4 x i32> [[TMP14]], i32 [[TMP16]], i32 2 > +; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE16]] > +; CHECK: pred.load.continue16: > +; CHECK-NEXT: [[TMP18:%.*]] = phi <4 x i32> [ [[TMP14]], [[PRED_LOAD_CONTINUE14]] ], [ [[TMP17]], [[PRED_LOAD_IF15]] ] > +; CHECK-NEXT: [[TMP19:%.*]] = extractelement <4 x i1> [[TMP6]], i32 3 > +; CHECK-NEXT: br i1 [[TMP19]], label [[PRED_LOAD_IF17:%.*]], label [[PRED_LOAD_CONTINUE18]] > +; CHECK: pred.load.if17: > +; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[SRC]], align 4, !alias.scope !7 > +; CHECK-NEXT: [[TMP21:%.*]] = insertelement <4 x i32> [[TMP18]], i32 [[TMP20]], i32 3 > +; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE18]] > +; CHECK: pred.load.continue18: > +; CHECK-NEXT: [[TMP22:%.*]] = phi <4 x i32> [ [[TMP18]], [[PRED_LOAD_CONTINUE16]] ], [ [[TMP21]], [[PRED_LOAD_IF17]] ] > +; CHECK-NEXT: [[TMP23:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i1> [[TMP4]], <4 x i1> zeroinitializer > +; CHECK-NEXT: [[PREDPHI:%.*]] = select <4 x i1> [[TMP23]], <4 x i32> zeroinitializer, <4 x i32> [[TMP22]] > +; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 [[TMP0]] > +; CHECK-NEXT: [[TMP25:%.*]] = or <4 x i1> [[TMP6]], [[TMP23]] > +; CHECK-NEXT: [[TMP26:%.*]] = getelementptr inbounds i32, i32* [[TMP24]], i32 0 > +; CHECK-NEXT: [[TMP27:%.*]] = bitcast i32* [[TMP26]] to <4 x i32>* > +; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[PREDPHI]], <4 x i32>* [[TMP27]], i32 4, <4 x i1> [[TMP25]]), !alias.scope !9, !noalias !11 > +; CHECK-NEXT: [[INDEX_NEXT19]] = add i64 [[INDEX12]], 4 > +; CHECK-NEXT: [[TMP28:%.*]] = icmp eq i64 [[INDEX_NEXT19]], [[N_VEC]] > +; CHECK-NEXT: br i1 [[TMP28]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]] > +; CHECK: middle.block: > +; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]] > +; CHECK: scalar.ph: > +; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 
[[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ], [ 0, [[VECTOR_MEMCHECK]] ] > +; CHECK-NEXT: br label [[FOR_BODY:%.*]] > +; CHECK: for.body: > +; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[IF_END:%.*]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ] > +; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[COND]], i64 [[INDEX]] > +; CHECK-NEXT: [[TMP29:%.*]] = load i32, i32* [[ARRAYIDX]], align 4 > +; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[TMP29]], 0 > +; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_END]], label [[IF_THEN:%.*]] > +; CHECK: if.then: > +; CHECK-NEXT: [[TMP30:%.*]] = load i32, i32* [[SRC]], align 4 > +; CHECK-NEXT: br label [[IF_END]] > +; CHECK: if.end: > +; CHECK-NEXT: [[VAL_0:%.*]] = phi i32 [ [[TMP30]], [[IF_THEN]] ], [ 0, [[FOR_BODY]] ] > +; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 [[INDEX]] > +; CHECK-NEXT: store i32 [[VAL_0]], i32* [[ARRAYIDX1]], align 4 > +; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 1 > +; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N]] > +; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP13:![0-9]+]] > +; CHECK: for.end: > +; CHECK-NEXT: ret void > +; > entry: > br label %for.body > > diff --git a/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll b/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll > index def98e03030f..d13942e85466 100644 > --- a/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll > +++ b/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll > @@ -25,22 +25,22 @@ define void @foo1(float* noalias %in, float* noalias %out, i32* noalias %trigger > ; AVX512-NEXT: iter.check: > ; AVX512-NEXT: br label [[VECTOR_BODY:%.*]] > ; AVX512: vector.body: > -; AVX512-NEXT: [[INDEX8:%.*]] = phi i64 [ 0, [[ITER_CHECK:%.*]] ], [ [[INDEX_NEXT_3:%.*]], [[VECTOR_BODY]] ] > -; AVX512-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER:%.*]], i64 
[[INDEX8]] > +; AVX512-NEXT: [[INDEX7:%.*]] = phi i64 [ 0, [[ITER_CHECK:%.*]] ], [ [[INDEX_NEXT_3:%.*]], [[VECTOR_BODY]] ] > +; AVX512-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER:%.*]], i64 [[INDEX7]] > ; AVX512-NEXT: [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <16 x i32>* > ; AVX512-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i32>, <16 x i32>* [[TMP1]], align 4 > ; AVX512-NEXT: [[TMP2:%.*]] = icmp sgt <16 x i32> [[WIDE_LOAD]], zeroinitializer > -; AVX512-NEXT: [[TMP3:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 [[INDEX8]] > +; AVX512-NEXT: [[TMP3:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 [[INDEX7]] > ; AVX512-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to <16 x i32>* > ; AVX512-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* [[TMP4]], i32 4, <16 x i1> [[TMP2]], <16 x i32> poison) > ; AVX512-NEXT: [[TMP5:%.*]] = sext <16 x i32> [[WIDE_MASKED_LOAD]] to <16 x i64> > ; AVX512-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, float* [[IN:%.*]], <16 x i64> [[TMP5]] > ; AVX512-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <16 x float> @llvm.masked.gather.v16f32.v16p0f32(<16 x float*> [[TMP6]], i32 4, <16 x i1> [[TMP2]], <16 x float> undef) > ; AVX512-NEXT: [[TMP7:%.*]] = fadd <16 x float> [[WIDE_MASKED_GATHER]], <float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01> > -; AVX512-NEXT: [[TMP8:%.*]] = getelementptr float, float* [[OUT:%.*]], i64 [[INDEX8]] > +; AVX512-NEXT: [[TMP8:%.*]] = getelementptr float, float* [[OUT:%.*]], i64 [[INDEX7]] > ; AVX512-NEXT: [[TMP9:%.*]] = bitcast float* [[TMP8]] to <16 x float>* > ; AVX512-NEXT: call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> [[TMP7]], <16 x float>* [[TMP9]], i32 4, <16 x i1> [[TMP2]]) 
> -; AVX512-NEXT: [[INDEX_NEXT:%.*]] = or i64 [[INDEX8]], 16 > +; AVX512-NEXT: [[INDEX_NEXT:%.*]] = or i64 [[INDEX7]], 16 > ; AVX512-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[INDEX_NEXT]] > ; AVX512-NEXT: [[TMP11:%.*]] = bitcast i32* [[TMP10]] to <16 x i32>* > ; AVX512-NEXT: [[WIDE_LOAD_1:%.*]] = load <16 x i32>, <16 x i32>* [[TMP11]], align 4 > @@ -55,7 +55,7 @@ define void @foo1(float* noalias %in, float* noalias %out, i32* noalias %trigger > ; AVX512-NEXT: [[TMP18:%.*]] = getelementptr float, float* [[OUT]], i64 [[INDEX_NEXT]] > ; AVX512-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP18]] to <16 x float>* > ; AVX512-NEXT: call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> [[TMP17]], <16 x float>* [[TMP19]], i32 4, <16 x i1> [[TMP12]]) > -; AVX512-NEXT: [[INDEX_NEXT_1:%.*]] = or i64 [[INDEX8]], 32 > +; AVX512-NEXT: [[INDEX_NEXT_1:%.*]] = or i64 [[INDEX7]], 32 > ; AVX512-NEXT: [[TMP20:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[INDEX_NEXT_1]] > ; AVX512-NEXT: [[TMP21:%.*]] = bitcast i32* [[TMP20]] to <16 x i32>* > ; AVX512-NEXT: [[WIDE_LOAD_2:%.*]] = load <16 x i32>, <16 x i32>* [[TMP21]], align 4 > @@ -70,7 +70,7 @@ define void @foo1(float* noalias %in, float* noalias %out, i32* noalias %trigger > ; AVX512-NEXT: [[TMP28:%.*]] = getelementptr float, float* [[OUT]], i64 [[INDEX_NEXT_1]] > ; AVX512-NEXT: [[TMP29:%.*]] = bitcast float* [[TMP28]] to <16 x float>* > ; AVX512-NEXT: call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> [[TMP27]], <16 x float>* [[TMP29]], i32 4, <16 x i1> [[TMP22]]) > -; AVX512-NEXT: [[INDEX_NEXT_2:%.*]] = or i64 [[INDEX8]], 48 > +; AVX512-NEXT: [[INDEX_NEXT_2:%.*]] = or i64 [[INDEX7]], 48 > ; AVX512-NEXT: [[TMP30:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[INDEX_NEXT_2]] > ; AVX512-NEXT: [[TMP31:%.*]] = bitcast i32* [[TMP30]] to <16 x i32>* > ; AVX512-NEXT: [[WIDE_LOAD_3:%.*]] = load <16 x i32>, <16 x i32>* [[TMP31]], align 4 > @@ -85,7 +85,7 @@ define void 
@foo1(float* noalias %in, float* noalias %out, i32* noalias %trigger > ; AVX512-NEXT: [[TMP38:%.*]] = getelementptr float, float* [[OUT]], i64 [[INDEX_NEXT_2]] > ; AVX512-NEXT: [[TMP39:%.*]] = bitcast float* [[TMP38]] to <16 x float>* > ; AVX512-NEXT: call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> [[TMP37]], <16 x float>* [[TMP39]], i32 4, <16 x i1> [[TMP32]]) > -; AVX512-NEXT: [[INDEX_NEXT_3]] = add nuw nsw i64 [[INDEX8]], 64 > +; AVX512-NEXT: [[INDEX_NEXT_3]] = add nuw nsw i64 [[INDEX7]], 64 > ; AVX512-NEXT: [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT_3]], 4096 > ; AVX512-NEXT: br i1 [[TMP40]], label [[FOR_END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]] > ; AVX512: for.end: > @@ -95,8 +95,8 @@ define void @foo1(float* noalias %in, float* noalias %out, i32* noalias %trigger > ; FVW2-NEXT: entry: > ; FVW2-NEXT: br label [[VECTOR_BODY:%.*]] > ; FVW2: vector.body: > -; FVW2-NEXT: [[INDEX17:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ] > -; FVW2-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER:%.*]], i64 [[INDEX17]] > +; FVW2-NEXT: [[INDEX7:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[PRED_LOAD_CONTINUE27:%.*]] ] > +; FVW2-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER:%.*]], i64 [[INDEX7]] > ; FVW2-NEXT: [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <2 x i32>* > ; FVW2-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], align 4 > ; FVW2-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 2 > @@ -112,7 +112,7 @@ define void @foo1(float* noalias %in, float* noalias %out, i32* noalias %trigger > ; FVW2-NEXT: [[TMP9:%.*]] = icmp sgt <2 x i32> [[WIDE_LOAD8]], zeroinitializer > ; FVW2-NEXT: [[TMP10:%.*]] = icmp sgt <2 x i32> [[WIDE_LOAD9]], zeroinitializer > ; FVW2-NEXT: [[TMP11:%.*]] = icmp sgt <2 x i32> [[WIDE_LOAD10]], zeroinitializer > -; FVW2-NEXT: [[TMP12:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 [[INDEX17]] > +; FVW2-NEXT: 
[[TMP12:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 [[INDEX7]] > ; FVW2-NEXT: [[TMP13:%.*]] = bitcast i32* [[TMP12]] to <2 x i32>* > ; FVW2-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* [[TMP13]], i32 4, <2 x i1> [[TMP8]], <2 x i32> poison) > ; FVW2-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[TMP12]], i64 2 > @@ -128,33 +128,105 @@ define void @foo1(float* noalias %in, float* noalias %out, i32* noalias %trigger > ; FVW2-NEXT: [[TMP21:%.*]] = sext <2 x i32> [[WIDE_MASKED_LOAD11]] to <2 x i64> > ; FVW2-NEXT: [[TMP22:%.*]] = sext <2 x i32> [[WIDE_MASKED_LOAD12]] to <2 x i64> > ; FVW2-NEXT: [[TMP23:%.*]] = sext <2 x i32> [[WIDE_MASKED_LOAD13]] to <2 x i64> > -; FVW2-NEXT: [[TMP24:%.*]] = getelementptr inbounds float, float* [[IN:%.*]], <2 x i64> [[TMP20]] > -; FVW2-NEXT: [[TMP25:%.*]] = getelementptr inbounds float, float* [[IN]], <2 x i64> [[TMP21]] > -; FVW2-NEXT: [[TMP26:%.*]] = getelementptr inbounds float, float* [[IN]], <2 x i64> [[TMP22]] > -; FVW2-NEXT: [[TMP27:%.*]] = getelementptr inbounds float, float* [[IN]], <2 x i64> [[TMP23]] > -; FVW2-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <2 x float> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP24]], i32 4, <2 x i1> [[TMP8]], <2 x float> undef) > -; FVW2-NEXT: [[WIDE_MASKED_GATHER14:%.*]] = call <2 x float> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP25]], i32 4, <2 x i1> [[TMP9]], <2 x float> undef) > -; FVW2-NEXT: [[WIDE_MASKED_GATHER15:%.*]] = call <2 x float> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP26]], i32 4, <2 x i1> [[TMP10]], <2 x float> undef) > -; FVW2-NEXT: [[WIDE_MASKED_GATHER16:%.*]] = call <2 x float> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP27]], i32 4, <2 x i1> [[TMP11]], <2 x float> undef) > -; FVW2-NEXT: [[TMP28:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER]], <float 5.000000e-01, float 5.000000e-01> > -; FVW2-NEXT: [[TMP29:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER14]], <float 5.000000e-01, float 
5.000000e-01> > -; FVW2-NEXT: [[TMP30:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER15]], <float 5.000000e-01, float 5.000000e-01> > -; FVW2-NEXT: [[TMP31:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER16]], <float 5.000000e-01, float 5.000000e-01> > -; FVW2-NEXT: [[TMP32:%.*]] = getelementptr float, float* [[OUT:%.*]], i64 [[INDEX17]] > -; FVW2-NEXT: [[TMP33:%.*]] = bitcast float* [[TMP32]] to <2 x float>* > -; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP28]], <2 x float>* [[TMP33]], i32 4, <2 x i1> [[TMP8]]) > -; FVW2-NEXT: [[TMP34:%.*]] = getelementptr float, float* [[TMP32]], i64 2 > -; FVW2-NEXT: [[TMP35:%.*]] = bitcast float* [[TMP34]] to <2 x float>* > -; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP29]], <2 x float>* [[TMP35]], i32 4, <2 x i1> [[TMP9]]) > -; FVW2-NEXT: [[TMP36:%.*]] = getelementptr float, float* [[TMP32]], i64 4 > -; FVW2-NEXT: [[TMP37:%.*]] = bitcast float* [[TMP36]] to <2 x float>* > -; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP30]], <2 x float>* [[TMP37]], i32 4, <2 x i1> [[TMP10]]) > -; FVW2-NEXT: [[TMP38:%.*]] = getelementptr float, float* [[TMP32]], i64 6 > -; FVW2-NEXT: [[TMP39:%.*]] = bitcast float* [[TMP38]] to <2 x float>* > -; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP31]], <2 x float>* [[TMP39]], i32 4, <2 x i1> [[TMP11]]) > -; FVW2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX17]], 8 > -; FVW2-NEXT: [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096 > -; FVW2-NEXT: br i1 [[TMP40]], label [[FOR_END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]] > +; FVW2-NEXT: [[TMP24:%.*]] = extractelement <2 x i1> [[TMP8]], i64 0 > +; FVW2-NEXT: br i1 [[TMP24]], label [[PRED_LOAD_IF:%.*]], label [[PRED_LOAD_CONTINUE:%.*]] > +; FVW2: pred.load.if: > +; FVW2-NEXT: [[TMP25:%.*]] = extractelement <2 x i64> [[TMP20]], i64 0 > +; FVW2-NEXT: [[TMP26:%.*]] = getelementptr inbounds float, float* [[IN:%.*]], i64 [[TMP25]] > +; FVW2-NEXT: 
[[TMP27:%.*]] = load float, float* [[TMP26]], align 4 > +; FVW2-NEXT: [[TMP28:%.*]] = insertelement <2 x float> poison, float [[TMP27]], i64 0 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE]] > +; FVW2: pred.load.continue: > +; FVW2-NEXT: [[TMP29:%.*]] = phi <2 x float> [ poison, [[VECTOR_BODY]] ], [ [[TMP28]], [[PRED_LOAD_IF]] ] > +; FVW2-NEXT: [[TMP30:%.*]] = extractelement <2 x i1> [[TMP8]], i64 1 > +; FVW2-NEXT: br i1 [[TMP30]], label [[PRED_LOAD_IF14:%.*]], label [[PRED_LOAD_CONTINUE15:%.*]] > +; FVW2: pred.load.if14: > +; FVW2-NEXT: [[TMP31:%.*]] = extractelement <2 x i64> [[TMP20]], i64 1 > +; FVW2-NEXT: [[TMP32:%.*]] = getelementptr inbounds float, float* [[IN]], i64 [[TMP31]] > +; FVW2-NEXT: [[TMP33:%.*]] = load float, float* [[TMP32]], align 4 > +; FVW2-NEXT: [[TMP34:%.*]] = insertelement <2 x float> [[TMP29]], float [[TMP33]], i64 1 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE15]] > +; FVW2: pred.load.continue15: > +; FVW2-NEXT: [[TMP35:%.*]] = phi <2 x float> [ [[TMP29]], [[PRED_LOAD_CONTINUE]] ], [ [[TMP34]], [[PRED_LOAD_IF14]] ] > +; FVW2-NEXT: [[TMP36:%.*]] = extractelement <2 x i1> [[TMP9]], i64 0 > +; FVW2-NEXT: br i1 [[TMP36]], label [[PRED_LOAD_IF16:%.*]], label [[PRED_LOAD_CONTINUE17:%.*]] > +; FVW2: pred.load.if16: > +; FVW2-NEXT: [[TMP37:%.*]] = extractelement <2 x i64> [[TMP21]], i64 0 > +; FVW2-NEXT: [[TMP38:%.*]] = getelementptr inbounds float, float* [[IN]], i64 [[TMP37]] > +; FVW2-NEXT: [[TMP39:%.*]] = load float, float* [[TMP38]], align 4 > +; FVW2-NEXT: [[TMP40:%.*]] = insertelement <2 x float> poison, float [[TMP39]], i64 0 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE17]] > +; FVW2: pred.load.continue17: > +; FVW2-NEXT: [[TMP41:%.*]] = phi <2 x float> [ poison, [[PRED_LOAD_CONTINUE15]] ], [ [[TMP40]], [[PRED_LOAD_IF16]] ] > +; FVW2-NEXT: [[TMP42:%.*]] = extractelement <2 x i1> [[TMP9]], i64 1 > +; FVW2-NEXT: br i1 [[TMP42]], label [[PRED_LOAD_IF18:%.*]], label [[PRED_LOAD_CONTINUE19:%.*]] > +; FVW2: pred.load.if18: > +; FVW2-NEXT: 
[[TMP43:%.*]] = extractelement <2 x i64> [[TMP21]], i64 1 > +; FVW2-NEXT: [[TMP44:%.*]] = getelementptr inbounds float, float* [[IN]], i64 [[TMP43]] > +; FVW2-NEXT: [[TMP45:%.*]] = load float, float* [[TMP44]], align 4 > +; FVW2-NEXT: [[TMP46:%.*]] = insertelement <2 x float> [[TMP41]], float [[TMP45]], i64 1 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE19]] > +; FVW2: pred.load.continue19: > +; FVW2-NEXT: [[TMP47:%.*]] = phi <2 x float> [ [[TMP41]], [[PRED_LOAD_CONTINUE17]] ], [ [[TMP46]], [[PRED_LOAD_IF18]] ] > +; FVW2-NEXT: [[TMP48:%.*]] = extractelement <2 x i1> [[TMP10]], i64 0 > +; FVW2-NEXT: br i1 [[TMP48]], label [[PRED_LOAD_IF20:%.*]], label [[PRED_LOAD_CONTINUE21:%.*]] > +; FVW2: pred.load.if20: > +; FVW2-NEXT: [[TMP49:%.*]] = extractelement <2 x i64> [[TMP22]], i64 0 > +; FVW2-NEXT: [[TMP50:%.*]] = getelementptr inbounds float, float* [[IN]], i64 [[TMP49]] > +; FVW2-NEXT: [[TMP51:%.*]] = load float, float* [[TMP50]], align 4 > +; FVW2-NEXT: [[TMP52:%.*]] = insertelement <2 x float> poison, float [[TMP51]], i64 0 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE21]] > +; FVW2: pred.load.continue21: > +; FVW2-NEXT: [[TMP53:%.*]] = phi <2 x float> [ poison, [[PRED_LOAD_CONTINUE19]] ], [ [[TMP52]], [[PRED_LOAD_IF20]] ] > +; FVW2-NEXT: [[TMP54:%.*]] = extractelement <2 x i1> [[TMP10]], i64 1 > +; FVW2-NEXT: br i1 [[TMP54]], label [[PRED_LOAD_IF22:%.*]], label [[PRED_LOAD_CONTINUE23:%.*]] > +; FVW2: pred.load.if22: > +; FVW2-NEXT: [[TMP55:%.*]] = extractelement <2 x i64> [[TMP22]], i64 1 > +; FVW2-NEXT: [[TMP56:%.*]] = getelementptr inbounds float, float* [[IN]], i64 [[TMP55]] > +; FVW2-NEXT: [[TMP57:%.*]] = load float, float* [[TMP56]], align 4 > +; FVW2-NEXT: [[TMP58:%.*]] = insertelement <2 x float> [[TMP53]], float [[TMP57]], i64 1 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE23]] > +; FVW2: pred.load.continue23: > +; FVW2-NEXT: [[TMP59:%.*]] = phi <2 x float> [ [[TMP53]], [[PRED_LOAD_CONTINUE21]] ], [ [[TMP58]], [[PRED_LOAD_IF22]] ] > +; FVW2-NEXT: 
[[TMP60:%.*]] = extractelement <2 x i1> [[TMP11]], i64 0 > +; FVW2-NEXT: br i1 [[TMP60]], label [[PRED_LOAD_IF24:%.*]], label [[PRED_LOAD_CONTINUE25:%.*]] > +; FVW2: pred.load.if24: > +; FVW2-NEXT: [[TMP61:%.*]] = extractelement <2 x i64> [[TMP23]], i64 0 > +; FVW2-NEXT: [[TMP62:%.*]] = getelementptr inbounds float, float* [[IN]], i64 [[TMP61]] > +; FVW2-NEXT: [[TMP63:%.*]] = load float, float* [[TMP62]], align 4 > +; FVW2-NEXT: [[TMP64:%.*]] = insertelement <2 x float> poison, float [[TMP63]], i64 0 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE25]] > +; FVW2: pred.load.continue25: > +; FVW2-NEXT: [[TMP65:%.*]] = phi <2 x float> [ poison, [[PRED_LOAD_CONTINUE23]] ], [ [[TMP64]], [[PRED_LOAD_IF24]] ] > +; FVW2-NEXT: [[TMP66:%.*]] = extractelement <2 x i1> [[TMP11]], i64 1 > +; FVW2-NEXT: br i1 [[TMP66]], label [[PRED_LOAD_IF26:%.*]], label [[PRED_LOAD_CONTINUE27]] > +; FVW2: pred.load.if26: > +; FVW2-NEXT: [[TMP67:%.*]] = extractelement <2 x i64> [[TMP23]], i64 1 > +; FVW2-NEXT: [[TMP68:%.*]] = getelementptr inbounds float, float* [[IN]], i64 [[TMP67]] > +; FVW2-NEXT: [[TMP69:%.*]] = load float, float* [[TMP68]], align 4 > +; FVW2-NEXT: [[TMP70:%.*]] = insertelement <2 x float> [[TMP65]], float [[TMP69]], i64 1 > +; FVW2-NEXT: br label [[PRED_LOAD_CONTINUE27]] > +; FVW2: pred.load.continue27: > +; FVW2-NEXT: [[TMP71:%.*]] = phi <2 x float> [ [[TMP65]], [[PRED_LOAD_CONTINUE25]] ], [ [[TMP70]], [[PRED_LOAD_IF26]] ] > +; FVW2-NEXT: [[TMP72:%.*]] = fadd <2 x float> [[TMP35]], <float 5.000000e-01, float 5.000000e-01> > +; FVW2-NEXT: [[TMP73:%.*]] = fadd <2 x float> [[TMP47]], <float 5.000000e-01, float 5.000000e-01> > +; FVW2-NEXT: [[TMP74:%.*]] = fadd <2 x float> [[TMP59]], <float 5.000000e-01, float 5.000000e-01> > +; FVW2-NEXT: [[TMP75:%.*]] = fadd <2 x float> [[TMP71]], <float 5.000000e-01, float 5.000000e-01> > +; FVW2-NEXT: [[TMP76:%.*]] = getelementptr float, float* [[OUT:%.*]], i64 [[INDEX7]] > +; FVW2-NEXT: [[TMP77:%.*]] = bitcast float* [[TMP76]] to <2 x 
float>* > +; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP72]], <2 x float>* [[TMP77]], i32 4, <2 x i1> [[TMP8]]) > +; FVW2-NEXT: [[TMP78:%.*]] = getelementptr float, float* [[TMP76]], i64 2 > +; FVW2-NEXT: [[TMP79:%.*]] = bitcast float* [[TMP78]] to <2 x float>* > +; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP73]], <2 x float>* [[TMP79]], i32 4, <2 x i1> [[TMP9]]) > +; FVW2-NEXT: [[TMP80:%.*]] = getelementptr float, float* [[TMP76]], i64 4 > +; FVW2-NEXT: [[TMP81:%.*]] = bitcast float* [[TMP80]] to <2 x float>* > +; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP74]], <2 x float>* [[TMP81]], i32 4, <2 x i1> [[TMP10]]) > +; FVW2-NEXT: [[TMP82:%.*]] = getelementptr float, float* [[TMP76]], i64 6 > +; FVW2-NEXT: [[TMP83:%.*]] = bitcast float* [[TMP82]] to <2 x float>* > +; FVW2-NEXT: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> [[TMP75]], <2 x float>* [[TMP83]], i32 4, <2 x i1> [[TMP11]]) > +; FVW2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX7]], 8 > +; FVW2-NEXT: [[TMP84:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096 > +; FVW2-NEXT: br i1 [[TMP84]], label [[FOR_END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]] > ; FVW2: for.end: > ; FVW2-NEXT: ret void > ; > @@ -365,40 +437,186 @@ define void @foo2(%struct.In* noalias %in, float* noalias %out, i32* noalias %tr > ; FVW2-NEXT: entry: > ; FVW2-NEXT: br label [[VECTOR_BODY:%.*]] > ; FVW2: vector.body: > -; FVW2-NEXT: [[INDEX10:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE9:%.*]] ] > -; FVW2-NEXT: [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 0, i64 16>, [[ENTRY]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE9]] ] > -; FVW2-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX10]], 4 > +; FVW2-NEXT: [[INDEX7:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE35:%.*]] ] > +; FVW2-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX7]], 4 > ; FVW2-NEXT: [[TMP0:%.*]] = or i64 [[OFFSET_IDX]], 
16 > -; FVW2-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER:%.*]], i64 [[OFFSET_IDX]] > -; FVW2-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[TMP0]] > -; FVW2-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1]], align 4 > -; FVW2-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP2]], align 4 > -; FVW2-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> poison, i32 [[TMP3]], i64 0 > -; FVW2-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 [[TMP4]], i64 1 > -; FVW2-NEXT: [[TMP7:%.*]] = icmp sgt <2 x i32> [[TMP6]], zeroinitializer > -; FVW2-NEXT: [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_IN:%.*]], %struct.In* [[IN:%.*]], <2 x i64> [[VEC_IND]], i32 1 > -; FVW2-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <2 x float> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP8]], i32 4, <2 x i1> [[TMP7]], <2 x float> undef) > -; FVW2-NEXT: [[TMP9:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER]], <float 5.000000e-01, float 5.000000e-01> > -; FVW2-NEXT: [[TMP10:%.*]] = extractelement <2 x i1> [[TMP7]], i64 0 > -; FVW2-NEXT: br i1 [[TMP10]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]] > +; FVW2-NEXT: [[TMP1:%.*]] = or i64 [[OFFSET_IDX]], 32 > +; FVW2-NEXT: [[TMP2:%.*]] = or i64 [[OFFSET_IDX]], 48 > +; FVW2-NEXT: [[TMP3:%.*]] = or i64 [[OFFSET_IDX]], 64 > +; FVW2-NEXT: [[TMP4:%.*]] = or i64 [[OFFSET_IDX]], 80 > +; FVW2-NEXT: [[TMP5:%.*]] = or i64 [[OFFSET_IDX]], 96 > +; FVW2-NEXT: [[TMP6:%.*]] = or i64 [[OFFSET_IDX]], 112 > +; FVW2-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER:%.*]], i64 [[OFFSET_IDX]] > +; FVW2-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[TMP0]] > +; FVW2-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[TMP1]] > +; FVW2-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[TMP2]] > +; FVW2-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[TMP3]] > +; FVW2-NEXT: [[TMP12:%.*]] = getelementptr inbounds 
i32, i32* [[TRIGGER]], i64 [[TMP4]] > +; FVW2-NEXT: [[TMP13:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[TMP5]] > +; FVW2-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], i64 [[TMP6]] > +; FVW2-NEXT: [[TMP15:%.*]] = load i32, i32* [[TMP7]], align 4 > +; FVW2-NEXT: [[TMP16:%.*]] = load i32, i32* [[TMP8]], align 4 > +; FVW2-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> poison, i32 [[TMP15]], i64 0 > +; FVW2-NEXT: [[TMP18:%.*]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP16]], i64 1 > +; FVW2-NEXT: [[TMP19:%.*]] = load i32, i32* [[TMP9]], align 4 > +; FVW2-NEXT: [[TMP20:%.*]] = load i32, i32* [[TMP10]], align 4 > +; FVW2-NEXT: [[TMP21:%.*]] = insertelement <2 x i32> poison, i32 [[TMP19]], i64 0 > </cut>
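A reading aid for the `vector.ph` and `vector.body` CHECK lines in the tail-fold-uniform-memops.ll hunk quoted above: with tail folding at a fixed width of 4, the vector trip count is `n` rounded up to a multiple of 4 (`N_RND_UP`, `N_MOD_VF`, `N_VEC`), and `llvm.get.active.lane.mask` switches off the lanes that run past `n`. The sketch below mirrors that arithmetic in Python. The function names are made up for illustration and are not part of LLVM, and it ignores i64 wrap-around, which the real IR is subject to.

```python
def tail_folded_trip_count(n: int, vf: int = 4) -> int:
    """Round n up to a multiple of vf, as the vector.ph block does."""
    n_rnd_up = n + (vf - 1)      # N_RND_UP: add i64 %n, 3 when vf == 4
    n_mod_vf = n_rnd_up % vf     # N_MOD_VF: urem i64 N_RND_UP, 4
    return n_rnd_up - n_mod_vf   # N_VEC:    sub i64 N_RND_UP, N_MOD_VF


def active_lane_mask(base: int, n: int, vf: int = 4) -> list[bool]:
    """Lane i is active when base + i < n (an unsigned compare in the IR)."""
    return [base + i < n for i in range(vf)]


if __name__ == "__main__":
    n = 5
    n_vec = tail_folded_trip_count(n)
    # Walk the vector loop the way INDEX / INDEX_NEXT do, in steps of 4.
    for index in range(0, n_vec, 4):
        print(index, active_lane_mask(index, n))
```

Because the mask covers the remainder, the vector loop handles every iteration, which is consistent with the `br i1 true, label [[FOR_END]]` check in `middle.block`.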

4 years, 3 months


[ACTIVITY] week ending Feb. 6 2022

by Alex Bennée

Project Stratos
===============

  - posted [RFC PATCH] tests/qtest: attempt to enable tests for virtio-gpio (!working)
    Message-Id: <20220121151534.3654562-1-alex.bennee(a)linaro.org>
  - need to increase coverage of the QEMU boilerplate to get it merged
  - discussions on next steps with SCMI backend with Vincent (moving from the QEMU->QEMU PoC)

QEMU Upstream Work ([UM-2])
===========================

  - posted [PATCH v2 00/25] testing and plugin updates
    Message-Id: <20220201182050.15087-1-alex.bennee(a)linaro.org>
  - posted [RFC PATCH 0/4] improve coverage of vector backend
    Message-Id: <20220202191242.652607-2-alex.bennee(a)linaro.org>
  - posted [PATCH v3 00/26] testing and plugins pre-PR
    Message-Id: <20220204204335.1689602-1-alex.bennee(a)linaro.org>
  - posted [RFC PATCH] arm: force flag recalculation when messing with DAIF
    Message-Id: <20220202122353.457084-1-alex.bennee(a)linaro.org>
  - trying to track down a weird TLS bug: <https://gitlab.com/stsquad/qemu/-/jobs/2056025874#L3532>
  - on aarch64 HW, running qemu-s390x with a simple test case fails every 100/200 times
  - seems TLS memory gets made non-accessible (rw-p -> ---p, except to gdb)
  - strace doesn't show a culprit, possible kernel bug?

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Upstream MTTCG tests ([QEMU-52])

  - still waiting final review of [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
    Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>

Other
=====

  - planning and brainstorming for Linaro Tech Day

Completed Reviews [5/5]
=======================

[PATCH v4 00/42] CXl 2.0 emulation Support
Message-Id: <20220124171705.10432-1-Jonathan.Cameron(a)huawei.com>

[PATCH] gitlab: fall back to commit hash in qemu-setup filename
Message-Id: <20220125173454.10381-1-stefanha(a)redhat.com>

[PATCH for-7.0] gitlab-ci: Add cirrus-ci based tests for NetBSD and OpenBSD
Message-Id: <20211209103124.121942-1-thuth(a)redhat.com>

[PATCH 00/20] tcg: vector improvements
Message-Id: <20211218194250.247633-1-richard.henderson(a)linaro.org>

Absences
========

Current Review Queue
====================

TODO [PATCH 0/4] target/arm: SVE fixes versus VHE
Message-Id: <20220127063428.30212-1-richard.henderson(a)linaro.org>
==================================================================================================================

TODO [PATCH 00/14] arm_gicv3_its: Implement MOVI and MOVALL commands
Message-Id: <20220122182444.724087-1-peter.maydell(a)linaro.org>
==================================================================================================================================

TODO [PATCH v11 0/8] hmp,qmp: Add commands to introspect virtio devices
Message-Id: <1642678168-20447-1-git-send-email-jonah.palmer(a)oracle.com>
==============================================================================================================================================

TODO [PATCH v2 00/13] arm gicv3 ITS: Various bug fixes and refactorings
Message-Id: <20220111171048.3545974-1-peter.maydell(a)linaro.org>
======================================================================================================================================

--
Alex Bennée

4 years, 3 months


[ACTIVITY] report week ending 4 Feb

by Peter Maydell

Progress:
 * UM-2 [QEMU upstream maintainership]
   - Fixed some minor issues with the hvf accelerator and sent out a patchset
     + '-cpu max' didn't act like '-cpu host'
     + we weren't exposing PAuth to the guest
 * QEMU-420 [GICv4 emulation]
   - Sent out a patchset with more cleanups and fixes to the existing ITS code
   - The ITS parts of the GICv4 work are now code-complete; moving on to the redistributor end of things next week.

-- PMM

4 years, 3 months


[ACTIVITY] report week ending 28 Jan

by Peter Maydell

Progress:
 * UM-2 [QEMU upstream maintainership]
   - Before the QEMU 7.0 release we tried to land a bug fix which corrected the handling in our PSCI emulation of calls where the function ID is unrecognized -- these are supposed to return an error code. The bugfix turned out to cause regressions for some boards when running guest code at EL3 (because those boards were incorrectly enabling PSCI emulation in that situation). Sent a patchset that fixed those boards so we don't enable PSCI when running EL3 guests, and re-introduced the original PSCI bugfix.
   - Fixed various bugs in the highbank/midway boards discovered in the process of writing and testing the above patchset. (These two boards were the most complicated to fix.)
   - More code review, and sent out an arm pullrequest
   - Small handful of other minor patches

-- PMM
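
As a rough illustration of the PSCI behaviour mentioned in the report above (a call with an unrecognized function ID is supposed to return an error code), here is a minimal sketch in C. It is not QEMU's implementation: `psci_dispatch` is a hypothetical name, the two function IDs and the NOT_SUPPORTED value follow the Arm PSCI specification, and the version number returned is only illustrative.

```c
#include <stdint.h>

/* Return code defined by the Arm PSCI specification. */
#define PSCI_RET_NOT_SUPPORTED (-1)

/* SMC32 function IDs defined by the Arm PSCI specification. */
#define PSCI_FN_VERSION           0x84000000u
#define PSCI_FN_MIGRATE_INFO_TYPE 0x84000006u

/* Hypothetical dispatcher, for illustration only; this is not QEMU code. */
int32_t psci_dispatch(uint32_t function_id)
{
    switch (function_id) {
    case PSCI_FN_VERSION:
        /* Major version in bits [31:16], minor in [15:0]; 1.1 is illustrative. */
        return (1 << 16) | 1;
    case PSCI_FN_MIGRATE_INFO_TYPE:
        /* 2: no Trusted OS present, or it does not require migration. */
        return 2;
    default:
        /* Unrecognized function ID: report an error rather than ignore it. */
        return PSCI_RET_NOT_SUPPORTED;
    }
}
```

The point of the sketch is the `default` branch: any ID the emulation does not know falls through to an explicit error return. Whether such emulation should be active at all (for example when the guest runs at EL3) is a separate board-level decision, as the report describes.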

4 years, 3 months


tsan buildbot failure possibly due to DWARFv5 switch

by David Blaikie

Seems like my change to make Clang default to DWARFv5 might've caused a buildbot failure on your build worker here: https://lab.llvm.org/buildbot/#/builders/185/builds/1295 But I seem to be able to run this test successfully locally on my Linux machine - so I'm wondering if you can offer any help diagnosing the issue showing up on your builder/worker?

4 years, 3 months


[ACTIVITY] week ending Jan. 23 2022

by Alex Bennée

Project Stratos
===============

  - [RFC PATCH] tests/qtest: attempt to enable tests for virtio-gpio (!working)
    Message-Id: <20220121151534.3654562-1-alex.bennee(a)linaro.org>
  - trying to clear the way for merging virtio-gpio to QEMU

vhost-device maintainer effort ([UM-196])

  - reviewed vhost-device [pr7 with the vm-virtio vsock abstraction]

[UM-196] <https://linaro.atlassian.net/browse/UM-196>
[pr7 with the vm-virtio vsock abstraction] <https://github.com/stsquad/vhost-device/tree/review/pr7-with-laurat-abstrac…>

QEMU Upstream Work ([UM-2])
===========================

  - posted [PULL v2 00/31] testing/next and other misc fixes
    Message-Id: <20220118190043.1427303-1-alex.bennee(a)linaro.org>

Upstream MTTCG tests ([QEMU-52])

  - still waiting final review of [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
    Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>

Completed Reviews [2/2]
=======================

[PATCH v2 00/13] arm gicv3 ITS: Various bug fixes and refactorings
Message-Id: <20220111171048.3545974-1-peter.maydell(a)linaro.org>

[PATCH v2 0/6] qtests/libqos: Allow PCI tests to be run with virt-machine
Message-Id: <20220118203833.316741-7-eric.auger(a)redhat.com>

Absences
========

Current Review Queue
====================

TODO [PATCH v11 0/8] hmp,qmp: Add commands to introspect virtio devices
Message-Id: <1642678168-20447-1-git-send-email-jonah.palmer(a)oracle.com>
==============================================================================================================================================

TODO [PATCH v2 00/13] arm gicv3 ITS: Various bug fixes and refactorings
Message-Id: <20220111171048.3545974-1-peter.maydell(a)linaro.org>
======================================================================================================================================

TODO [PATCH v2 00/11] Atomic cleanup + clang-12 build fix
Message-Id: <20210717014121.1784956-1-richard.henderson(a)linaro.org>
============================================================================================================================

TODO [PATCH 0/7] tcg: some small towards more modular tcg
Message-Id: <20210804143826.3402872-1-kraxel(a)redhat.com>
=================================================================================================================

--
Alex Bennée

4 years, 3 months


[ACTIVITY] report week ending 21 Jan

by Peter Maydell

Progress:
 * UM-2 [QEMU upstream maintainership]
   - Sent patches for some reported bugs to do with state save/load
 * QEMU-420 [GICv4 emulation]
   - Wrote patches to implement the missing MOVALL and MOVI commands
   - Fixed a few minor bugs noticed along the way
   - Should be able to send out a patchset early next week and then can get back to the new-in-GICv4 work

-- PMM

4 years, 3 months


[TCWG CI] Regression caused by gcc: Add -Wdangling-pointer [PR63272].

by ci_notify＠linaro.org

[TCWG CI] Regression caused by gcc: Add -Wdangling-pointer [PR63272].:
commit 9d6a0f388eb048f8d87f47af78f07b5ce513bfe6
Author: Martin Sebor <msebor(a)redhat.com>

Add -Wdangling-pointer [PR63272].

Results regressed to
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1: -5
# build_abe qemu: -2
# linux_n_obj: 21324
# First few build errors in logs:
# 00:03:31 sound/core/oss/mixer_oss.c:1057:21: error: ‘slot’ is used uninitialized [-Werror=uninitialized]
# 00:03:32 sound/core/oss/pcm_oss.c:108:29: error: ‘t’ is used uninitialized [-Werror=uninitialized]
# 00:03:32 sound/core/oss/pcm_oss.c:2488:34: error: ‘setup’ is used uninitialized [-Werror=uninitialized]
# 00:03:32 sound/core/oss/pcm_oss.c:2998:51: error: ‘template’ is used uninitialized [-Werror=uninitialized]
# 00:03:35 make[3]: *** [scripts/Makefile.build:277: sound/core/oss/mixer_oss.o] Error 1
# 00:03:35 sound/core/seq/oss/seq_oss_init.c:350:35: error: ‘qinfo’ is used uninitialized [-Werror=uninitialized]
# 00:03:35 sound/core/seq/oss/seq_oss_init.c:370:35: error: ‘qinfo’ is used uninitialized [-Werror=uninitialized]
# 00:03:36 make[4]: *** [scripts/Makefile.build:277: sound/core/seq/oss/seq_oss_init.o] Error 1
# 00:03:40 make[3]: *** [scripts/Makefile.build:277: sound/core/oss/pcm_oss.o] Error 1
# 00:03:50 make[3]: *** [scripts/Makefile.build:540: sound/core/seq/oss] Error 2

from
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1: -5
# build_abe qemu: -2
# linux_n_obj: 21354

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
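
For readers unfamiliar with the option this commit introduces: the documentation quoted in the raw commit further down describes -Wdangling-pointer=2 as diagnosing conditional uses of a pointer to a local array that has gone out of scope. The following is a minimal sketch of that pattern and one possible way to avoid it. The function name `make_tmp_name` and its signature are hypothetical, chosen only for illustration; they are not part of GCC or the kernel sources, and this sketch is unrelated to the `-Werror=uninitialized` errors listed above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Pattern that the quoted documentation says -Wdangling-pointer=2 diagnoses
 * (kept as a comment only, because using the pointer is undefined behaviour):
 *
 *   if (!s) { char a[12] = "tmpname"; s = a; }   // lifetime of a ends here
 *   strcat (s, ".tmp");                          // s may dangle
 */

/*
 * Hypothetical fixed variant: the fallback buffer is declared at function
 * scope, so it is still alive wherever s is used. Writes "<s>.tmp" into out.
 * Returns 0 on success, -1 if out is too small.
 */
int make_tmp_name(const char *s, char *out, size_t outlen)
{
    char fallback[12] = "tmpname";

    if (!s)
        s = fallback;

    int n = snprintf(out, outlen, "%s.tmp", s);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

Moving the array to the enclosing scope is only one option; passing in a caller-owned buffer, as `out` does here, avoids depending on the lifetime of any local at all.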
This commit has regressed these CI configurations:
 - tcwg_kernel/gnu-master-aarch64-stable-allmodconfig

First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-stable-…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-stable-…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-stable-…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-stable-…

Reproduce builds:
<cut>
mkdir investigate-gcc-9d6a0f388eb048f8d87f47af78f07b5ce513bfe6
cd investigate-gcc-9d6a0f388eb048f8d87f47af78f07b5ce513bfe6

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-stable-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-stable-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-stable-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/

cd gcc

# Reproduce first_bad build
git checkout --detach 9d6a0f388eb048f8d87f47af78f07b5ce513bfe6
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 671a283636de75f7ed638ee6b01ed2d44361b8b6
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit 9d6a0f388eb048f8d87f47af78f07b5ce513bfe6
Author: Martin Sebor <msebor(a)redhat.com>
Date: Sat Jan 15 16:41:40 2022 -0700

Add -Wdangling-pointer [PR63272].

Resolves:
PR c/63272 - GCC should warn when using pointer to dead scoped variable with in the same function

gcc/c-family/ChangeLog:

PR c/63272
* c.opt (-Wdangling-pointer): New option.

gcc/ChangeLog:

PR c/63272
* diagnostic-spec.c (nowarn_spec_t::nowarn_spec_t): Handle -Wdangling-pointer.
* doc/invoke.texi (-Wdangling-pointer): Document new option.
* gimple-ssa-warn-access.cc (pass_waccess::clone): Set new member.
(pass_waccess::check_pointer_uses): New function.
(pass_waccess::gimple_call_return_arg): New function.
(pass_waccess::gimple_call_return_arg_ref): New function.
(pass_waccess::check_call_dangling): New function.
(pass_waccess::check_dangling_uses): New function overloads.
(pass_waccess::check_dangling_stores): New function.
(pass_waccess::check_dangling_stores): New function.
(pass_waccess::m_clobbers): New data member.
(pass_waccess::m_func): New data member.
(pass_waccess::m_run_number): New data member.
(pass_waccess::m_check_dangling_p): New data member.
(pass_waccess::check_alloca): Check m_early_checks_p.
(pass_waccess::check_alloc_size_call): Same.
(pass_waccess::check_strcat): Same.
(pass_waccess::check_strncat): Same.
(pass_waccess::check_stxcpy): Same.
(pass_waccess::check_stxncpy): Same.
(pass_waccess::check_strncmp): Same.
(pass_waccess::check_memop_access): Same.
(pass_waccess::check_read_access): Same.
(pass_waccess::check_builtin): Call check_pointer_uses.
(pass_waccess::warn_invalid_pointer): Add arguments.
(is_auto_decl): New function.
(pass_waccess::check_stmt): New function.
(pass_waccess::check_block): Call check_stmt.
(pass_waccess::execute): Call check_dangling_uses, check_dangling_stores. Empty m_clobbers.
* passes.def (pass_warn_access): Invoke pass two more times.

gcc/testsuite/ChangeLog:

PR c/63272
* g++.dg/warn/Wfree-nonheap-object-6.C: Disable valid warnings.
* g++.dg/warn/ref-temp1.C: Prune expected warning.
* gcc.dg/uninit-pr50476.c: Expect a new warning.
* c-c++-common/Wdangling-pointer-2.c: New test.
* c-c++-common/Wdangling-pointer-3.c: New test. * c-c++-common/Wdangling-pointer-4.c: New test. * c-c++-common/Wdangling-pointer-5.c: New test. * c-c++-common/Wdangling-pointer-6.c: New test. * c-c++-common/Wdangling-pointer.c: New test. * g++.dg/warn/Wdangling-pointer-2.C: New test. * g++.dg/warn/Wdangling-pointer.C: New test. * gcc.dg/Wdangling-pointer-2.c: New test. * gcc.dg/Wdangling-pointer.c: New test. --- gcc/c-family/c.opt | 8 + gcc/diagnostic-spec.c | 1 + gcc/doc/invoke.texi | 62 +- gcc/gimple-ssa-warn-access.cc | 635 +++++++++++++++++++-- gcc/passes.def | 5 +- gcc/testsuite/c-c++-common/Wdangling-pointer-2.c | 437 ++++++++++++++ gcc/testsuite/c-c++-common/Wdangling-pointer-3.c | 64 +++ gcc/testsuite/c-c++-common/Wdangling-pointer-4.c | 73 +++ gcc/testsuite/c-c++-common/Wdangling-pointer-5.c | 90 +++ gcc/testsuite/c-c++-common/Wdangling-pointer-6.c | 32 ++ gcc/testsuite/c-c++-common/Wdangling-pointer.c | 434 ++++++++++++++ gcc/testsuite/g++.dg/warn/Wdangling-pointer-2.C | 23 + gcc/testsuite/g++.dg/warn/Wdangling-pointer.C | 74 +++ gcc/testsuite/g++.dg/warn/Wfree-nonheap-object-6.C | 4 +- gcc/testsuite/g++.dg/warn/ref-temp1.C | 3 + gcc/testsuite/gcc.dg/Wdangling-pointer-2.c | 82 +++ gcc/testsuite/gcc.dg/Wdangling-pointer.c | 75 +++ gcc/testsuite/gcc.dg/uninit-pr50476.c | 2 +- 18 files changed, 2043 insertions(+), 61 deletions(-) diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt index 28363643664..db65c14a7a5 100644 --- a/gcc/c-family/c.opt +++ b/gcc/c-family/c.opt @@ -548,6 +548,14 @@ Wdangling-else C ObjC C++ ObjC++ Var(warn_dangling_else) Warning LangEnabledBy(C ObjC C++ ObjC++,Wparentheses) Warn about dangling else. +Wdangling-pointer +C ObjC C++ LTO ObjC++ Alias(Wdangling-pointer=, 2, 0) Warning +Warn for uses of pointers to auto variables whose lifetime has ended. 
+ +Wdangling-pointer= +C ObjC C++ ObjC++ Joined RejectNegative UInteger Var(warn_dangling_pointer) Warning LangEnabledBy(C ObjC C++ ObjC++,Wall, 2, 0) IntegerRange(0, 2) +Warn for uses of pointers to auto variables whose lifetime has ended. + Wdate-time C ObjC C++ ObjC++ CPP(warn_date_time) CppReason(CPP_W_DATE_TIME) Var(cpp_warn_date_time) Init(0) Warning Warn about __TIME__, __DATE__ and __TIMESTAMP__ usage. diff --git a/gcc/diagnostic-spec.c b/gcc/diagnostic-spec.c index c9e1c1be91d..a8af229d677 100644 --- a/gcc/diagnostic-spec.c +++ b/gcc/diagnostic-spec.c @@ -99,6 +99,7 @@ nowarn_spec_t::nowarn_spec_t (opt_code opt) m_bits = NW_UNINIT; break; + case OPT_Wdangling_pointer_: case OPT_Wreturn_local_addr: case OPT_Wuse_after_free_: m_bits = NW_DANGLING; diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 121c8ea827f..7f2205e4a85 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -341,7 +341,8 @@ Objective-C and Objective-C++ Dialects}. -Wchar-subscripts @gol -Wclobbered -Wcomment @gol -Wconversion -Wno-coverage-mismatch -Wno-cpp @gol --Wdangling-else -Wdate-time @gol +-Wdangling-else -Wdangling-pointer -Wdangling-pointer=@var{n} @gol +-Wdate-time @gol -Wno-deprecated -Wno-deprecated-declarations -Wno-designated-init @gol -Wdisabled-optimization @gol -Wno-discarded-array-qualifiers -Wno-discarded-qualifiers @gol @@ -4389,6 +4390,8 @@ Warn about overriding virtual functions that are not marked with the @opindex Wno-use-after-free Warn about uses of pointers to dynamically allocated objects that have been rendered indeterminate by a call to a deallocation function. +The warning is enabled at all optimization levels but may yield different +results with optimization than without. @table @gcctabopt @item -Wuse-after-free=1 @@ -5714,6 +5717,7 @@ Options} and @ref{Objective-C and Objective-C++ Dialect Options}. 
-Wcatch-value @r{(C++ and Objective-C++ only)} @gol -Wchar-subscripts @gol -Wcomment @gol +-Wdangling-pointer=2 @gol -Wduplicate-decl-specifier @r{(C and Objective-C only)} @gol -Wenum-compare @r{(in C/ObjC; this is on by default in C++)} @gol -Wformat @gol @@ -8587,6 +8591,62 @@ looks like this: This warning is enabled by @option{-Wparentheses}. +@item -Wdangling-pointer +@itemx -Wdangling-pointer=@var{n} +@opindex Wdangling-pointer +@opindex Wno-dangling-pointer +Warn about uses of pointers (or C++ references) to objects with automatic +storage duration after their lifetime has ended. This includes local +variables declared in nested blocks, compound literals and other unnamed +temporary objects. In addition, warn about storing the address of such +objects in escaped pointers. The warning is enabled at all optimization +levels but may yield different results with optimization than without. + +@table @gcctabopt +@item -Wdangling-pointer=1 +At level 1 the warning diagnoses only unconditional uses of dangling pointers. +For example +@smallexample +int f (int c1, int c2, x) +@{ + char *p = strchr ((char[])@{ c1, c2 @}, c3); + return p ? *p : 'x'; // warning: dangling pointer to a compound literal +@} +@end smallexample +In the following function the store of the address of the local variable +@code{x} in the escaped pointer @code{*p} also triggers the warning. +@smallexample +void g (int **p) +@{ + int x = 7; + *p = &x; // warning: storing the address of a local variable in *p +@} +@end smallexample + +@item -Wdangling-pointer=2 +At level 2, in addition to unconditional uses the warning also diagnoses +conditional uses of dangling pointers. + +For example, because the array @var{a} in the following function is out of +scope when the pointer @var{s} that was set to point is used, the warning +triggers at this level. 
+ +@smallexample +void f (char *s) +@{ + if (!s) + @{ + char a[12] = "tmpname"; + s = a; + @} + strcat (s, ".tmp"); // warning: dangling pointer to a may be used + ... +@} +@end smallexample +@end table + +@option{-Wdangling-pointer=2} is included in @option{-Wall}. + @item -Wdate-time @opindex Wdate-time @opindex Wno-date-time diff --git a/gcc/gimple-ssa-warn-access.cc b/gcc/gimple-ssa-warn-access.cc index 882129143a1..f639807a78a 100644 --- a/gcc/gimple-ssa-warn-access.cc +++ b/gcc/gimple-ssa-warn-access.cc @@ -2069,10 +2069,12 @@ class pass_waccess : public gimple_opt_pass ~pass_waccess (); - opt_pass *clone () { return new pass_waccess (m_ctxt); } + opt_pass *clone (); virtual bool gate (function *); + void set_pass_param (unsigned, bool); + virtual unsigned int execute (function *); private: @@ -2089,6 +2091,9 @@ private: /* Check a call to an ordinary function for invalid accesses. */ bool check_call_access (gcall *); + /* Check a non-call statement. */ + void check_stmt (gimple *); + /* Check statements in a basic block. */ void check_block (basic_block); @@ -2112,26 +2117,41 @@ private: void check_atomic_memmodel (gimple *, tree, tree, const unsigned char *); /* Check for uses of indeterminate pointers. */ - void check_pointer_uses (gimple *, tree); + void check_pointer_uses (gimple *, tree, tree = NULL_TREE, bool = false); /* Return the argument that a call returns. */ tree gimple_call_return_arg (gcall *); + tree gimple_call_return_arg_ref (gcall *); + + /* Check a call for uses of a dangling pointer arguments. */ + void check_call_dangling (gcall *); + + /* Check uses of a dangling pointer or those derived from it. 
*/ + void check_dangling_uses (tree, tree, bool = false, bool = false); + void check_dangling_uses (); + void check_dangling_stores (); + void check_dangling_stores (basic_block, hash_set<tree> &, auto_bitmap &); - void warn_invalid_pointer (tree, gimple *, gimple *, bool, bool = false); + void warn_invalid_pointer (tree, gimple *, gimple *, tree, bool, bool = false); /* Return true if use follows an invalidating statement. */ - bool use_after_inval_p (gimple *, gimple *); + bool use_after_inval_p (gimple *, gimple *, bool = false); /* A pointer_query object and its cache to store information about pointers and their targets in. */ pointer_query m_ptr_qry; pointer_query::cache_type m_var_cache; - + /* Mapping from DECLs and their clobber statements in the function. */ + hash_map<tree, gimple *> m_clobbers; /* A bit is set for each basic block whose statements have been assigned valid UIDs. */ bitmap m_bb_uids_set; /* The current function. */ function *m_func; + /* True to run checks for uses of dangling pointers. */ + bool m_check_dangling_p; + /* True to run checks early on in the optimization pipeline. */ + bool m_early_checks_p; }; /* Construct the pass. */ @@ -2140,11 +2160,22 @@ pass_waccess::pass_waccess (gcc::context *ctxt) : gimple_opt_pass (pass_data_waccess, ctxt), m_ptr_qry (NULL, &m_var_cache), m_var_cache (), + m_clobbers (), m_bb_uids_set (), - m_func () + m_func (), + m_check_dangling_p (), + m_early_checks_p () { } +/* Return a copy of the pass with RUN_NUMBER one greater than THIS. */ + +opt_pass* +pass_waccess::clone () +{ + return new pass_waccess (m_ctxt); +} + /* Release pointer_query cache. */ pass_waccess::~pass_waccess () @@ -2152,6 +2183,14 @@ pass_waccess::~pass_waccess () m_ptr_qry.flush_cache (); } +void +pass_waccess::set_pass_param (unsigned int n, bool early) +{ + gcc_assert (n == 0); + + m_early_checks_p = early; +} + /* Return true when any checks performed by the pass are enabled. 
*/ bool @@ -2340,6 +2379,9 @@ maybe_warn_alloc_args_overflow (gimple *stmt, const tree args[2], void pass_waccess::check_alloca (gcall *stmt) { + if (m_early_checks_p) + return; + if ((warn_vla_limit >= HOST_WIDE_INT_MAX && warn_alloc_size_limit < warn_vla_limit) || (warn_alloca_limit >= HOST_WIDE_INT_MAX @@ -2361,6 +2403,13 @@ pass_waccess::check_alloca (gcall *stmt) void pass_waccess::check_alloc_size_call (gcall *stmt) { + if (m_early_checks_p) + return; + + if (gimple_call_num_args (stmt) < 1) + /* Avoid invalid calls to functions without a prototype. */ + return; + tree fndecl = gimple_call_fndecl (stmt); if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) { @@ -2413,6 +2462,9 @@ pass_waccess::check_alloc_size_call (gcall *stmt) void pass_waccess::check_strcat (gcall *stmt) { + if (m_early_checks_p) + return; + if (!warn_stringop_overflow && !warn_stringop_overread) return; @@ -2438,6 +2490,9 @@ pass_waccess::check_strcat (gcall *stmt) void pass_waccess::check_strncat (gcall *stmt) { + if (m_early_checks_p) + return; + if (!warn_stringop_overflow && !warn_stringop_overread) return; @@ -2507,6 +2562,9 @@ pass_waccess::check_strncat (gcall *stmt) void pass_waccess::check_stxcpy (gcall *stmt) { + if (m_early_checks_p) + return; + tree dst = call_arg (stmt, 0); tree src = call_arg (stmt, 1); @@ -2545,7 +2603,7 @@ pass_waccess::check_stxcpy (gcall *stmt) void pass_waccess::check_stxncpy (gcall *stmt) { - if (!warn_stringop_overflow) + if (m_early_checks_p || !warn_stringop_overflow) return; tree dst = call_arg (stmt, 0); @@ -2569,7 +2627,7 @@ pass_waccess::check_stxncpy (gcall *stmt) void pass_waccess::check_strncmp (gcall *stmt) { - if (!warn_stringop_overread) + if (m_early_checks_p || !warn_stringop_overread) return; tree arg1 = call_arg (stmt, 0); @@ -2674,6 +2732,9 @@ pass_waccess::check_strncmp (gcall *stmt) void pass_waccess::check_memop_access (gimple *stmt, tree dest, tree src, tree size) { + if (m_early_checks_p) + return; + /* For functions like 
memset and memcpy that operate on raw memory try to determine the size of the largest source and destination object using type-0 Object Size regardless of the object size @@ -2695,7 +2756,7 @@ pass_waccess::check_read_access (gimple *stmt, tree src, tree bound /* = NULL_TREE */, int ost /* = 1 */) { - if (!warn_stringop_overread) + if (m_early_checks_p || !warn_stringop_overread) return; if (bound && !useless_type_conversion_p (size_type_node, TREE_TYPE (bound))) @@ -2938,7 +2999,7 @@ pass_waccess::check_atomic_memmodel (gimple *stmt, tree ord_sucs, if (warning_suppressed_p (stmt, OPT_Winvalid_memory_model)) return; - if (maybe_warn_memmodel (stmt, ord_sucs, ord_fail, valid)) + if (!maybe_warn_memmodel (stmt, ord_sucs, ord_fail, valid)) return; suppress_warning (stmt, OPT_Winvalid_memory_model); @@ -3094,11 +3155,12 @@ pass_waccess::check_builtin (gcall *stmt) case BUILT_IN_FREE: case BUILT_IN_REALLOC: - { - tree arg = call_arg (stmt, 0); - if (TREE_CODE (arg) == SSA_NAME) - check_pointer_uses (stmt, arg); - } + if (!m_early_checks_p) + { + tree arg = call_arg (stmt, 0); + if (TREE_CODE (arg) == SSA_NAME) + check_pointer_uses (stmt, arg); + } return true; case BUILT_IN_GETTEXT: @@ -3725,16 +3787,67 @@ pass_waccess::maybe_check_dealloc_call (gcall *call) /* Return true if either USE_STMT's basic block (that of a pointer's use) is dominated by INVAL_STMT's (that of a pointer's invalidating statement, - or if they're in the same block, USE_STMT follows INVAL_STMT. */ + which is either a clobber or a deallocation call), or if they're in + the same block, USE_STMT follows INVAL_STMT. */ bool -pass_waccess::use_after_inval_p (gimple *inval_stmt, gimple *use_stmt) +pass_waccess::use_after_inval_p (gimple *inval_stmt, gimple *use_stmt, + bool last_block /* = false */) { + tree clobvar = + gimple_clobber_p (inval_stmt) ? 
gimple_assign_lhs (inval_stmt) : NULL_TREE; + basic_block inval_bb = gimple_bb (inval_stmt); basic_block use_bb = gimple_bb (use_stmt); + if (!inval_bb || !use_bb) + return false; + if (inval_bb != use_bb) - return dominated_by_p (CDI_DOMINATORS, use_bb, inval_bb); + { + if (dominated_by_p (CDI_DOMINATORS, use_bb, inval_bb)) + return true; + + if (!clobvar || !last_block) + return false; + + /* Proceed only when looking for uses of dangling pointers. */ + auto gsi = gsi_for_stmt (use_stmt); + + auto_bitmap visited; + + /* A use statement in the last basic block in a function or one that + falls through to it is after any other prior clobber of the used + variable unless it's followed by a clobber of the same variable. */ + basic_block bb = use_bb; + while (bb != inval_bb + && single_succ_p (bb) + && !(single_succ_edge (bb)->flags & (EDGE_EH|EDGE_DFS_BACK))) + { + if (!bitmap_set_bit (visited, bb->index)) + /* Avoid cycles. */ + return true; + + for (; !gsi_end_p (gsi); gsi_next_nondebug (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + if (gimple_clobber_p (stmt)) + { + if (clobvar == gimple_assign_lhs (stmt)) + /* The use is followed by a clobber. */ + return false; + } + } + + bb = single_succ (bb); + gsi = gsi_start_bb (bb); + } + + /* The use is one of a dangling pointer if a clobber of the variable + [the pointer points to] has not been found before the function exit + point. */ + return bb == EXIT_BLOCK_PTR_FOR_FN (cfun); + } if (bitmap_set_bit (m_bb_uids_set, inval_bb->index)) /* The first time this basic block is visited assign increasing ids @@ -3752,27 +3865,30 @@ pass_waccess::use_after_inval_p (gimple *inval_stmt, gimple *use_stmt) return gimple_uid (inval_stmt) < gimple_uid (use_stmt); } -/* Issue a warning for the USE_STMT of pointer PTR rendered invalid - by INVAL_STMT. PTR may be null when it's been optimized away. - MAYBE is true to issue the "maybe" kind of warning. EQUALITY is - true when the pointer is used in an equality expression. 
*/ +/* Issue a warning for the USE_STMT of pointer or reference REF rendered + invalid by INVAL_STMT. REF may be null when it's been optimized away. + When nonnull, INVAL_STMT is the deallocation function that rendered + the pointer or reference dangling. Otherwise, VAR is the auto variable + (including an unnamed temporary such as a compound literal) whose + lifetime's rended it dangling. MAYBE is true to issue the "maybe" + kind of warning. EQUALITY is true when the pointer is used in + an equality expression. */ void -pass_waccess::warn_invalid_pointer (tree ptr, gimple *use_stmt, - gimple *inval_stmt, - bool maybe, - bool equality /* = false */) +pass_waccess::warn_invalid_pointer (tree ref, gimple *use_stmt, + gimple *inval_stmt, tree var, + bool maybe, bool equality /* = false */) { /* Avoid printing the unhelpful "<unknown>" in the diagnostics. */ - if (ptr && TREE_CODE (ptr) == SSA_NAME - && (!SSA_NAME_VAR (ptr) || DECL_ARTIFICIAL (SSA_NAME_VAR (ptr)))) - ptr = NULL_TREE; + if (ref && TREE_CODE (ref) == SSA_NAME + && (!SSA_NAME_VAR (ref) || DECL_ARTIFICIAL (SSA_NAME_VAR (ref)))) + ref = NULL_TREE; location_t use_loc = gimple_location (use_stmt); if (use_loc == UNKNOWN_LOCATION) { - use_loc = cfun->function_end_locus; - if (!ptr) + use_loc = m_func->function_end_locus; + if (!ref) /* Avoid issuing a warning with no context other than the function. That would make it difficult to debug in any but very simple cases. */ @@ -3788,12 +3904,12 @@ pass_waccess::warn_invalid_pointer (tree ptr, gimple *use_stmt, const tree inval_decl = gimple_call_fndecl (inval_stmt); - if ((ptr && warning_at (use_loc, OPT_Wuse_after_free, + if ((ref && warning_at (use_loc, OPT_Wuse_after_free, (maybe ? G_("pointer %qE may be used after %qD") : G_("pointer %qE used after %qD")), - ptr, inval_decl)) - || (!ptr && warning_at (use_loc, OPT_Wuse_after_free, + ref, inval_decl)) + || (!ref && warning_at (use_loc, OPT_Wuse_after_free, (maybe ? 
G_("pointer may be used after %qD") : G_("pointer used after %qD")), @@ -3805,6 +3921,52 @@ pass_waccess::warn_invalid_pointer (tree ptr, gimple *use_stmt, } return; } + + if ((maybe && warn_dangling_pointer < 2) + || warning_suppressed_p (use_stmt, OPT_Wdangling_pointer_)) + return; + + if (DECL_NAME (var)) + { + if ((ref + && warning_at (use_loc, OPT_Wdangling_pointer_, + (maybe + ? G_("dangling pointer %qE to %qD may be used") + : G_("using dangling pointer %qE to %qD")), + ref, var)) + || (!ref + && warning_at (use_loc, OPT_Wdangling_pointer_, + (maybe + ? G_("dangling pointer to %qD may be used") + : G_("using a dangling pointer to %qD")), + var))) + inform (DECL_SOURCE_LOCATION (var), + "%qD declared here", var); + suppress_warning (use_stmt, OPT_Wdangling_pointer_); + return; + } + + if ((ref + && warning_at (use_loc, OPT_Wdangling_pointer_, + (maybe + ? G_("dangling pointer %qE to an unnamed temporary " + "may be used") + : G_("using dangling pointer %qE to an unnamed " + "temporary")), + ref, var)) + || (!ref + && warning_at (use_loc, OPT_Wdangling_pointer_, + (maybe + ? G_("dangling pointer to an unnamed temporary " + "may be used") + : G_("using a dangling pointer to an unnamed " + "temporary")), + var))) + { + inform (DECL_SOURCE_LOCATION (var), + "unnamed temporary defined here"); + suppress_warning (use_stmt, OPT_Wdangling_pointer_); + } } /* If STMT is a call to either the standard realloc or to a user-defined @@ -3927,10 +4089,14 @@ pointers_related_p (gimple *stmt, tree p, tree q, pointer_query &qry) /* For a STMT either a call to a deallocation function or a clobber, warn for uses of the pointer PTR it was called with (including its copies - or others derived from it by pointer arithmetic). */ + or others derived from it by pointer arithmetic). If STMT is a clobber, + VAR is the decl of the clobbered variable. When MAYBE is true use + a "maybe" form of diagnostic. 
*/ void -pass_waccess::check_pointer_uses (gimple *stmt, tree ptr) +pass_waccess::check_pointer_uses (gimple *stmt, tree ptr, + tree var /* = NULL_TREE */, + bool maybe /* = false */) { gcc_assert (TREE_CODE (ptr) == SSA_NAME); @@ -4013,18 +4179,25 @@ pass_waccess::check_pointer_uses (gimple *stmt, tree ptr) /* Warn if USE_STMT is dominated by the deallocation STMT. Otherwise, add the pointer to POINTERS so that the uses of any other pointers derived from it can be checked. */ - if (use_after_inval_p (stmt, use_stmt)) + if (use_after_inval_p (stmt, use_stmt, check_dangling)) { - /* TODO: Handle PHIs but careful of false positives. */ - if (gimple_code (use_stmt) != GIMPLE_PHI) + if (gimple_code (use_stmt) == GIMPLE_PHI) { - basic_block use_bb = gimple_bb (use_stmt); - bool this_maybe - = !dominated_by_p (CDI_POST_DOMINATORS, use_bb, stmt_bb); - warn_invalid_pointer (*use_p->use, use_stmt, stmt, - this_maybe, equality); - continue; + tree lhs = gimple_phi_result (use_stmt); + if (TREE_CODE (lhs) == SSA_NAME) + { + pointers.safe_push (lhs); + continue; + } } + + basic_block use_bb = gimple_bb (use_stmt); + bool this_maybe + = (maybe + || !dominated_by_p (CDI_POST_DOMINATORS, use_bb, stmt_bb)); + warn_invalid_pointer (*use_p->use, use_stmt, stmt, var, + this_maybe, equality); + continue; } if (is_gimple_assign (use_stmt)) @@ -4059,26 +4232,100 @@ pass_waccess::check_call (gcall *stmt) if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) check_builtin (stmt); - if (tree callee = gimple_call_fndecl (stmt)) - { - /* Check for uses of the pointer passed to either a standard - or a user-defined deallocation function. 
*/ - unsigned argno = fndecl_dealloc_argno (callee); - if (argno < (unsigned) call_nargs (stmt)) - { - tree arg = call_arg (stmt, argno); - if (TREE_CODE (arg) == SSA_NAME) - check_pointer_uses (stmt, arg); - } - } + if (!m_early_checks_p) + if (tree callee = gimple_call_fndecl (stmt)) + { + /* Check for uses of the pointer passed to either a standard + or a user-defined deallocation function. */ + unsigned argno = fndecl_dealloc_argno (callee); + if (argno < (unsigned) call_nargs (stmt)) + { + tree arg = call_arg (stmt, argno); + if (TREE_CODE (arg) == SSA_NAME) + check_pointer_uses (stmt, arg); + } + } check_call_access (stmt); + check_call_dangling (stmt); + + if (m_early_checks_p) + return; maybe_check_dealloc_call (stmt); check_nonstring_args (stmt); } +/* Return true of X is a DECL with automatic storage duration. */ + +static inline bool +is_auto_decl (tree x) +{ + return DECL_P (x) && !DECL_EXTERNAL (x) && !TREE_STATIC (x); +} + +/* Check non-call STMT for invalid accesses. */ + +void +pass_waccess::check_stmt (gimple *stmt) +{ + if (m_check_dangling_p && gimple_clobber_p (stmt)) + { + /* Ignore clobber statemts in blocks with exceptional edges. */ + basic_block bb = gimple_bb (stmt); + edge e = EDGE_PRED (bb, 0); + if (e->flags & EDGE_EH) + return; + + tree var = gimple_assign_lhs (stmt); + m_clobbers.put (var, stmt); + return; + } + + if (is_gimple_assign (stmt)) + { + /* Clobbered unnamed temporaries such as compound literals can be + revived. Check for an assignment to one and remove it from + M_CLOBBERS. */ + tree lhs = gimple_assign_lhs (stmt); + while (handled_component_p (lhs)) + lhs = TREE_OPERAND (lhs, 0); + + if (is_auto_decl (lhs)) + m_clobbers.remove (lhs); + return; + } + + if (greturn *ret = dyn_cast <greturn *> (stmt)) + { + if (optimize && flag_isolate_erroneous_paths_dereference) + /* Avoid interfering with -Wreturn-local-addr (which runs only + with optimization enabled). 
*/ + return; + + tree arg = gimple_return_retval (ret); + if (!arg || TREE_CODE (arg) != ADDR_EXPR) + return; + + arg = TREE_OPERAND (arg, 0); + while (handled_component_p (arg)) + arg = TREE_OPERAND (arg, 0); + + if (!is_auto_decl (arg)) + return; + + gimple **pclobber = m_clobbers.get (arg); + if (!pclobber) + return; + + if (!use_after_inval_p (*pclobber, stmt)) + return; + + warn_invalid_pointer (NULL_TREE, stmt, *pclobber, arg, false); + } +} + /* Check basic block BB for invalid accesses. */ void @@ -4091,6 +4338,8 @@ pass_waccess::check_block (basic_block bb) gimple *stmt = gsi_stmt (si); if (gcall *call = dyn_cast <gcall *> (stmt)) check_call (call); + else + check_stmt (stmt); } } @@ -4139,6 +4388,262 @@ pass_waccess::gimple_call_return_arg (gcall *call) return gimple_call_arg (call, argno); } +/* Return the decl referenced by the argument that the call STMT to + a built-in function returns (including with an offset) or null if + it doesn't. */ + +tree +pass_waccess::gimple_call_return_arg_ref (gcall *call) +{ + if (tree arg = gimple_call_return_arg (call)) + { + access_ref aref; + if (m_ptr_qry.get_ref (arg, call, &aref, 0) + && DECL_P (aref.ref)) + return aref.ref; + } + + return NULL_TREE; +} + +/* Check for and diagnose all uses of the dangling pointer VAR to the auto + object DECL whose lifetime has ended. OBJREF is true when VAR denotes + an access to a DECL that may have been clobbered. 
*/ + +void +pass_waccess::check_dangling_uses (tree var, tree decl, bool maybe /* = false */, + bool objref /* = false */) +{ + if (!decl || !is_auto_decl (decl)) + return; + + gimple **pclob = m_clobbers.get (decl); + if (!pclob) + return; + + if (!objref) + { + check_pointer_uses (*pclob, var, decl, maybe); + return; + } + + gimple *use_stmt = SSA_NAME_DEF_STMT (var); + if (!use_after_inval_p (*pclob, use_stmt, true)) + return; + + basic_block use_bb = gimple_bb (use_stmt); + basic_block clob_bb = gimple_bb (*pclob); + maybe = maybe || !dominated_by_p (CDI_POST_DOMINATORS, use_bb, clob_bb); + warn_invalid_pointer (var, use_stmt, *pclob, decl, maybe, false); +} + +/* Diagnose stores in BB and (recursively) its predecessors of the addresses + of local variables into nonlocal pointers that are left dangling after + the function returns. BBS is a bitmap of basic blocks visited. */ + +void +pass_waccess::check_dangling_stores (basic_block bb, + hash_set<tree> &stores, + auto_bitmap &bbs) +{ + if (!bitmap_set_bit (bbs, bb->index)) + /* Avoid cycles. */ + return; + + /* Iterate backwards over the statements looking for a store of + the address of a local variable into a nonlocal pointer. */ + for (auto gsi = gsi_last_nondebug_bb (bb); ; gsi_prev_nondebug (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + if (!stmt) + break; + + if (is_gimple_call (stmt) + && !(gimple_call_flags (stmt) & (ECF_CONST | ECF_PURE))) + /* Avoid looking before nonconst, nonpure calls since those might + use the escaped locals. 
*/ + return; + + if (!is_gimple_assign (stmt) || gimple_clobber_p (stmt)) + continue; + + access_ref lhs_ref; + tree lhs = gimple_assign_lhs (stmt); + if (!m_ptr_qry.get_ref (lhs, stmt, &lhs_ref, 0)) + continue; + + if (is_auto_decl (lhs_ref.ref)) + continue; + + if (DECL_P (lhs_ref.ref)) + { + if (!POINTER_TYPE_P (TREE_TYPE (lhs_ref.ref)) + || lhs_ref.deref > 0) + continue; + } + else if (TREE_CODE (lhs_ref.ref) == SSA_NAME) + { + /* Avoid looking at or before stores into unknown objects. */ + gimple *def_stmt = SSA_NAME_DEF_STMT (lhs_ref.ref); + if (!gimple_nop_p (def_stmt)) + return; + } + else if (TREE_CODE (lhs_ref.ref) == MEM_REF) + { + tree arg = TREE_OPERAND (lhs_ref.ref, 0); + if (TREE_CODE (arg) == SSA_NAME) + { + gimple *def_stmt = SSA_NAME_DEF_STMT (arg); + if (!gimple_nop_p (def_stmt)) + return; + } + } + else + continue; + + if (stores.add (lhs_ref.ref)) + continue; + + /* FIXME: Handle stores of alloca() and VLA. */ + access_ref rhs_ref; + tree rhs = gimple_assign_rhs1 (stmt); + if (!m_ptr_qry.get_ref (rhs, stmt, &rhs_ref, 0) + || rhs_ref.deref != -1) + continue; + + if (!is_auto_decl (rhs_ref.ref)) + continue; + + location_t loc = gimple_location (stmt); + if (warning_at (loc, OPT_Wdangling_pointer_, + "storing the address of local variable %qD in %qE", + rhs_ref.ref, lhs)) + { + location_t loc = DECL_SOURCE_LOCATION (rhs_ref.ref); + inform (loc, "%qD declared here", rhs_ref.ref); + + if (DECL_P (lhs_ref.ref)) + loc = DECL_SOURCE_LOCATION (lhs_ref.ref); + else if (EXPR_HAS_LOCATION (lhs_ref.ref)) + loc = EXPR_LOCATION (lhs_ref.ref); + + if (loc != UNKNOWN_LOCATION) + inform (loc, "%qE declared here", lhs_ref.ref); + } + } + + edge e; + edge_iterator ei; + FOR_EACH_EDGE (e, ei, bb->preds) + { + basic_block pred = e->src; + check_dangling_stores (pred, stores, bbs); + } +} + +/* Diagnose stores of the addresses of local variables into nonlocal + pointers that are left dangling after the function returns. 
*/ + +void +pass_waccess::check_dangling_stores () +{ + auto_bitmap bbs; + hash_set<tree> stores; + check_dangling_stores (EXIT_BLOCK_PTR_FOR_FN (m_func), stores, bbs); +} + +/* Check for and diagnose uses of dangling pointers to auto objects + whose lifetime has ended. */ + +void +pass_waccess::check_dangling_uses () +{ + tree var; + unsigned i; + FOR_EACH_SSA_NAME (i, var, m_func) + { + /* For each SSA_NAME pointer VAR find the DECL it points to. + If the DECL is a clobbered local variable, check to see + if any of VAR's uses (or those of other pointers derived + from VAR) happens after the clobber. If so, warn. */ + tree decl = NULL_TREE; + + gimple *def_stmt = SSA_NAME_DEF_STMT (var); + if (is_gimple_assign (def_stmt)) + { + tree rhs = gimple_assign_rhs1 (def_stmt); + if (TREE_CODE (rhs) == ADDR_EXPR) + { + if (!POINTER_TYPE_P (TREE_TYPE (var))) + continue; + decl = TREE_OPERAND (rhs, 0); + } + else + { + /* For other expressions, check the base DECL to see + if it's been clobbered, most likely as a result of </cut>
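For readers unfamiliar with the diagnostics this diff adds, the following is a minimal sketch of the store pattern that `check_dangling_stores` looks for: the address of a local variable written into a nonlocal pointer that is still reachable after the function returns (the diff's message for this is "storing the address of local variable %qD in %qE"). The struct, the function names, and the `SHOW_DANGLING` macro are hypothetical and exist only for illustration. Whether a particular GCC build actually warns depends on its version and flags, so the flawed variant is compiled only on request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct session {
    const char *name;   /* nonlocal pointer that outlives the setters below */
};

#ifdef SHOW_DANGLING
/* Flawed variant (hypothetical): LOCAL ceases to exist when the function
   returns, so S->name is left dangling.  This is the kind of store the
   diff's "storing the address of local variable" warning targets.  */
void session_set_default_bad(struct session *s)
{
    char local[16] = "default";
    s->name = local;
}
#endif

/* Safer variant: the caller owns STORAGE, so the pointer saved in S stays
   valid for as long as the caller keeps the buffer alive.  Returns false
   when STORAGE is too small to hold the default name.  */
bool session_set_default(struct session *s, char *storage, size_t len)
{
    static const char default_name[] = "default";

    if (len < sizeof default_name)
        return false;

    memcpy(storage, default_name, sizeof default_name);  /* copies the NUL too */
    s->name = storage;
    return true;
}
```

Building with `-DSHOW_DANGLING` and the `-Wdangling-pointer` option on a compiler that includes this change is one way to see the diagnostic; the safer variant avoids the problem by never taking the address of its own locals.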

4 years, 3 months


[ACTIVITY] week ending Jan. 16 2022

by Alex Bennée

Project Stratos
===============

- reviewed Peter's virtio-video patches for QEMU

[PR to clean up some typos in EDK2] <https://github.com/tianocore/edk2-platforms/pull/34>

vhost-device maintainer effort ([UM-196])

- started reviewing https://github.com/rust-vmm/vhost-device/pull/7
- looking pretty good, see how https://github.com/rust-vmm/vm-virtio/commit/463dd20552fc32139bbbb56e9152df… would work with it

[UM-196] <https://linaro.atlassian.net/browse/UM-196>

QEMU Upstream Work ([UM-2])
===========================

- posted [RFC PATCH 0/6] Basic skeleton of RP2040 Raspbery Pi Pico
  Message-Id: <20220110175104.2908956-1-alex.bennee(a)linaro.org>
- posted [PATCH v1 00/34] testing/next and other misc fixes
  Message-Id: <20220105135009.1584676-1-alex.bennee(a)linaro.org>
- and the eventual [PULL 00/31] testing/next and other misc fixes
  Message-Id: <20220112112722.3641051-1-alex.bennee(a)linaro.org>
- and the inevitable fixup [RFC PATCH] linux-user: expand reserved brk space for 64bit guests
  Message-Id: <20220113165550.4184455-1-alex.bennee(a)linaro.org>

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Upstream MTTCG tests ([QEMU-52])

- still waiting final review of [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
  Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>

Completed Reviews [6/6]
=======================

[PATCH] tests/docker: Add gentoo-loongarch64-cross image and run cross builds in GitLab
  Message-Id: <20211229062204.3726981-1-git(a)xen0n.name>
[PATCH 0/2] tests/tcg: Fix float_{convs,madds}
  Message-Id: <20211224035541.2159966-1-richard.henderson(a)linaro.org>
[PATCH v5 00/18] tests/docker: start using libvirt-ci's "lcitool" for dockerfiles
  Message-Id: <20211215141949.3512719-1-berrange(a)redhat.com>
[PATCH] tests/tcg: Unconditionally use 90 second timeout
  Message-Id: <20211230235424.49155-1-richard.henderson(a)linaro.org>
[PATCH] gitlab-ci: Speed up the msys2-64bit job by using --without-default-devices
  Message-Id: <20211216082253.43899-1-thuth(a)redhat.com>
[PATCH 0/8] virtio: Add vhost-user based Video decode
  Message-Id: <20211209145601.331477-1-peter.griffin(a)linaro.org>

Absences
========

Current Review Queue
====================

TODO [PATCH v2 00/11] Atomic cleanup + clang-12 build fix
  Message-Id: <20210717014121.1784956-1-richard.henderson(a)linaro.org>
TODO [PATCH 0/7] tcg: some small towards more modular tcg
  Message-Id: <20210804143826.3402872-1-kraxel(a)redhat.com>
TODO [PATCH 0/6] Introduce CanoKey QEMU
  Message-Id: <YcSupUSXWDXOAkas@Sun>
TODO [PATCH] target/arm: Add missing FEAT_TLBIOS instructions
  Message-Id: <20211231103928.1455657-1-idan.horowitz(a)gmail.com>

-- Alex Bennée

4 years, 3 months


[ACTIVITY] week ending Jan. 16 2022

by Alex Bennée

Project Stratos
===============

- reviewed Peter's virtio-video patches for QEMU

[PR to clean up some typos in EDK2] <https://github.com/tianocore/edk2-platforms/pull/34>

vhost-device maintainer effort ([UM-196])

- started reviewing https://github.com/rust-vmm/vhost-device/pull/7
- looking pretty good, see how https://github.com/rust-vmm/vm-virtio/commit/463dd20552fc32139bbbb56e9152df… would work with it

[UM-196] <https://linaro.atlassian.net/browse/UM-196>

QEMU Upstream Work ([UM-2])
===========================

- posted [RFC PATCH 0/6] Basic skeleton of RP2040 Raspbery Pi Pico
  Message-Id: <20220110175104.2908956-1-alex.bennee(a)linaro.org>
- posted [PATCH v1 00/34] testing/next and other misc fixes
  Message-Id: <20220105135009.1584676-1-alex.bennee(a)linaro.org>
- and the eventual [PULL 00/31] testing/next and other misc fixes
  Message-Id: <20220112112722.3641051-1-alex.bennee(a)linaro.org>
- and the inevitable fixup [RFC PATCH] linux-user: expand reserved brk space for 64bit guests
  Message-Id: <20220113165550.4184455-1-alex.bennee(a)linaro.org>

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Upstream MTTCG tests ([QEMU-52])

- still waiting final review of [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
  Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>

Completed Reviews [6/6]
=======================

[PATCH] tests/docker: Add gentoo-loongarch64-cross image and run cross builds in GitLab
  Message-Id: <20211229062204.3726981-1-git(a)xen0n.name>
[PATCH 0/2] tests/tcg: Fix float_{convs,madds}
  Message-Id: <20211224035541.2159966-1-richard.henderson(a)linaro.org>
[PATCH v5 00/18] tests/docker: start using libvirt-ci's "lcitool" for dockerfiles
  Message-Id: <20211215141949.3512719-1-berrange(a)redhat.com>
[PATCH] tests/tcg: Unconditionally use 90 second timeout
  Message-Id: <20211230235424.49155-1-richard.henderson(a)linaro.org>
[PATCH] gitlab-ci: Speed up the msys2-64bit job by using --without-default-devices
  Message-Id: <20211216082253.43899-1-thuth(a)redhat.com>
[PATCH 0/8] virtio: Add vhost-user based Video decode
  Message-Id: <20211209145601.331477-1-peter.griffin(a)linaro.org>

Absences
========

Current Review Queue
====================

TODO [PATCH 0/6] Introduce CanoKey QEMU
  Message-Id: <YcSupUSXWDXOAkas@Sun>
TODO [PATCH] target/arm: Add missing FEAT_TLBIOS instructions
  Message-Id: <20211231103928.1455657-1-idan.horowitz(a)gmail.com>
TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64
  Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>

-- Alex Bennée

4 years, 3 months


[ACTIVITY] report week ending 14 Jan

by Peter Maydell

Progress:

* UM-2 [QEMU upstream maintainership]
  - Most of this week was spent on continuing to work through my code-review queue :-/
  - Sent a few minor cleanup patches for linux-user nits I noticed while reading the code as part of reviewing a big bsd-user patchset
* QEMU-420 [GICv4 emulation]
  - got some reviewed ITS cleanup patches upstream
  - rerolled and sent v2 patchset for the rest of the cleanup patches
  - got back up to speed with where I left my GICv4 ITS patches before Christmas, and dealt with some minor loose ends I'd left in the last patch or two I was working on.

-- PMM

4 years, 3 months


On holiday through 22 Jan

by Richard Henderson

Hi Peter,

Welcome back, hope you had a good Christmas break.

I'm off on holiday myself for the next two weeks, so this would be an ideal time to pass back merge control to you. The board is mostly green now, with occasional allowed failures for centos-stream and freebsd for upstream package manager failures.

See y'all in a couple of weeks.

r~

4 years, 4 months


[ACTIVITY] week ending 9 Jan 2022

by Richard Henderson

[UM-2]

* Re-greening of gitlab-ci.
  - There are continuing issues with cross-i386-tci. Occasionally I see *really* long test times:
      https://gitlab.com/qemu-project/qemu/-/jobs/1941996332
    with qtest-aarch64/qom-test taking 1738s, or 28 of the 60 minute budget. More often it's merely slow:
      https://gitlab.com/qemu-project/qemu/-/jobs/1954634840
    with qtest-aarch64/qom-test taking 538s. Note that locally this test runs in about 100s, and I have been unable to determine why it runs so much slower on gitlab.
  - Worked on a ppc64-softmmu slowdown leading to timeouts.
  - Fixes for meson regressions affecting testing.
* Refresh tcg unaligned user patch sets.

r~

4 years, 4 months


[ACTIVITY] report week ending 7 Jan

by Peter Maydell

Progress (short week, 2 days):

* UM-2 [QEMU upstream maintainership]
  - Catching up with email and codereview backlog from 3 weeks holiday :-) (Have got the codereview queue down to less than a dozen things so should be able to do some more GICv4 development next week.)

-- PMM

4 years, 4 months


[ACTIVITY] week ending Dec. 19 2021

by Alex Bennée

Project Stratos
===============

- got Xen working on the MacchiatoBin
- posted Configuring the host GIC for guest to guest IPI
  Message-Id: <87fsqwn2sd.fsf(a)linaro.org>

QEMU Upstream Work ([UM-2])
===========================

- posted [RFC PATCH] linux-user: don't adjust base of found hole
  Message-Id: <20211216144442.2270605-1-alex.bennee(a)linaro.org>
- posted [PATCH] hw/arm: add control knob to disable kaslr_seed via DTB
  Message-Id: <20211215120926.1696302-1-alex.bennee(a)linaro.org>

Completed Reviews [3/3]
=======================

[PATCH 00/26] arm gicv3 ITS: Various bug fixes and refactorings
  Message-Id: <20211211191135.1764649-1-peter.maydell(a)linaro.org>
[PATCH for-7.0 0/6] target/arm: Implement LVA, LPA, LPA2 features
  Message-Id: <20211208231154.392029-1-richard.henderson(a)linaro.org>
[PATCH-for-6.2? v2 0/5] docs/devel/style: Improve rST rendering
  Message-Id: <20211118145716.4116731-1-philmd(a)redhat.com>

Absences
========

Off for holidays, back in the new year. Merry Christmas everyone!

-- Alex Bennée

4 years, 4 months


[TCWG CI] Regression caused by gcc: tree-object-size: Use trees and support negative offsets

by ci_notify＠linaro.org

[TCWG CI] Regression caused by gcc: tree-object-size: Use trees and support negative offsets:
commit 422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5
Author: Siddhesh Poyarekar <siddhesh(a)gotplt.org>

    tree-object-size: Use trees and support negative offsets

Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
550
# First few build errors in logs:
# 00:01:37 ./include/linux/thread_info.h:213:25: error: call to ‘__bad_copy_to’ declared with attribute error: copy destination size is too small
# 00:01:37 ./include/linux/thread_info.h:213:25: error: call to ‘__bad_copy_to’ declared with attribute error: copy destination size is too small
# 00:01:37 make[1]: *** [scripts/Makefile.build:287: fs/io_uring.o] Error 1
# 00:01:37 make: *** [Makefile:1846: fs] Error 2

from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
567
# linux build successful:
all

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
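The `__bad_copy_to` error in the log comes from a size check in the kernel headers that depends on the object size the compiler reports for the copy destination. As a rough illustration only (this is not the kernel's code), the sketch below shows the general shape of such a check using `__builtin_object_size`, the GCC/Clang builtin that the commit's testsuite also exercises. The names `copy_fits`, `CHECKED_COPY`, `SIZE_UNKNOWN`, and `demo_checked_copy` are hypothetical, and the sketch rejects the copy at run time instead of failing the build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Value __builtin_object_size yields in modes 0 and 1 when the compiler
   cannot determine the size.  */
#define SIZE_UNKNOWN ((size_t)-1)

/* Hypothetical helper: true when copying BYTES into a destination whose
   reported size is DEST_SIZE cannot overflow it, or when the size is
   unknown and nothing can be concluded.  */
bool copy_fits(size_t dest_size, size_t bytes)
{
    return dest_size == SIZE_UNKNOWN || bytes <= dest_size;
}

/* Hypothetical macro: ask the compiler for the size of DST and copy only
   if the check passes.  Evaluates to true when the copy was performed.
   It is a macro so that the builtin sees the caller's own expression.  */
#define CHECKED_COPY(dst, src, n)                         \
    (copy_fits(__builtin_object_size((dst), 0), (n))      \
         ? (memcpy((dst), (src), (n)), true)              \
         : false)

/* Copy N bytes from SRC into a local 8-byte buffer and report whether the
   copy was allowed.  */
bool demo_checked_copy(const char *src, size_t n)
{
    char buf[8];
    return CHECKED_COPY(buf, src, n);
}
```

In the kernel the failing branch is, as the error text indicates, a call to a function declared with an error attribute, so the build stops whenever the compiler cannot prove that branch dead. A change in the sizes the compiler computes can therefore surface as a build failure like the one logged above.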
This commit has regressed these CI configurations:
 - tcwg_kernel/gnu-master-arm-mainline-allnoconfig

First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al…

Reproduce builds:
<cut>
mkdir investigate-gcc-422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5
cd investigate-gcc-422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/

cd gcc

# Reproduce first_bad build
git checkout --detach 422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 871504b0dd5cd023d3a28cf9e5ccbda75928b102
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit 422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5
Author: Siddhesh Poyarekar <siddhesh(a)gotplt.org>
Date:   Fri Dec 17 07:07:18 2021 +0530

    tree-object-size: Use trees and support negative offsets

    Transform tree-object-size to operate on tree objects instead of host
    wide integers. This makes it easier to extend to dynamic expressions
    for object sizes.

    The compute_builtin_object_size interface also now returns a tree
    expression instead of HOST_WIDE_INT, so callers have been adjusted to
    account for that.

    The trees in object_sizes are each an object_size object with members
    size (the bytes from the pointer to the end of the object) and
    wholesize (the size of the whole object). This allows analysis of
    negative offsets, which can now be allowed to the extent of the object
    bounds. Tests have been added to verify that it actually works.

    gcc/ChangeLog:

            * tree-object-size.h (compute_builtin_object_size): Return tree
            instead of HOST_WIDE_INT.
            * builtins.c (fold_builtin_object_size): Adjust.
            * gimple-fold.c (gimple_fold_builtin_strncat): Likewise.
            * ubsan.c (instrument_object_size): Likewise.
            * tree-object-size.c (object_size): New structure.
            (object_sizes): Change type to vec<object_size>.
            (initval): New function.
            (unknown): Use it.
            (size_unknown_p, size_initval, size_unknown): New functions.
            (object_sizes_unknown_p): Use it.
            (object_sizes_get): Return tree.
            (object_sizes_initialize): Rename from object_sizes_set_force
            and set VAL parameter type as tree. Add new parameter WHOLEVAL.
            (object_sizes_set): Set VAL parameter type as tree and adjust
            implementation. Add new parameter WHOLEVAL.
            (size_for_offset): New function.
            (decl_init_size): Adjust comment.
            (addr_object_size): Change PSIZE parameter to tree and adjust
            implementation. Add new parameter PWHOLESIZE.
            (alloc_object_size): Return tree.
            (compute_builtin_object_size): Return tree in PSIZE.
(expr_object_size, call_object_size, unknown_object_size): Adjust for object_sizes_set change. (merge_object_sizes): Drop OFFSET parameter and adjust implementation for tree change. (plus_stmt_object_size): Call collect_object_sizes_for directly instead of merge_object_size and call size_for_offset to get net size. (cond_expr_object_size, collect_object_sizes_for, object_sizes_execute): Adjust for change of type from HOST_WIDE_INT to tree. (check_for_plus_in_loops_1): Likewise and skip non-positive offsets. gcc/testsuite/ChangeLog: * gcc.dg/builtin-object-size-1.c (test9): New test. (main): Call it. * gcc.dg/builtin-object-size-2.c (test8): New test. (main): Call it. * gcc.dg/builtin-object-size-3.c (test9): New test. (main): Call it. * gcc.dg/builtin-object-size-4.c (test8): New test. (main): Call it. * gcc.dg/builtin-object-size-5.c (test5, test6, test7): New tests. Signed-off-by: Siddhesh Poyarekar <siddhesh(a)gotplt.org> --- gcc/builtins.c | 10 +- gcc/gimple-fold.c | 11 +- gcc/testsuite/gcc.dg/builtin-object-size-1.c | 30 ++ gcc/testsuite/gcc.dg/builtin-object-size-2.c | 30 ++ gcc/testsuite/gcc.dg/builtin-object-size-3.c | 31 +++ gcc/testsuite/gcc.dg/builtin-object-size-4.c | 30 ++ gcc/testsuite/gcc.dg/builtin-object-size-5.c | 25 ++ gcc/tree-object-size.c | 394 +++++++++++++++++---------- gcc/tree-object-size.h | 2 +- gcc/ubsan.c | 5 +- 10 files changed, 409 insertions(+), 159 deletions(-) diff --git a/gcc/builtins.c b/gcc/builtins.c index cd8947b4de2..abe342e111d 100644 --- a/gcc/builtins.c +++ b/gcc/builtins.c @@ -10255,7 +10255,7 @@ maybe_emit_sprintf_chk_warning (tree exp, enum built_in_function fcode) static tree fold_builtin_object_size (tree ptr, tree ost) { - unsigned HOST_WIDE_INT bytes; + tree bytes; int object_size_type; if (!validate_arg (ptr, POINTER_TYPE) @@ -10280,8 +10280,8 @@ fold_builtin_object_size (tree ptr, tree ost) if (TREE_CODE (ptr) == ADDR_EXPR) { compute_builtin_object_size (ptr, object_size_type, &bytes); - if (wi::fits_to_tree_p 
(bytes, size_type_node)) - return build_int_cstu (size_type_node, bytes); + if (int_fits_type_p (bytes, size_type_node)) + return fold_convert (size_type_node, bytes); } else if (TREE_CODE (ptr) == SSA_NAME) { @@ -10289,8 +10289,8 @@ fold_builtin_object_size (tree ptr, tree ost) later. Maybe subsequent passes will help determining it. */ if (compute_builtin_object_size (ptr, object_size_type, &bytes) - && wi::fits_to_tree_p (bytes, size_type_node)) - return build_int_cstu (size_type_node, bytes); + && int_fits_type_p (bytes, size_type_node)) + return fold_convert (size_type_node, bytes); } return NULL_TREE; diff --git a/gcc/gimple-fold.c b/gcc/gimple-fold.c index 1d8fd74f72c..64515aabc04 100644 --- a/gcc/gimple-fold.c +++ b/gcc/gimple-fold.c @@ -2493,17 +2493,16 @@ gimple_fold_builtin_strncat (gimple_stmt_iterator *gsi) if (!src_len || known_lower (stmt, len, src_len, true)) return false; - unsigned HOST_WIDE_INT dstsize; - bool found_dstsize = compute_builtin_object_size (dst, 1, &dstsize); - /* Warn on constant LEN. */ if (TREE_CODE (len) == INTEGER_CST) { bool nowarn = warning_suppressed_p (stmt, OPT_Wstringop_overflow_); + tree dstsize; - if (!nowarn && found_dstsize) + if (!nowarn && compute_builtin_object_size (dst, 1, &dstsize) + && TREE_CODE (dstsize) == INTEGER_CST) { - int cmpdst = compare_tree_int (len, dstsize); + int cmpdst = tree_int_cst_compare (len, dstsize); if (cmpdst >= 0) { @@ -2519,7 +2518,7 @@ gimple_fold_builtin_strncat (gimple_stmt_iterator *gsi) ? 
G_("%qD specified bound %E equals " "destination size") : G_("%qD specified bound %E exceeds " - "destination size %wu"), + "destination size %E"), fndecl, len, dstsize); if (nowarn) suppress_warning (stmt, OPT_Wstringop_overflow_); diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-1.c b/gcc/testsuite/gcc.dg/builtin-object-size-1.c index 8cdae49a6b1..0154f4e9695 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-1.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-1.c @@ -424,6 +424,35 @@ test8 (void) abort (); } +void +__attribute__ ((noinline)) +test9 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 0) != 10) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 0) != 10) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 0) != sizeof (y)) + abort (); +} + int main (void) { @@ -437,5 +466,6 @@ main (void) test6 (4); test7 (); test8 (); + test9 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-2.c b/gcc/testsuite/gcc.dg/builtin-object-size-2.c index ad2dd296a9a..5cf29291aff 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-2.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-2.c @@ -382,6 +382,35 @@ test7 (void) abort (); } +void +__attribute__ ((noinline)) +test8 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 1) != 10) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 1) != 10) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 1) != sizeof (y.c)) + abort (); +} + int main (void) { @@ -394,5 +423,6 @@ main (void) test5 (4); test6 (); test7 (); + test8 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-3.c 
b/gcc/testsuite/gcc.dg/builtin-object-size-3.c index d5ca5047ee9..3a692c4e3d2 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-3.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-3.c @@ -430,6 +430,36 @@ test8 (void) abort (); } +void +__attribute__ ((noinline)) +test9 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 2) != 6) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 2) != 2) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 2) + != sizeof (y) - __builtin_offsetof (struct A, c) - 8) + abort (); +} + int main (void) { @@ -443,5 +473,6 @@ main (void) test6 (4); test7 (); test8 (); + test9 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-4.c b/gcc/testsuite/gcc.dg/builtin-object-size-4.c index 9f159e36a0f..87381620cc9 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-4.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-4.c @@ -395,6 +395,35 @@ test7 (void) abort (); } +void +__attribute__ ((noinline)) +test8 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 3) != 6) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 3) != 2) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 3) != sizeof (y.c) - 8) + abort (); +} + int main (void) { @@ -407,5 +436,6 @@ main (void) test5 (4); test6 (); test7 (); + test8 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-5.c b/gcc/testsuite/gcc.dg/builtin-object-size-5.c index 7c274cdfd42..8e63d9c7a5e 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-5.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-5.c @@ -53,4 +53,29 @@ test4 (size_t x) abort (); } +void +test5 (void) +{ + char *p 
= &buf[0x90000004]; + if (__builtin_object_size (p + 2, 0) != 0) + abort (); +} + +void +test6 (void) +{ + char *p = &buf[-4]; + if (__builtin_object_size (p + 2, 0) != 0) + abort (); +} + +void +test7 (void) +{ + char *buf2 = __builtin_malloc (8); + char *p = &buf2[0x90000004]; + if (__builtin_object_size (p + 2, 0) != 0) + abort (); +} + /* { dg-final { scan-assembler-not "abort" } } */ diff --git a/gcc/tree-object-size.c b/gcc/tree-object-size.c index b4881ef198f..32ef6dd5133 100644 --- a/gcc/tree-object-size.c +++ b/gcc/tree-object-size.c @@ -45,6 +45,14 @@ struct object_size_info unsigned int *stack, *tos; }; +struct GTY(()) object_size +{ + /* Estimate of bytes till the end of the object. */ + tree size; + /* Estimate of the size of the whole object. */ + tree wholesize; +}; + enum { OST_SUBOBJECT = 1, @@ -54,13 +62,12 @@ enum static tree compute_object_offset (const_tree, const_tree); static bool addr_object_size (struct object_size_info *, - const_tree, int, unsigned HOST_WIDE_INT *); -static unsigned HOST_WIDE_INT alloc_object_size (const gcall *, int); + const_tree, int, tree *, tree *t = NULL); +static tree alloc_object_size (const gcall *, int); static tree pass_through_call (const gcall *); static void collect_object_sizes_for (struct object_size_info *, tree); static void expr_object_size (struct object_size_info *, tree, tree); -static bool merge_object_sizes (struct object_size_info *, tree, tree, - unsigned HOST_WIDE_INT); +static bool merge_object_sizes (struct object_size_info *, tree, tree); static bool plus_stmt_object_size (struct object_size_info *, tree, gimple *); static bool cond_expr_object_size (struct object_size_info *, tree, gimple *); static void init_offset_limit (void); @@ -68,13 +75,13 @@ static void check_for_plus_in_loops (struct object_size_info *, tree); static void check_for_plus_in_loops_1 (struct object_size_info *, tree, unsigned int); -/* object_sizes[0] is upper bound for number of bytes till the end of - the object. 
- object_sizes[1] is upper bound for number of bytes till the end of - the subobject (innermost array or field with address taken). - object_sizes[2] is lower bound for number of bytes till the end of - the object and object_sizes[3] lower bound for subobject. */ -static vec<unsigned HOST_WIDE_INT> object_sizes[OST_END]; +/* object_sizes[0] is upper bound for the object size and number of bytes till + the end of the object. + object_sizes[1] is upper bound for the object size and number of bytes till + the end of the subobject (innermost array or field with address taken). + object_sizes[2] is lower bound for the object size and number of bytes till + the end of the object and object_sizes[3] lower bound for subobject. */ +static vec<object_size> object_sizes[OST_END]; /* Bitmaps what object sizes have been computed already. */ static bitmap computed[OST_END]; @@ -82,10 +89,46 @@ static bitmap computed[OST_END]; /* Maximum value of offset we consider to be addition. */ static unsigned HOST_WIDE_INT offset_limit; +/* Initial value of object sizes; zero for maximum and SIZE_MAX for minimum + object size. */ + +static inline unsigned HOST_WIDE_INT +initval (int object_size_type) +{ + return (object_size_type & OST_MINIMUM) ? HOST_WIDE_INT_M1U : 0; +} + +/* Unknown object size value; it's the opposite of initval. */ + static inline unsigned HOST_WIDE_INT unknown (int object_size_type) { - return ((unsigned HOST_WIDE_INT) -((object_size_type >> 1) ^ 1)); + return ~initval (object_size_type); +} + +/* Return true if VAL is represents an unknown size for OBJECT_SIZE_TYPE. */ + +static inline bool +size_unknown_p (tree val, int object_size_type) +{ + return (tree_fits_uhwi_p (val) + && tree_to_uhwi (val) == unknown (object_size_type)); +} + +/* Return a tree with initial value for OBJECT_SIZE_TYPE. */ + +static inline tree +size_initval (int object_size_type) +{ + return size_int (initval (object_size_type)); +} + +/* Return a tree with unknown value for OBJECT_SIZE_TYPE. 
*/ + +static inline tree +size_unknown (int object_size_type) +{ + return size_int (unknown (object_size_type)); } /* Grow object_sizes[OBJECT_SIZE_TYPE] to num_ssa_names. */ @@ -110,47 +153,57 @@ object_sizes_release (int object_size_type) static inline bool object_sizes_unknown_p (int object_size_type, unsigned varno) { - return (object_sizes[object_size_type][varno] - == unknown (object_size_type)); + return size_unknown_p (object_sizes[object_size_type][varno].size, + object_size_type); } -/* Return size for VARNO corresponding to OSI. */ +/* Return size for VARNO corresponding to OSI. If WHOLE is true, return the + whole object size. */ -static inline unsigned HOST_WIDE_INT -object_sizes_get (struct object_size_info *osi, unsigned varno) +static inline tree +object_sizes_get (struct object_size_info *osi, unsigned varno, + bool whole = false) { - return object_sizes[osi->object_size_type][varno]; + if (whole) + return object_sizes[osi->object_size_type][varno].wholesize; + else + return object_sizes[osi->object_size_type][varno].size; } /* Set size for VARNO corresponding to OSI to VAL. */ -static inline bool -object_sizes_set_force (struct object_size_info *osi, unsigned varno, - unsigned HOST_WIDE_INT val) +static inline void +object_sizes_initialize (struct object_size_info *osi, unsigned varno, + tree val, tree wholeval) { - object_sizes[osi->object_size_type][varno] = val; - return true; + int object_size_type = osi->object_size_type; + + object_sizes[object_size_type][varno].size = val; + object_sizes[object_size_type][varno].wholesize = wholeval; } /* Set size for VARNO corresponding to OSI to VAL if it is the new minimum or maximum. 
*/ static inline bool -object_sizes_set (struct object_size_info *osi, unsigned varno, - unsigned HOST_WIDE_INT val) +object_sizes_set (struct object_size_info *osi, unsigned varno, tree val, + tree wholeval) { int object_size_type = osi->object_size_type; - if ((object_size_type & OST_MINIMUM) == 0) - { - if (object_sizes[object_size_type][varno] < val) - return object_sizes_set_force (osi, varno, val); - } - else - { - if (object_sizes[object_size_type][varno] > val) - return object_sizes_set_force (osi, varno, val); - } - return false; + object_size osize = object_sizes[object_size_type][varno]; + + tree oldval = osize.size; + tree old_wholeval = osize.wholesize; + + enum tree_code code = object_size_type & OST_MINIMUM ? MIN_EXPR : MAX_EXPR; + + val = size_binop (code, val, oldval); + wholeval = size_binop (code, wholeval, old_wholeval); + + object_sizes[object_size_type][varno].size = val; + object_sizes[object_size_type][varno].wholesize = wholeval; + return (tree_int_cst_compare (oldval, val) != 0 + || tree_int_cst_compare (old_wholeval, wholeval) != 0); } /* Initialize OFFSET_LIMIT variable. */ @@ -164,6 +217,48 @@ init_offset_limit (void) offset_limit /= 2; } +/* Bytes at end of the object with SZ from offset OFFSET. If WHOLESIZE is not + NULL_TREE, use it to get the net offset of the pointer, which should always + be positive and hence, be within OFFSET_LIMIT for valid offsets. */ + +static tree +size_for_offset (tree sz, tree offset, tree wholesize = NULL_TREE) +{ + gcc_checking_assert (TREE_CODE (offset) == INTEGER_CST); + gcc_checking_assert (TREE_CODE (sz) == INTEGER_CST); + gcc_checking_assert (types_compatible_p (TREE_TYPE (sz), sizetype)); + + /* For negative offsets, if we have a distinct WHOLESIZE, use it to get a net + offset from the whole object. 
*/ + if (wholesize && tree_int_cst_compare (sz, wholesize)) + { + gcc_checking_assert (TREE_CODE (wholesize) == INTEGER_CST); + gcc_checking_assert (types_compatible_p (TREE_TYPE (wholesize), + sizetype)); + + /* Restructure SZ - OFFSET as + WHOLESIZE - (WHOLESIZE + OFFSET - SZ) so that the offset part, i.e. + WHOLESIZE + OFFSET - SZ is only allowed to be positive. */ + tree tmp = size_binop (MAX_EXPR, wholesize, sz); + offset = fold_build2 (PLUS_EXPR, sizetype, tmp, offset); + offset = fold_build2 (MINUS_EXPR, sizetype, offset, sz); + sz = tmp; + } + + /* Safe to convert now, since a valid net offset should be non-negative. */ + if (!types_compatible_p (TREE_TYPE (offset), sizetype)) + fold_convert (sizetype, offset); + + if (integer_zerop (offset)) + return sz; + + /* Negative or too large offset even after adjustment, cannot be within + bounds of an object. */ + if (compare_tree_int (offset, offset_limit) > 0) + return size_zero_node; + + return size_binop (MINUS_EXPR, size_binop (MAX_EXPR, sz, offset), offset); +} /* Compute offset of EXPR within VAR. Return error_mark_node if unknown. */ @@ -274,19 +369,22 @@ decl_init_size (tree decl, bool min) /* Compute __builtin_object_size for PTR, which is a ADDR_EXPR. OBJECT_SIZE_TYPE is the second argument from __builtin_object_size. - If unknown, return unknown (object_size_type). */ + If unknown, return size_unknown (object_size_type). */ static bool addr_object_size (struct object_size_info *osi, const_tree ptr, - int object_size_type, unsigned HOST_WIDE_INT *psize) + int object_size_type, tree *psize, tree *pwholesize) { - tree pt_var, pt_var_size = NULL_TREE, var_size, bytes; + tree pt_var, pt_var_size = NULL_TREE, pt_var_wholesize = NULL_TREE; + tree var_size, bytes, wholebytes; gcc_assert (TREE_CODE (ptr) == ADDR_EXPR); /* Set to unknown and overwrite just before returning if the size could be determined. 
*/ - *psize = unknown (object_size_type); + *psize = size_unknown (object_size_type); + if (pwholesize) + *pwholesize = size_unknown (object_size_type); pt_var = TREE_OPERAND (ptr, 0); while (handled_component_p (pt_var)) @@ -297,13 +395,14 @@ addr_object_size (struct object_size_info *osi, const_tree ptr, if (TREE_CODE (pt_var) == MEM_REF) { - unsigned HOST_WIDE_INT sz; + tree sz, wholesize; if (!osi || (object_size_type & OST_SUBOBJECT) != 0 || TREE_CODE (TREE_OPERAND (pt_var, 0)) != SSA_NAME) { compute_builtin_object_size (TREE_OPERAND (pt_var, 0), object_size_type & ~OST_SUBOBJECT, &sz); + wholesize = sz; } else { @@ -312,46 +411,47 @@ addr_object_size (struct object_size_info *osi, const_tree ptr, collect_object_sizes_for (osi, var); if (bitmap_bit_p (computed[object_size_type], SSA_NAME_VERSION (var))) - sz = object_sizes_get (osi, SSA_NAME_VERSION (var)); + { + sz = object_sizes_get (osi, SSA_NAME_VERSION (var)); + wholesize = object_sizes_get (osi, SSA_NAME_VERSION (var), true); + } else - sz = unknown (object_size_type); + sz = wholesize = size_unknown (object_size_type); } - if (sz != unknown (object_size_type)) + if (!size_unknown_p (sz, object_size_type)) { - offset_int mem_offset; - if (mem_ref_offset (pt_var).is_constant (&mem_offset)) - { - offset_int dsz = wi::sub (sz, mem_offset); - if (wi::neg_p (dsz)) - sz = 0; - else if (wi::fits_uhwi_p (dsz)) - sz = dsz.to_uhwi (); - else - sz = unknown (object_size_type); - } + tree offset = TREE_OPERAND (pt_var, 1); + if (TREE_CODE (offset) != INTEGER_CST + || TREE_CODE (sz) != INTEGER_CST) + sz = wholesize = size_unknown (object_size_type); else - sz = unknown (object_size_type); + sz = size_for_offset (sz, offset, wholesize); } - if (sz != unknown (object_size_type) && sz < offset_limit) - pt_var_size = size_int (sz); + if (!size_unknown_p (sz, object_size_type) + && TREE_CODE (sz) == INTEGER_CST + && compare_tree_int (sz, offset_limit) < 0) + { + pt_var_size = sz; + pt_var_wholesize = wholesize; + } } else 
if (DECL_P (pt_var)) { - pt_var_size = decl_init_size (pt_var, object_size_type & OST_MINIMUM); + pt_var_size = pt_var_wholesize + = decl_init_size (pt_var, object_size_type & OST_MINIMUM); if (!pt_var_size) return false; } else if (TREE_CODE (pt_var) == STRING_CST) - pt_var_size = TYPE_SIZE_UNIT (TREE_TYPE (pt_var)); + pt_var_size = pt_var_wholesize = TYPE_SIZE_UNIT (TREE_TYPE (pt_var)); else return false; if (pt_var_size) { /* Validate the size determined above. */ - if (!tree_fits_uhwi_p (pt_var_size) - || tree_to_uhwi (pt_var_size) >= offset_limit) + if (compare_tree_int (pt_var_size, offset_limit) >= 0) return false; } @@ -496,28 +596,35 @@ addr_object_size (struct object_size_info *osi, const_tree ptr, bytes = size_binop (MIN_EXPR, bytes, bytes2); } } + + wholebytes + = object_size_type & OST_SUBOBJECT ? var_size : pt_var_wholesize; } else if (!pt_var_size) return false; else - bytes = pt_var_size; - - if (tree_fits_uhwi_p (bytes)) { - *psize = tree_to_uhwi (bytes); - return true; + bytes = pt_var_size; + wholebytes = pt_var_wholesize; } - return false; + if (TREE_CODE (bytes) != INTEGER_CST + || TREE_CODE (wholebytes) != INTEGER_CST) + return false; + + *psize = bytes; + if (pwholesize) + *pwholesize = wholebytes; + return true; } /* Compute __builtin_object_size for CALL, which is a GIMPLE_CALL. Handles calls to functions declared with attribute alloc_size. OBJECT_SIZE_TYPE is the second argument from __builtin_object_size. - If unknown, return unknown (object_size_type). */ + If unknown, return size_unknown (object_size_type). */ -static unsigned HOST_WIDE_INT +static tree alloc_object_size (const gcall *call, int object_size_type) { gcc_assert (is_gimple_call (call)); @@ -529,7 +636,7 @@ alloc_object_size (const gcall *call, int object_size_type) calltype = gimple_call_fntype (call); if (!calltype) - return unknown (object_size_type); + return size_unknown (object_size_type); /* Set to positions of alloc_size arguments. 
*/ int arg1 = -1, arg2 = -1; @@ -549,7 +656,7 @@ alloc_object_size (const gcall *call, int object_size_type) || (arg2 >= 0 && (arg2 >= (int)gimple_call_num_args (call) || TREE_CODE (gimple_call_arg (call, arg2)) != INTEGER_CST))) - return unknown (object_size_type); + return size_unknown (object_size_type); tree bytes = NULL_TREE; if (arg2 >= 0) @@ -559,10 +666,7 @@ alloc_object_size (const gcall *call, int object_size_type) else if (arg1 >= 0) bytes = fold_convert (sizetype, gimple_call_arg (call, arg1)); - if (bytes && tree_fits_uhwi_p (bytes)) - return tree_to_uhwi (bytes); - - return unknown (object_size_type); + return bytes; } @@ -598,13 +702,13 @@ pass_through_call (const gcall *call) bool compute_builtin_object_size (tree ptr, int object_size_type, - unsigned HOST_WIDE_INT *psize) + tree *psize) { gcc_assert (object_size_type >= 0 && object_size_type < OST_END); /* Set to unknown and overwrite just before returning if the size could be determined. */ - *psize = unknown (object_size_type); + *psize = size_unknown (object_size_type); if (! offset_limit) init_offset_limit (); @@ -638,8 +742,7 @@ compute_builtin_object_size (tree ptr, int object_size_type, psize)) { /* Return zero when the offset is out of bounds. */ - unsigned HOST_WIDE_INT off = tree_to_shwi (offset); - *psize = off < *psize ? *psize - off : 0; + *psize = size_for_offset (*psize, offset); return true; } } @@ -747,12 +850,13 @@ compute_builtin_object_size (tree ptr, int object_size_type, print_generic_expr (dump_file, ssa_name (i), dump_flags); fprintf (dump_file, - ": %s %sobject size " - HOST_WIDE_INT_PRINT_UNSIGNED "\n", + ": %s %sobject size ", ((object_size_type & OST_MINIMUM) ? "minimum" : "maximum"), - (object_size_type & OST_SUBOBJECT) ? "sub" : "", - object_sizes_get (&osi, i)); + (object_size_type & OST_SUBOBJECT) ? 
"sub" : ""); + print_generic_expr (dump_file, object_sizes_get (&osi, i), + dump_flags); + fprintf (dump_file, "\n"); } } @@ -761,7 +865,7 @@ compute_builtin_object_size (tree ptr, int object_size_type, } *psize = object_sizes_get (&osi, SSA_NAME_VERSION (ptr)); - return *psize != unknown (object_size_type); + return !size_unknown_p (*psize, object_size_type); } /* Compute object_sizes for PTR, defined to VALUE, which is not an SSA_NAME. */ @@ -771,7 +875,7 @@ expr_object_size (struct object_size_info *osi, tree ptr, tree value) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (ptr); - unsigned HOST_WIDE_INT bytes; + tree bytes, wholesize; gcc_assert (!object_sizes_unknown_p (object_size_type, varno)); gcc_assert (osi->pass == 0); @@ -784,11 +888,11 @@ expr_object_size (struct object_size_info *osi, tree ptr, tree value) || !POINTER_TYPE_P (TREE_TYPE (value))); if (TREE_CODE (value) == ADDR_EXPR) - addr_object_size (osi, value, object_size_type, &bytes); + addr_object_size (osi, value, object_size_type, &bytes, &wholesize); else - bytes = unknown (object_size_type); + bytes = wholesize = size_unknown (object_size_type); - object_sizes_set (osi, varno, bytes); + object_sizes_set (osi, varno, bytes, wholesize); } @@ -799,16 +903,14 @@ call_object_size (struct object_size_info *osi, tree ptr, gcall *call) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (ptr); - unsigned HOST_WIDE_INT bytes; gcc_assert (is_gimple_call (call)); gcc_assert (!object_sizes_unknown_p (object_size_type, varno)); gcc_assert (osi->pass == 0); + tree bytes = alloc_object_size (call, object_size_type); - bytes = alloc_object_size (call, object_size_type); - - object_sizes_set (osi, varno, bytes); + object_sizes_set (osi, varno, bytes, bytes); } @@ -822,8 +924,9 @@ unknown_object_size (struct object_size_info *osi, tree ptr) gcc_checking_assert (!object_sizes_unknown_p (object_size_type, varno)); gcc_checking_assert 
(osi->pass == 0); + tree bytes = size_unknown (object_size_type); - object_sizes_set (osi, varno, unknown (object_size_type)); + object_sizes_set (osi, varno, bytes, bytes); } @@ -831,30 +934,22 @@ unknown_object_size (struct object_size_info *osi, tree ptr) the object size might need reexamination later. */ static bool -merge_object_sizes (struct object_size_info *osi, tree dest, tree orig, - unsigned HOST_WIDE_INT offset) +merge_object_sizes (struct object_size_info *osi, tree dest, tree orig) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (dest); - unsigned HOST_WIDE_INT orig_bytes; + tree orig_bytes, wholesize; if (object_sizes_unknown_p (object_size_type, varno)) return false; - if (offset >= offset_limit) - { - object_sizes_set (osi, varno, unknown (object_size_type)); - return false; - } if (osi->pass == 0) collect_object_sizes_for (osi, orig); orig_bytes = object_sizes_get (osi, SSA_NAME_VERSION (orig)); - if (orig_bytes != unknown (object_size_type)) - orig_bytes = (offset > orig_bytes) - ? HOST_WIDE_INT_0U : orig_bytes - offset; + wholesize = object_sizes_get (osi, SSA_NAME_VERSION (orig), true); - if (object_sizes_set (osi, varno, orig_bytes)) + if (object_sizes_set (osi, varno, orig_bytes, wholesize)) osi->changed = true; return bitmap_bit_p (osi->reexamine, SSA_NAME_VERSION (orig)); @@ -870,8 +965,9 @@ plus_stmt_object_size (struct object_size_info *osi, tree var, gimple *stmt) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (var); - unsigned HOST_WIDE_INT bytes; + tree bytes, wholesize; tree op0, op1; + bool reexamine = false; if (gimple_assign_rhs_code (stmt) == POINTER_PLUS_EXPR) { @@ -896,31 +992,38 @@ plus_stmt_object_size (struct object_size_info *osi, tree var, gimple *stmt) && (TREE_CODE (op0) == SSA_NAME || TREE_CODE (op0) == ADDR_EXPR)) { - if (! 
tree_fits_uhwi_p (op1)) - bytes = unknown (object_size_type); - else if (TREE_CODE (op0) == SSA_NAME) - return merge_object_sizes (osi, var, op0, tree_to_uhwi (op1)); + if (TREE_CODE (op0) == SSA_NAME) + { + if (osi->pass == 0) + collect_object_sizes_for (osi, op0); + + bytes = object_sizes_get (osi, SSA_NAME_VERSION (op0)); + wholesize = object_sizes_get (osi, SSA_NAME_VERSION (op0), true); + reexamine = bitmap_bit_p (osi->reexamine, SSA_NAME_VERSION (op0)); + } else { - unsigned HOST_WIDE_INT off = tree_to_uhwi (op1); - - /* op0 will be ADDR_EXPR here. */ - addr_object_size (osi, op0, object_size_type, &bytes); - if (bytes == unknown (object_size_type)) - ; - else if (off > offset_limit) - bytes = unknown (object_size_type); - else if (off > bytes) - bytes = 0; - else - bytes -= off; + /* op0 will be ADDR_EXPR here. We should never come here during + reexamination. */ + gcc_checking_assert (osi->pass == 0); + addr_object_size (osi, op0, object_size_type, &bytes, &wholesize); } + + /* In the first pass, do not compute size for offset if either the + maximum size is unknown or the minimum size is not initialized yet; + the latter indicates a dependency loop and will be resolved in + subsequent passes. We attempt to compute offset for 0 minimum size + too because a negative offset could be within bounds of WHOLESIZE, + giving a non-zero result for VAR. */ + if (osi->pass != 0 || !size_unknown_p (bytes, 0)) + bytes = size_for_offset (bytes, op1, wholesize); } else - bytes = unknown (object_size_type); </cut>
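The offset handling that the patch adds in size_for_offset can be followed more easily on plain integers. What follows is only a minimal sketch of that arithmetic, not the GCC code itself: the names model_size_for_offset, model_max and MODEL_OFFSET_LIMIT are hypothetical, size_t stands in for sizetype trees, and the limit is assumed to be half of the size_t range, in the spirit of init_offset_limit.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for GCC's offset_limit, taken here as half of the
   largest size_t value.  */
#define MODEL_OFFSET_LIMIT (SIZE_MAX / 2)

static size_t
model_max (size_t a, size_t b)
{
  return a > b ? a : b;
}

/* Hypothetical integer model of size_for_offset.  SZ is the number of bytes
   from the pointer to the end of the object, OFFSET is the pointer
   adjustment (a negative adjustment arrives as a wrapped size_t value) and
   WHOLESIZE is the size of the whole object.  HAVE_WHOLE is zero when no
   whole-object size is available.  */
static size_t
model_size_for_offset (size_t sz, size_t offset, size_t wholesize,
                       int have_whole)
{
  if (have_whole && sz != wholesize)
    {
      /* Rewrite SZ - OFFSET as TMP - (TMP + OFFSET - SZ), so that a valid
         net offset from the start of the object is non-negative.  */
      size_t tmp = model_max (wholesize, sz);
      offset = tmp + offset - sz;   /* Unsigned arithmetic wraps.  */
      sz = tmp;
    }

  if (offset == 0)
    return sz;

  /* Still negative or too large: cannot be inside the object.  */
  if (offset > MODEL_OFFSET_LIMIT)
    return 0;

  return model_max (sz, offset) - offset;
}
```

With sz = 2, wholesize = 10 and an offset of -4, i.e. model_size_for_offset (2, (size_t) -4, 10, 1), the model yields 6, which is consistent with the value the first check in the new test9 of builtin-object-size-3.c expects for &p[-4]. Without a distinct whole size, the same negative offset exceeds the limit and the result is 0.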

4 years, 4 months


[TCWG CI] 464.h264ref slowed down by 4% after llvm: [SLP]Improve multinode analysis.

by ci_notify＠linaro.org

After llvm commit bd053769867f988500dc1b451c6439eefcf7643f
Author: Alexey Bataev <a.bataev(a)outlook.com>

    [SLP]Improve multinode analysis.

the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 4% from 11205 to 11640 perf samples

The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3 -flto
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
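The 4% figure reported for 464.h264ref follows from the two perf sample counts. As a minimal sketch of that arithmetic (the helper name relative_change_percent is hypothetical and not part of the TCWG CI scripts):

```c
/* Relative change from BEFORE to AFTER, in percent.  A positive result
   means more perf samples, i.e. a slowdown.  */
static double
relative_change_percent (double before, double after)
{
  return (after - before) / before * 100.0;
}
```

For the counts in this report, relative_change_percent (11205, 11640) is 435 / 11205 * 100, roughly 3.9, which rounds to the reported 4% and is above the 2% reporting threshold.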
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-bd053769867f988500dc1b451c6439eefcf7643f
cd investigate-llvm-bd053769867f988500dc1b451c6439eefcf7643f

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach bd053769867f988500dc1b451c6439eefcf7643f
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 135d5d4a6d37f30173c1b9ea85a3a969c364b241
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit bd053769867f988500dc1b451c6439eefcf7643f
Author: Alexey Bataev <a.bataev(a)outlook.com>
Date:   Tue Apr 6 08:35:52 2021 -0700

    [SLP]Improve multinode analysis.

    Changes the preliminary multinode analysis:
    1. Introduced scores for reversed loads/extractelements.
    2. Improved shallow score calculation.
    3. Lowered the cost of external uses (no need to consider it several times, just ones).
    4. The initial lane for analysis is the one with the minimal possible reorderings.

    These changes in general shall reduce compile time and improve the reordering in many cases.

    Part of D57059.

    Differential Revision: https://reviews.llvm.org/D101109
---
 llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp    | 253 ++++++++++++++++-----
 .../AArch64/transpose-inseltpoison.ll              |  30 +--
 .../Transforms/SLPVectorizer/AArch64/transpose.ll  |  30 +--
 .../AArch64/vectorize-free-extracts-inserts.ll     |  20 +-
 llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll  |   2 +-
 llvm/test/Transforms/SLPVectorizer/X86/addsub.ll   |  24 +-
 .../Transforms/SLPVectorizer/X86/commutativity.ll  |  20 +-
 .../SLPVectorizer/X86/crash_exceed_scheduling.ll   |   6 +-
 .../Transforms/SLPVectorizer/X86/crash_smallpt.ll  |  18 +-
 .../Transforms/SLPVectorizer/X86/extractelement.ll |   4 +-
 .../Transforms/SLPVectorizer/X86/insert-shuffle.ll |  34 ++-
 .../test/Transforms/SLPVectorizer/X86/lookahead.ll |  35 +--
 .../Transforms/SLPVectorizer/X86/operandorder.ll   |  44 ++--
 .../Transforms/SLPVectorizer/X86/store-jumbled.ll  |   4 +-
 .../SLPVectorizer/X86/stores_vectorize.ll          |   6 +-
 .../test/Transforms/SLPVectorizer/X86/supernode.ll |   2 +-
 16 files changed, 328 insertions(+), 204 deletions(-)

diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index d145b04c0694..c685432ae28e 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1016,18 +1016,25 @@ public: std::swap(OpsVec[OpIdx1][Lane], OpsVec[OpIdx2][Lane]); } - // The hard-coded scores listed here are not very important.
When computing - // the scores of matching one sub-tree with another, we are basically - // counting the number of values that are matching. So even if all scores - // are set to 1, we would still get a decent matching result. + // The hard-coded scores listed here are not very important, though it shall + // be higher for better matches to improve the resulting cost. When + // computing the scores of matching one sub-tree with another, we are + // basically counting the number of values that are matching. So even if all + // scores are set to 1, we would still get a decent matching result. // However, sometimes we have to break ties. For example we may have to // choose between matching loads vs matching opcodes. This is what these - // scores are helping us with: they provide the order of preference. + // scores are helping us with: they provide the order of preference. Also, + // this is important if the scalar is externally used or used in another + // tree entry node in the different lane. /// Loads from consecutive memory addresses, e.g. load(A[i]), load(A[i+1]). - static const int ScoreConsecutiveLoads = 3; + static const int ScoreConsecutiveLoads = 4; + /// Loads from reversed memory addresses, e.g. load(A[i+1]), load(A[i]). + static const int ScoreReversedLoads = 3; /// ExtractElementInst from same vector and consecutive indexes. - static const int ScoreConsecutiveExtracts = 3; + static const int ScoreConsecutiveExtracts = 4; + /// ExtractElementInst from same vector and reversed indices. + static const int ScoreReversedExtracts = 3; /// Constants. static const int ScoreConstants = 2; /// Instructions with the same opcode. @@ -1047,7 +1054,10 @@ public: /// \returns the score of placing \p V1 and \p V2 in consecutive lanes. 
static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL, - ScalarEvolution &SE) { + ScalarEvolution &SE, int NumLanes) { + if (V1 == V2) + return VLOperands::ScoreSplat; + auto *LI1 = dyn_cast<LoadInst>(V1); auto *LI2 = dyn_cast<LoadInst>(V2); if (LI1 && LI2) { @@ -1057,8 +1067,17 @@ public: Optional<int> Dist = getPointersDiff( LI1->getType(), LI1->getPointerOperand(), LI2->getType(), LI2->getPointerOperand(), DL, SE, /*StrictCheck=*/true); - return (Dist && *Dist == 1) ? VLOperands::ScoreConsecutiveLoads - : VLOperands::ScoreFail; + if (!Dist) + return VLOperands::ScoreFail; + // The distance is too large - still may be profitable to use masked + // loads/gathers. + if (std::abs(*Dist) > NumLanes / 2) + return VLOperands::ScoreAltOpcodes; + // This still will detect consecutive loads, but we might have "holes" + // in some cases. It is ok for non-power-2 vectorization and may produce + // better results. It should not affect current vectorization. + return (*Dist > 0) ? VLOperands::ScoreConsecutiveLoads + : VLOperands::ScoreReversedLoads; } auto *C1 = dyn_cast<Constant>(V1); @@ -1068,18 +1087,41 @@ public: // Extracts from consecutive indexes of the same vector better score as // the extracts could be optimized away. - Value *EV; - ConstantInt *Ex1Idx, *Ex2Idx; - if (match(V1, m_ExtractElt(m_Value(EV), m_ConstantInt(Ex1Idx))) && - match(V2, m_ExtractElt(m_Deferred(EV), m_ConstantInt(Ex2Idx))) && - Ex1Idx->getZExtValue() + 1 == Ex2Idx->getZExtValue()) - return VLOperands::ScoreConsecutiveExtracts; + Value *EV1; + ConstantInt *Ex1Idx; + if (match(V1, m_ExtractElt(m_Value(EV1), m_ConstantInt(Ex1Idx)))) { + // Undefs are always profitable for extractelements. + if (isa<UndefValue>(V2)) + return VLOperands::ScoreConsecutiveExtracts; + Value *EV2 = nullptr; + ConstantInt *Ex2Idx = nullptr; + if (match(V2, + m_ExtractElt(m_Value(EV2), m_CombineOr(m_ConstantInt(Ex2Idx), + m_Undef())))) { + // Undefs are always profitable for extractelements. 
+ if (!Ex2Idx) + return VLOperands::ScoreConsecutiveExtracts; + if (isUndefVector(EV2) && EV2->getType() == EV1->getType()) + return VLOperands::ScoreConsecutiveExtracts; + if (EV2 == EV1) { + int Idx1 = Ex1Idx->getZExtValue(); + int Idx2 = Ex2Idx->getZExtValue(); + int Dist = Idx2 - Idx1; + // The distance is too large - still may be profitable to use + // shuffles. + if (std::abs(Dist) > NumLanes / 2) + return VLOperands::ScoreAltOpcodes; + return (Dist > 0) ? VLOperands::ScoreConsecutiveExtracts + : VLOperands::ScoreReversedExtracts; + } + } + } auto *I1 = dyn_cast<Instruction>(V1); auto *I2 = dyn_cast<Instruction>(V2); if (I1 && I2) { - if (I1 == I2) - return VLOperands::ScoreSplat; + if (I1->getParent() != I2->getParent()) + return VLOperands::ScoreFail; InstructionsState S = getSameOpcode({I1, I2}); // Note: Only consider instructions with <= 2 operands to avoid // complexity explosion. @@ -1094,11 +1136,13 @@ public: return VLOperands::ScoreFail; } - /// Holds the values and their lane that are taking part in the look-ahead + /// Holds the values and their lanes that are taking part in the look-ahead /// score calculation. This is used in the external uses cost calculation. - SmallDenseMap<Value *, int> InLookAheadValues; + /// Need to hold all the lanes in case of splat/broadcast at least to + /// correctly check for the use in the different lane. + SmallDenseMap<Value *, SmallSet<int, 4>> InLookAheadValues; - /// \Returns the additinal cost due to uses of \p LHS and \p RHS that are + /// \returns the additional cost due to uses of \p LHS and \p RHS that are /// either external to the vectorized code, or require shuffling. int getExternalUsesCost(const std::pair<Value *, int> &LHS, const std::pair<Value *, int> &RHS) { @@ -1122,22 +1166,30 @@ public: for (User *U : V->users()) { if (const TreeEntry *UserTE = R.getTreeEntry(U)) { // The user is in the VectorizableTree. Check if we need to insert. 
- auto It = llvm::find(UserTE->Scalars, U); - assert(It != UserTE->Scalars.end() && "U is in UserTE"); - int UserLn = std::distance(UserTE->Scalars.begin(), It); + int UserLn = UserTE->findLaneForValue(U); assert(UserLn >= 0 && "Bad lane"); - if (UserLn != Ln) + // If the values are different, check just the line of the current + // value. If the values are the same, need to add UserInDiffLaneCost + // only if UserLn does not match both line numbers. + if ((LHS.first != RHS.first && UserLn != Ln) || + (LHS.first == RHS.first && UserLn != LHS.second && + UserLn != RHS.second)) { Cost += UserInDiffLaneCost; + break; + } } else { // Check if the user is in the look-ahead code. auto It2 = InLookAheadValues.find(U); if (It2 != InLookAheadValues.end()) { // The user is in the look-ahead code. Check the lane. - if (It2->second != Ln) + if (!It2->getSecond().contains(Ln)) { Cost += UserInDiffLaneCost; + break; + } } else { // The user is neither in SLP tree nor in the look-ahead code. Cost += ExternalUseCost; + break; } } // Limit the number of visited uses to cap compilation time. @@ -1176,32 +1228,36 @@ public: Value *V1 = LHS.first; Value *V2 = RHS.first; // Get the shallow score of V1 and V2. - int ShallowScoreAtThisLevel = - std::max((int)ScoreFail, getShallowScore(V1, V2, DL, SE) - - getExternalUsesCost(LHS, RHS)); + int ShallowScoreAtThisLevel = std::max( + (int)ScoreFail, getShallowScore(V1, V2, DL, SE, getNumLanes()) - + getExternalUsesCost(LHS, RHS)); int Lane1 = LHS.second; int Lane2 = RHS.second; // If reached MaxLevel, // or if V1 and V2 are not instructions, // or if they are SPLAT, - // or if they are not consecutive, early return the current cost. + // or if they are not consecutive, + // or if profitable to vectorize loads or extractelements, early return + // the current cost. 
auto *I1 = dyn_cast<Instruction>(V1); auto *I2 = dyn_cast<Instruction>(V2); if (CurrLevel == MaxLevel || !(I1 && I2) || I1 == I2 || ShallowScoreAtThisLevel == VLOperands::ScoreFail || - (isa<LoadInst>(I1) && isa<LoadInst>(I2) && ShallowScoreAtThisLevel)) + (((isa<LoadInst>(I1) && isa<LoadInst>(I2)) || + (isa<ExtractElementInst>(I1) && isa<ExtractElementInst>(I2))) && + ShallowScoreAtThisLevel)) return ShallowScoreAtThisLevel; assert(I1 && I2 && "Should have early exited."); // Keep track of in-tree values for determining the external-use cost. - InLookAheadValues[V1] = Lane1; - InLookAheadValues[V2] = Lane2; + InLookAheadValues[V1].insert(Lane1); + InLookAheadValues[V2].insert(Lane2); // Contains the I2 operand indexes that got matched with I1 operands. SmallSet<unsigned, 4> Op2Used; - // Recursion towards the operands of I1 and I2. We are trying all possbile + // Recursion towards the operands of I1 and I2. We are trying all possible // operand pairs, and keeping track of the best score. for (unsigned OpIdx1 = 0, NumOperands1 = I1->getNumOperands(); OpIdx1 != NumOperands1; ++OpIdx1) { @@ -1325,27 +1381,79 @@ public: return None; } - /// Helper for reorderOperandVecs. \Returns the lane that we should start - /// reordering from. This is the one which has the least number of operands - /// that can freely move about. + /// Helper for reorderOperandVecs. + /// \returns the lane that we should start reordering from. This is the one + /// which has the least number of operands that can freely move about or + /// less profitable because it already has the most optimal set of operands. 
     unsigned getBestLaneToStartReordering() const {
-      unsigned BestLane = 0;
       unsigned Min = UINT_MAX;
-      for (unsigned Lane = 0, NumLanes = getNumLanes(); Lane != NumLanes;
-           ++Lane) {
-        unsigned NumFreeOps = getMaxNumOperandsThatCanBeReordered(Lane);
-        if (NumFreeOps < Min) {
-          Min = NumFreeOps;
-          BestLane = Lane;
+      unsigned SameOpNumber = 0;
+      // std::pair<unsigned, unsigned> is used to implement a simple voting
+      // algorithm and choose the lane with the least number of operands that
+      // can freely move about or less profitable because it already has the
+      // most optimal set of operands. The first unsigned is a counter for
+      // voting, the second unsigned is the counter of lanes with instructions
+      // with same/alternate opcodes and same parent basic block.
+      MapVector<unsigned, std::pair<unsigned, unsigned>> HashMap;
+      // Try to be closer to the original results, if we have multiple lanes
+      // with same cost. If 2 lanes have the same cost, use the one with the
+      // lowest index.
+      for (int I = getNumLanes(); I > 0; --I) {
+        unsigned Lane = I - 1;
+        OperandsOrderData NumFreeOpsHash =
+            getMaxNumOperandsThatCanBeReordered(Lane);
+        // Compare the number of operands that can move and choose the one with
+        // the least number.
+        if (NumFreeOpsHash.NumOfAPOs < Min) {
+          Min = NumFreeOpsHash.NumOfAPOs;
+          SameOpNumber = NumFreeOpsHash.NumOpsWithSameOpcodeParent;
+          HashMap.clear();
+          HashMap[NumFreeOpsHash.Hash] = std::make_pair(1, Lane);
+        } else if (NumFreeOpsHash.NumOfAPOs == Min &&
+                   NumFreeOpsHash.NumOpsWithSameOpcodeParent < SameOpNumber) {
+          // Select the most optimal lane in terms of number of operands that
+          // should be moved around.
+          SameOpNumber = NumFreeOpsHash.NumOpsWithSameOpcodeParent;
+          HashMap[NumFreeOpsHash.Hash] = std::make_pair(1, Lane);
+        } else if (NumFreeOpsHash.NumOfAPOs == Min &&
+                   NumFreeOpsHash.NumOpsWithSameOpcodeParent == SameOpNumber) {
+          ++HashMap[NumFreeOpsHash.Hash].first;
+        }
+      }
+      // Select the lane with the minimum counter.
+      unsigned BestLane = 0;
+      unsigned CntMin = UINT_MAX;
+      for (const auto &Data : reverse(HashMap)) {
+        if (Data.second.first < CntMin) {
+          CntMin = Data.second.first;
+          BestLane = Data.second.second;
         }
       }
       return BestLane;
     }
-    /// \Returns the maximum number of operands that are allowed to be reordered
-    /// for \p Lane. This is used as a heuristic for selecting the first lane to
-    /// start operand reordering.
-    unsigned getMaxNumOperandsThatCanBeReordered(unsigned Lane) const {
+    /// Data structure that helps to reorder operands.
+    struct OperandsOrderData {
+      /// The best number of operands with the same APOs, which can be
+      /// reordered.
+      unsigned NumOfAPOs = UINT_MAX;
+      /// Number of operands with the same/alternate instruction opcode and
+      /// parent.
+      unsigned NumOpsWithSameOpcodeParent = 0;
+      /// Hash for the actual operands ordering.
+      /// Used to count operands, actually their position id and opcode
+      /// value. It is used in the voting mechanism to find the lane with the
+      /// least number of operands that can freely move about or less profitable
+      /// because it already has the most optimal set of operands. Can be
+      /// replaced with SmallVector<unsigned> instead but hash code is faster
+      /// and requires less memory.
+      unsigned Hash = 0;
+    };
+    /// \returns the maximum number of operands that are allowed to be reordered
+    /// for \p Lane and the number of compatible instructions(with the same
+    /// parent/opcode). This is used as a heuristic for selecting the first lane
+    /// to start operand reordering.
+    OperandsOrderData getMaxNumOperandsThatCanBeReordered(unsigned Lane) const {
       unsigned CntTrue = 0;
       unsigned NumOperands = getNumOperands();
       // Operands with the same APO can be reordered. We therefore need to count
@@ -1354,11 +1462,45 @@ public:
       // a map. Instead we can simply count the number of operands that
       // correspond to one of them (in this case the 'true' APO), and calculate
       // the other by subtracting it from the total number of operands.
-      for (unsigned OpIdx = 0; OpIdx != NumOperands; ++OpIdx)
-        if (getData(OpIdx, Lane).APO)
+      // Operands with the same instruction opcode and parent are more
+      // profitable since we don't need to move them in many cases, with a high
+      // probability such lane already can be vectorized effectively.
+      bool AllUndefs = true;
+      unsigned NumOpsWithSameOpcodeParent = 0;
+      Instruction *OpcodeI = nullptr;
+      BasicBlock *Parent = nullptr;
+      unsigned Hash = 0;
+      for (unsigned OpIdx = 0; OpIdx != NumOperands; ++OpIdx) {
+        const OperandData &OpData = getData(OpIdx, Lane);
+        if (OpData.APO)
           ++CntTrue;
-      unsigned CntFalse = NumOperands - CntTrue;
-      return std::max(CntTrue, CntFalse);
+        // Use Boyer-Moore majority voting for finding the majority opcode and
+        // the number of times it occurs.
+        if (auto *I = dyn_cast<Instruction>(OpData.V)) {
+          if (!OpcodeI || !getSameOpcode({OpcodeI, I}).getOpcode() ||
+              I->getParent() != Parent) {
+            if (NumOpsWithSameOpcodeParent == 0) {
+              NumOpsWithSameOpcodeParent = 1;
+              OpcodeI = I;
+              Parent = I->getParent();
+            } else {
+              --NumOpsWithSameOpcodeParent;
+            }
+          } else {
+            ++NumOpsWithSameOpcodeParent;
+          }
+        }
+        Hash = hash_combine(
+            Hash, hash_value((OpIdx + 1) * (OpData.V->getValueID() + 1)));
+        AllUndefs = AllUndefs && isa<UndefValue>(OpData.V);
+      }
+      if (AllUndefs)
+        return {};
+      OperandsOrderData Data;
+      Data.NumOfAPOs = std::max(CntTrue, NumOperands - CntTrue);
+      Data.NumOpsWithSameOpcodeParent = NumOpsWithSameOpcodeParent;
+      Data.Hash = Hash;
+      return Data;
     }
     /// Go through the instructions in VL and append their operands.
@@ -2876,7 +3018,8 @@ void BoUpSLP::reorderTopToBottom() {
   // their ordering.
   DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
   // Find all reorderable nodes with the given VF.
-  // Currently the are vectorized loads,extracts + some gathering of extracts.
+  // Currently the are vectorized stores,loads,extracts + some gathering of
+  // extracts.
for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders]( const std::unique_ptr<TreeEntry> &TE) { if (Optional<OrdersType> CurrentOrder = @@ -3497,11 +3640,9 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth, } } - // If any of the scalars is marked as a value that needs to stay scalar, then - // we need to gather the scalars. // The reduction nodes (stored in UserIgnoreList) also should stay scalar. for (Value *V : VL) { - if (MustGather.count(V) || is_contained(UserIgnoreList, V)) { + if (is_contained(UserIgnoreList, V)) { LLVM_DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n"); if (TryToFindDuplicates(S)) newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx, diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll index fa95ec7357aa..c8aa06677f8f 100644 --- a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll @@ -167,25 +167,17 @@ define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) { define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) { ; CHECK-LABEL: @build_vec_v4i32_3_binops( -; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> [[V0:%.*]], i64 0 -; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> [[V0]], i64 1 -; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> [[V1:%.*]], i64 0 -; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> [[V1]], i64 1 -; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1_0:%.*]] = mul i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP1_1:%.*]] = mul i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1:%.*]] = xor <2 x i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP3:%.*]] = xor <2 x 
i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> -; CHECK-NEXT: [[TMP2_0:%.*]] = add i32 [[TMP0_0]], [[TMP0_1]] -; CHECK-NEXT: [[TMP2_1:%.*]] = add i32 [[TMP1_0]], [[TMP1_1]] -; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP2]], [[TMP4]] -; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <4 x i32> poison, i32 [[TMP2_0]], i64 0 -; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <4 x i32> [[TMP3_0]], i32 [[TMP2_1]], i64 1 -; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <4 x i32> [[TMP3_1]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> +; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]] +; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2> +; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3> +; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> zeroinitializer +; CHECK-NEXT: [[TMP7:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> +; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]] +; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]] +; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; CHECK-NEXT: ret <4 x i32> [[TMP3_31]] ; %v0.0 = extractelement <2 x i32> %v0, i32 0 diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll index dcfdbee9bc5f..307480ce8018 100644 --- a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll +++ 
b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll @@ -167,25 +167,17 @@ define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) { define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) { ; CHECK-LABEL: @build_vec_v4i32_3_binops( -; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> [[V0:%.*]], i64 0 -; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> [[V0]], i64 1 -; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> [[V1:%.*]], i64 0 -; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> [[V1]], i64 1 -; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1_0:%.*]] = mul i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP1_1:%.*]] = mul i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1:%.*]] = xor <2 x i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP3:%.*]] = xor <2 x i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> -; CHECK-NEXT: [[TMP2_0:%.*]] = add i32 [[TMP0_0]], [[TMP0_1]] -; CHECK-NEXT: [[TMP2_1:%.*]] = add i32 [[TMP1_0]], [[TMP1_1]] -; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP2]], [[TMP4]] -; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <4 x i32> undef, i32 [[TMP2_0]], i64 0 -; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <4 x i32> [[TMP3_0]], i32 [[TMP2_1]], i64 1 -; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <4 x i32> [[TMP3_1]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> +; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]] +; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2> +; CHECK-NEXT: 
[[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3> +; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> zeroinitializer +; CHECK-NEXT: [[TMP7:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> +; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]] +; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]] +; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; CHECK-NEXT: ret <4 x i32> [[TMP3_31]] ; %v0.0 = extractelement <2 x i32> %v0, i32 0 diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll index d7ef813d6b72..b79d2d494aa4 100644 --- a/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll @@ -282,19 +282,21 @@ define void @extracts_jumbled_4_lanes(<9 x double>* %ptr.1, <4 x double>* %ptr.2 ; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0 ; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1 ; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2 -; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]] -; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_1]] -; CHECK-NEXT: [[A_LANE_2:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_2]] -; CHECK-NEXT: [[A_LANE_3:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_0]] -; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[A_LANE_0]], i32 0 -; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1 -; CHECK-NEXT: 
[[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2 -; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3 +; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x double> poison, double [[V1_LANE_0]], i32 0 +; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> [[TMP0]], double [[V1_LANE_2]], i32 1 +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x double> [[TMP1]], double [[V1_LANE_1]], i32 2 +; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP2]], double [[V1_LANE_3]], i32 3 +; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x double> poison, double [[V2_LANE_2]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x double> [[TMP4]], double [[V2_LANE_1]], i32 1 +; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x double> [[TMP5]], double [[V2_LANE_2]], i32 2 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP6]], double [[V2_LANE_0]], i32 3 +; CHECK-NEXT: [[TMP8:%.*]] = fmul <4 x double> [[TMP3]], [[TMP7]] +; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x double> [[TMP8]], <4 x double> poison, <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> ; CHECK-NEXT: call void @use(double [[V1_LANE_0]]) ; CHECK-NEXT: call void @use(double [[V1_LANE_1]]) ; CHECK-NEXT: call void @use(double [[V1_LANE_2]]) ; CHECK-NEXT: call void @use(double [[V1_LANE_3]]) -; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8 +; CHECK-NEXT: store <9 x double> [[TMP9]], <9 x double>* [[PTR_1]], align 8 ; CHECK-NEXT: ret void ; bb: diff --git a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll index 51a6e1ed81b1..7668747a75ac 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll @@ -1,6 +1,6 @@ ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py ; RUN: opt -slp-vectorizer -S < %s 
-mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-6 | FileCheck %s --check-prefix=CHECK -; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-8 -slp-min-tree-size=6 | FileCheck %s --check-prefix=FORCE_REDUCTION +; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-7 -slp-min-tree-size=6 | FileCheck %s --check-prefix=FORCE_REDUCTION define void @Test(i32) { ; CHECK-LABEL: @Test( diff --git a/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll b/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll index c9cb8951e882..ebbbefc9f81f 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll @@ -342,18 +342,18 @@ define void @vec_shuff_reorder() #0 { ; CHECK-LABEL: @vec_shuff_reorder( ; CHECK-NEXT: [[TMP1:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4 ; CHECK-NEXT: [[TMP2:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4 -; CHECK-NEXT: [[TMP3:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1) to <2 x float>*), align 4 -; CHECK-NEXT: [[TMP4:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1) to <2 x float>*), align 4 -; CHECK-NEXT: [[TMP5:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 3), align 4 -; CHECK-NEXT: [[TMP6:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 3), align 4 -; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> poison, float [[TMP2]], i32 0 -; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x float> [[TMP7]], <4 x float> 
[[TMP8]], <4 x i32> <i32 0, i32 4, i32 5, i32 3> -; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x float> [[TMP9]], float [[TMP5]], i32 3 -; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x float> poison, float [[TMP1]], i32 0 -; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <4 x float> [[TMP11]], <4 x float> [[TMP12]], <4 x i32> <i32 0, i32 4, i32 5, i32 3> -; CHECK-NEXT: [[TMP14:%.*]] = insertelement <4 x float> [[TMP13]], float [[TMP6]], i32 3 +; CHECK-NEXT: [[TMP3:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4 +; CHECK-NEXT: [[TMP4:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4 +; CHECK-NEXT: [[TMP5:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 2) to <2 x float>*), align 4 +; CHECK-NEXT: [[TMP6:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 2) to <2 x float>*), align 4 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> poison, float [[TMP1]], i32 0 +; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[TMP7]], float [[TMP3]], i32 1 +; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <2 x float> [[TMP5]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> +; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x float> [[TMP8]], <4 x float> [[TMP9]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> +; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x float> poison, float [[TMP2]], i32 0 +; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x float> [[TMP11]], float [[TMP4]], i32 1 +; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> +; CHECK-NEXT: [[TMP14:%.*]] = shufflevector <4 x float> [[TMP12]], <4 x 
float> [[TMP13]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> ; CHECK-NEXT: [[TMP15:%.*]] = fadd <4 x float> [[TMP10]], [[TMP14]] ; CHECK-NEXT: [[TMP16:%.*]] = fsub <4 x float> [[TMP10]], [[TMP14]] ; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <4 x float> [[TMP15]], <4 x float> [[TMP16]], <4 x i32> <i32 0, i32 5, i32 2, i32 7> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll b/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll index d23dc9b1d822..1a218ae02aef 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll @@ -16,21 +16,21 @@ define void @splat(i8 %a, i8 %b, i8 %c) { ; SSE-LABEL: @splat( -; SSE-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 -; SSE-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer -; SSE-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 -; SSE-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> [[TMP2]], i8 [[B:%.*]], i32 1 -; SSE-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; SSE-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 +; SSE-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> [[TMP1]], i8 [[B:%.*]], i32 1 +; SSE-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP2]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; SSE-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 +; SSE-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> zeroinitializer ; SSE-NEXT: [[TMP4:%.*]] = xor <16 x i8> [[SHUFFLE]], [[SHUFFLE1]] ; SSE-NEXT: store <16 x i8> [[TMP4]], <16 x i8>* bitcast ([32 x i8]* @cle to <16 x i8>*), align 16 ; SSE-NEXT: ret 
void ; ; AVX-LABEL: @splat( -; AVX-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 -; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer -; AVX-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 -; AVX-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> [[TMP2]], i8 [[B:%.*]], i32 1 -; AVX-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; AVX-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 +; AVX-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> [[TMP1]], i8 [[B:%.*]], i32 1 +; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP2]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; AVX-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 +; AVX-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> zeroinitializer ; AVX-NEXT: [[TMP4:%.*]] = xor <16 x i8> [[SHUFFLE]], [[SHUFFLE1]] ; AVX-NEXT: store <16 x i8> [[TMP4]], <16 x i8>* bitcast ([32 x i8]* @cle to <16 x i8>*), align 16 ; AVX-NEXT: ret void diff --git a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll index 6be7dda2375d..098b83bb0259 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll @@ -34,9 +34,9 @@ define void @exceed(double %0, double %1) { ; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP5]] ; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]], [[TMP11]] ; CHECK-NEXT: [[IXX101:%.*]] = fsub double undef, undef -; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> <double 
poison, double undef>, double [[TMP7]], i32 0 -; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> <double undef, double poison>, double [[TMP1]], i32 1 -; CHECK-NEXT: [[TMP15:%.*]] = fmul fast <2 x double> [[TMP13]], [[TMP14]] +; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> poison, double [[TMP1]], i32 1 +; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> [[TMP13]], double [[TMP7]], i32 0 +; CHECK-NEXT: [[TMP15:%.*]] = fmul fast <2 x double> [[TMP14]], undef ; CHECK-NEXT: switch i32 undef, label [[BB1:%.*]] [ ; CHECK-NEXT: i32 0, label [[BB2:%.*]] ; CHECK-NEXT: ] diff --git a/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll b/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll index c8beac34fc90..9c8fbf8a2ed9 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll @@ -30,15 +30,19 @@ define void @main() #0 { ; CHECK-NEXT: br i1 undef, label [[COND_TRUE63_US:%.*]], label [[COND_FALSE66_US:%.*]] ; CHECK: cond.false66.us: ; CHECK-NEXT: [[ADD_I276_US:%.*]] = fadd double 0.000000e+00, undef -; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> <double poison, double undef>, double [[ADD_I276_US]], i32 0 -; CHECK-NEXT: [[TMP1:%.*]] = fadd <2 x double> [[TMP0]], <double 0.000000e+00, double 0xBFA5CC2D1960285F> +; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> <double poison, double 0xBFA5CC2D1960285F>, double [[ADD_I276_US]], i32 0 +; CHECK-NEXT: [[TMP1:%.*]] = fadd <2 x double> <double 0.000000e+00, double undef>, [[TMP0]] ; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> [[TMP1]], <double 1.400000e+02, double 1.400000e+02> ; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[TMP2]], <double 5.000000e+01, double 5.200000e+01> -; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> undef, [[TMP1]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[AGG_TMP99208_SROA_0_0_IDX]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP5]], align 8 -; 
CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[AGG_TMP101211_SROA_0_0_IDX]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP6]], align 8 +; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[TMP1]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP1]], i32 1 +; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> <double poison, double undef>, double [[TMP4]], i32 0 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> <double undef, double poison>, double [[TMP5]], i32 1 +; CHECK-NEXT: [[TMP8:%.*]] = fmul <2 x double> [[TMP6]], [[TMP7]] +; CHECK-NEXT: [[TMP9:%.*]] = bitcast double* [[AGG_TMP99208_SROA_0_0_IDX]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP9]], align 8 +; CHECK-NEXT: [[TMP10:%.*]] = bitcast double* [[AGG_TMP101211_SROA_0_0_IDX]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP8]], <2 x double>* [[TMP10]], align 8 ; CHECK-NEXT: unreachable ; CHECK: cond.true63.us: ; CHECK-NEXT: unreachable diff --git a/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll b/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll index 0a0c0e6763fd..1fff6841a538 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll @@ -85,7 +85,7 @@ define float @f_used_twice_in_tree(<2 x float> %x) { ; THRESH1-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[X:%.*]], i32 1 ; THRESH1-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0 ; THRESH1-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1 -; THRESH1-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[X]], [[TMP3]] +; THRESH1-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[TMP3]], [[X]] ; THRESH1-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP4]], i32 0 ; THRESH1-NEXT: [[TMP6:%.*]] = extractelement <2 x float> [[TMP4]], i32 1 ; THRESH1-NEXT: [[ADD:%.*]] = fadd float [[TMP5]], [[TMP6]] @@ -95,7 +95,7 @@ 
define float @f_used_twice_in_tree(<2 x float> %x) { ; THRESH2-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[X:%.*]], i32 1 ; THRESH2-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0 ; THRESH2-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1 -; THRESH2-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[X]], [[TMP3]] +; THRESH2-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[TMP3]], [[X]] ; THRESH2-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP4]], i32 0 ; THRESH2-NEXT: [[TMP6:%.*]] = extractelement <2 x float> [[TMP4]], i32 1 ; THRESH2-NEXT: [[ADD:%.*]] = fadd float [[TMP5]], [[TMP6]] diff --git a/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll b/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll index 2c983a353623..7d43465eecf8 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll @@ -11,25 +11,23 @@ define { <2 x float>, <2 x float> } @foo(%struct.sw* %v) { ; CHECK-NEXT: [[Y:%.*]] = getelementptr inbounds [[STRUCT_SW]], %struct.sw* [[V]], i64 0, i32 1 ; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[X]] to <2 x float>* ; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 16 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 0, i32 1> ; CHECK-NEXT: [[TMP3:%.*]] = load float, float* undef, align 4 -; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> <float poison, float undef, float poison, float poison>, float [[TMP0]], i32 0 -; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x float> [[TMP4]], <4 x float> [[TMP5]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> -; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 undef, i32 undef> -; CHECK-NEXT: 
[[TMP8:%.*]] = shufflevector <4 x float> poison, <4 x float> [[TMP7]], <4 x i32> <i32 4, i32 5, i32 2, i32 3> -; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x float> [[TMP8]], float [[TMP3]], i32 2 -; CHECK-NEXT: [[TMP10:%.*]] = fmul <4 x float> [[TMP6]], [[TMP9]] -; CHECK-NEXT: [[TMP11:%.*]] = fadd <4 x float> poison, [[TMP10]] -; CHECK-NEXT: [[TMP12:%.*]] = fadd <4 x float> [[TMP11]], poison -; CHECK-NEXT: [[TMP13:%.*]] = fadd <4 x float> [[TMP12]], poison -; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x float> [[TMP13]], i32 0 -; CHECK-NEXT: [[VEC1:%.*]] = insertelement <2 x float> undef, float [[TMP14]], i32 0 -; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x float> [[TMP13]], i32 1 -; CHECK-NEXT: [[VEC2:%.*]] = insertelement <2 x float> [[VEC1]], float [[TMP15]], i32 1 -; CHECK-NEXT: [[TMP16:%.*]] = extractelement <4 x float> [[TMP13]], i32 2 -; CHECK-NEXT: [[VEC3:%.*]] = insertelement <2 x float> undef, float [[TMP16]], i32 0 -; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x float> [[TMP13]], i32 3 -; CHECK-NEXT: [[VEC4:%.*]] = insertelement <2 x float> [[VEC3]], float [[TMP17]], i32 1 +; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> poison, float [[TMP0]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x float> [[TMP4]], float [[TMP3]], i32 1 +; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float> [[TMP5]], <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 1, i32 undef> +; CHECK-NEXT: [[TMP6:%.*]] = fmul <4 x float> [[SHUFFLE]], [[SHUFFLE1]] +; CHECK-NEXT: [[TMP7:%.*]] = fadd <4 x float> poison, [[TMP6]] +; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], poison +; CHECK-NEXT: [[TMP9:%.*]] = fadd <4 x float> [[TMP8]], poison +; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x float> [[TMP9]], i32 0 +; CHECK-NEXT: [[VEC1:%.*]] = insertelement <2 x float> undef, float [[TMP10]], i32 0 +; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP9]], i32 1 +; CHECK-NEXT: [[VEC2:%.*]] = insertelement <2 x float> [[VEC1]], float 
[[TMP11]], i32 1 +; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP9]], i32 2 +; CHECK-NEXT: [[VEC3:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0 +; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP9]], i32 3 +; CHECK-NEXT: [[VEC4:%.*]] = insertelement <2 x float> [[VEC3]], float [[TMP13]], i32 1 ; CHECK-NEXT: [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[VEC2]], 0 ; CHECK-NEXT: [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[VEC4]], 1 ; CHECK-NEXT: ret { <2 x float>, <2 x float> } [[INS2]] diff --git a/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll b/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll index 96502d44acee..ba3bd26d3861 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll @@ -37,7 +37,7 @@ define void @lookahead_basic(double* %array) { ; CHECK-NEXT: [[TMP7:%.*]] = load <2 x double>, <2 x double>* [[TMP6]], align 8 ; CHECK-NEXT: [[TMP8:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP5]], [[TMP7]] -; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP8]], [[TMP9]] +; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP9]], [[TMP8]] ; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDX0]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8 ; CHECK-NEXT: ret void @@ -175,7 +175,7 @@ define void @lookahead_alt2(double* %array) { ; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[TMP12:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <2 x double> [[TMP11]], <2 x double> [[TMP12]], <2 x i32> <i32 0, i32 3> -; CHECK-NEXT: [[TMP14:%.*]] = fadd fast <2 x double> [[TMP13]], [[TMP10]] +; CHECK-NEXT: [[TMP14:%.*]] = fadd fast <2 x double> [[TMP10]], [[TMP13]] ; CHECK-NEXT: [[TMP15:%.*]] = bitcast 
double* [[IDX0]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP14]], <2 x double>* [[TMP15]], align 8 ; CHECK-NEXT: ret void @@ -237,28 +237,29 @@ define void @lookahead_external_uses(double* %A, double *%B, double *%C, double ; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2 ; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2 ; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1 -; CHECK-NEXT: [[A0:%.*]] = load double, double* [[IDXA0]], align 8 +; CHECK-NEXT: [[B0:%.*]] = load double, double* [[IDXB0]], align 8 ; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8 ; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8 -; CHECK-NEXT: [[A1:%.*]] = load double, double* [[IDXA1]], align 8 +; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXA0]] to <2 x double>* +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8 ; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8 ; CHECK-NEXT: [[A2:%.*]] = load double, double* [[IDXA2]], align 8 -; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXB0]] to <2 x double>* -; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8 -; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[C0]], i32 0 -; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[A1]], i32 1 -; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[D0]], i32 0 -; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[B2]], i32 1 -; CHECK-NEXT: [[TMP6:%.*]] = fsub fast <2 x double> [[TMP3]], [[TMP5]] -; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> poison, double [[A0]], i32 0 -; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[A2]], i32 1 -; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP8]], [[TMP1]] -; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP9]], 
[[TMP6]] +; CHECK-NEXT: [[B1:%.*]] = load double, double* [[IDXB1]], align 8 +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[B0]], i32 0 +; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B2]], i32 1 +; CHECK-NEXT: [[TMP4:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]] +; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[C0]], i32 0 +; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[A2]], i32 1 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> poison, double [[D0]], i32 0 +; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[B1]], i32 1 +; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP6]], [[TMP8]] +; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP4]], [[TMP9]] ; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0 ; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1 ; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDXS0]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8 -; CHECK-NEXT: store double [[A1]], double* [[EXT1:%.*]], align 8 +; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[TMP1]], i32 1 +; CHECK-NEXT: store double [[TMP12]], double* [[EXT1:%.*]], align 8 ; CHECK-NEXT: ret void ; entry: @@ -607,7 +608,7 @@ define void @ChecksExtractScores_different_vectors(double* %storeArray, double* ; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> poison, double [[EXTRA0]], i32 0 ; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[TMP6]], double [[EXTRB1]], i32 1 ; CHECK-NEXT: [[TMP8:%.*]] = fmul <2 x double> [[TMP7]], [[TMP2]] -; CHECK-NEXT: [[TMP9:%.*]] = fadd <2 x double> [[TMP8]], [[SHUFFLE]] +; CHECK-NEXT: [[TMP9:%.*]] = fadd <2 x double> [[SHUFFLE]], [[TMP8]] ; CHECK-NEXT: [[SIDX0:%.*]] = getelementptr inbounds double, double* [[STOREARRAY:%.*]], i64 0 ; CHECK-NEXT: [[SIDX1:%.*]] = 
getelementptr inbounds double, double* [[STOREARRAY]], i64 1 ; CHECK-NEXT: [[TMP10:%.*]] = bitcast double* [[SIDX0]] to <2 x double>* diff --git a/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll b/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll index a0554d7c5a81..125cd23d0140 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll @@ -142,16 +142,13 @@ define void @shuffle_nodes_match1(double * noalias %from, double * noalias %to, ; CHECK-NEXT: br label [[LP:%.*]] ; CHECK: lp: ; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ] -; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1 -; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4 -; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4 -; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0 -; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1 -; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0 -; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP1]], [[TMP3]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4 +; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>* +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1 +; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[TMP2]], [[SHUFFLE]] +; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* +; CHECK-NEXT: 
store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4 ; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]] ; CHECK: ext: ; CHECK-NEXT: ret void @@ -183,11 +180,11 @@ define void @vecload_vs_broadcast4(double * noalias %from, double * noalias %to, ; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ] ; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>* ; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> ; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1 -; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> -; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP2]], [[TMP3]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4 +; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[TMP2]], [[SHUFFLE]] +; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4 ; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]] ; CHECK: ext: ; CHECK-NEXT: ret void @@ -218,16 +215,13 @@ define void @shuffle_nodes_match2(double * noalias %from, double * noalias %to, ; CHECK-NEXT: br label [[LP:%.*]] ; CHECK: lp: ; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ] -; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1 -; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4 -; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4 -; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0 -; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x double> [[TMP0]], <2 x double> poison, 
<2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0 -; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[P]], i64 1 -; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP1]], [[TMP3]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4 +; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>* +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1 +; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[SHUFFLE]], [[TMP2]] +; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4 ; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]] ; CHECK: ext: ; CHECK-NEXT: ret void @@ -348,7 +342,7 @@ define void @load_reorder_double(double* nocapture %c, double* noalias nocapture ; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, <2 x double>* [[TMP1]], align 4 ; CHECK-NEXT: [[TMP3:%.*]] = bitcast double* [[A:%.*]] to <2 x double>* ; CHECK-NEXT: [[TMP4:%.*]] = load <2 x double>, <2 x double>* [[TMP3]], align 4 -; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[TMP4]], [[TMP2]] +; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[TMP2]], [[TMP4]] ; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[C:%.*]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP5]], <2 x double>* [[TMP6]], align 4 ; CHECK-NEXT: ret void diff --git a/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll b/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll index ced403ae5375..19f654e5a4f8 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll +++ 
b/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll @@ -22,9 +22,9 @@ define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn ; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1 ; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2 ; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3 -; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2> +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2> ; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>* -; CHECK-NEXT: store <4 x i32> [[REORDER_SHUFFLE]], <4 x i32>* [[TMP6]], align 4 +; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* [[TMP6]], align 4 ; CHECK-NEXT: ret i32 undef ; %in.addr = getelementptr inbounds i32, i32* %in, i64 0 diff --git a/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll b/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll index 9983578a7058..65d1fce9e130 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll @@ -97,9 +97,9 @@ define void @store_reverse(i64* %p3) { ; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i64>, <4 x i64>* [[TMP2]], align 8 ; CHECK-NEXT: [[TMP4:%.*]] = shl <4 x i64> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i64, i64* [[P3]], i64 4 -; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0> -; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64* [[ARRAYIDX14]] to <4 x i64>* -; CHECK-NEXT: store <4 x i64> [[TMP5]], <4 x i64>* [[TMP6]], align 8 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0> +; CHECK-NEXT: [[TMP5:%.*]] = bitcast i64* [[ARRAYIDX14]] to <4 x i64>* +; 
CHECK-NEXT: store <4 x i64> [[SHUFFLE]], <4 x i64>* [[TMP5]], align 8 ; CHECK-NEXT: ret void ; entry: diff --git a/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll b/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll index bf98a148e9dc..f1ff95b51f8b 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll @@ -23,7 +23,7 @@ define void @test_supernode_add(double* %Aarray, double* %Barray, double *%Carra ; ENABLED-NEXT: [[C1:%.*]] = load double, double* [[IDXC1]], align 8 ; ENABLED-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[A0]], i32 0 ; ENABLED-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[C1]], i32 1 -; ENABLED-NEXT: [[TMP4:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP1]] +; ENABLED-NEXT: [[TMP4:%.*]] = fadd fast <2 x double> [[TMP1]], [[TMP3]] ; ENABLED-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[C0]], i32 0 ; ENABLED-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[A1]], i32 1 ; ENABLED-NEXT: [[TMP7:%.*]] = fadd fast <2 x double> [[TMP4]], [[TMP6]] </cut>

4 years, 4 months


[ACTIVITY] week ending Dec. 12 2021

by Alex Bennée

Project Stratos
===============

- posted Potential demo setup for a TSN/XDP networking
  Message-Id: <87wnkfkp2f.fsf(a)linaro.org>
- final Stratos call of the year
- CC and Arnd will look at fat virtq
- nice update from EPAM on Zephyr
- had another round of getting working ACPI on MachiatoBin
- posted [PR to clean up some typos in EDK2]
- might have a working Xen setup without needing SMC hacks

[PR to clean up some typos in EDK2] <https://github.com/tianocore/edk2-platforms/pull/34>

vhost-device maintainer effort ([UM-196])

- finished review of https://github.com/rust-vmm/vhost-device/pull/4

[UM-196] <https://linaro.atlassian.net/browse/UM-196>

QEMU Upstream Work ([UM-2])
===========================

- discussion around Suggestions for TCG performance improvements
  Message-Id: <c76bde31-8f3b-2d03-b7c7-9e026d4b5873(a)huawei.com>
- did a bunch of bug triage and tagging

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Upstream MTTCG tests ([QEMU-52])

- awaiting final review of [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
  Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>

Completed Reviews [3/3]
=======================

[PATCH] tests/plugin/syscall.c: fix compiler warnings
  Message-Id: <20211128011551.2115468-1-juro.bystricky(a)intel.com>

[PATCH for-6.2? 0/2] arm_gicv3: Fix handling of LPIs in list registers
  Message-Id: <20211126163915.1048353-2-peter.maydell(a)linaro.org>

[PATCH] tests/docker: add libfuse3 development headers
  Message-Id: <20211207160025.52466-1-stefanha(a)redhat.com>

Absences
========

Current Review Queue
====================

TODO [PATCH 0/8] virtio: Add vhost-user based Video decode Message-Id: <20211209145601.331477-1-peter.griffin(a)linaro.org>
========================================================================================================================

TODO [PATCH for-7.0 0/6] target/arm: Implement LVA, LPA, LPA2 features Message-Id: <20211208231154.392029-1-richard.henderson(a)linaro.org>
========================================================================================================================================

TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64 Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
===============================================================================================================

TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================

--
Alex Bennée

4 years, 5 months


[ACTIVITY] report week ending 10 Dec

by Peter Maydell

Progress:
 * UM-2 [QEMU upstream maintainership]
  - More code review: now have a target-arm.next poised and ready to send once 6.2 is released
 * QEMU-420 [GICv4 emulation]
  - Working on the ITS changes needed for GICv4 support (this turns out to be a more tractable end to start than the redistributor)
  - I have a preliminary set of 25 or so patches to the ITS which clean up the code and fix some pre-existing bugs that I found while working on the GICv4 changes
  - have implemented the new VMAPI, VMAPTI, VMAPP ITS commands

-- PMM

4 years, 5 months


[TCWG CI] 464.h264ref slowed down by 7% after llvm: [LV] Pass compare predicate to getCmpSelInstrCost.

by ci_notify＠linaro.org

After llvm commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea
Author: Sander de Smalen <sander.desmalen(a)arm.com>

    [LV] Pass compare predicate to getCmpSelInstrCost.

the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 7% from 11115 to 11846 perf samples

The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea
cd investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach 3d549dddf75b6ff9e0ec8c053677750bde4226ea
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach ab31d003e16e483bff298ea2f28fec0f23e8eb79
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea
Author: Sander de Smalen <sander.desmalen(a)arm.com>
Date:   Mon Dec 6 11:14:27 2021 +0000

    [LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to TTI::getCmpSelInstrCost to get a more accurate cost value instead of passing BAD_ICMP_PREDICATE. I noticed that the commit message from D90070 had a comment about the vectorized select predicate possibly being composed of other compares with different predicate values, but I wasn't able to construct an example where this was an actual issue. If this is an issue, I guess we could add another check that the block isn't predicated for any reason. Reviewed By: dmgreen, fhahn Differential Revision: https://reviews.llvm.org/D114646 --- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 11 ++++++++--- llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll | 14 +++++++------- 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index 050879144afd..c03e506b7474 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -7570,8 +7570,12 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, Type *CondTy = SI->getCondition()->getType(); if (!ScalarCond) CondTy = VectorType::get(CondTy, VF); - return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, - CmpInst::BAD_ICMP_PREDICATE, CostKind, I); + + CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE; + if (auto *Cmp = dyn_cast<CmpInst>(SI->getCondition())) + Pred = Cmp->getPredicate(); + return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, Pred, + CostKind, I); } case Instruction::ICmp: case Instruction::FCmp: { @@ -7581,7 +7585,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]); VectorTy = ToVectorTy(ValTy, VF); return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, nullptr, - CmpInst::BAD_ICMP_PREDICATE, CostKind, I); + 
cast<CmpInst>(I)->getPredicate(), CostKind, + I); } case Instruction::Store: case Instruction::Load: { diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll index 62b18f44fbc5..20d2dc0b7cda 100644 --- a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll +++ b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll @@ -5,17 +5,17 @@ target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128" target triple = "arm64-apple-ios5.0.0" define void @selects_1(i32* nocapture %dst, i32 %A, i32 %B, i32 %C, i32 %N) { -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 ; CHECK-LABEL: define void 
@selects_1( ; CHECK: vector.body: -; CHECK: select <2 x i1> +; CHECK: select <4 x i1> entry: %cmp26 = icmp sgt i32 %N, 0 </cut>
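The decision the patch adds for selects can be illustrated outside LLVM. The following is only a toy sketch: `Predicate`, `Compare`, `Select` and `predicateForCostQuery` are hypothetical stand-ins invented for this illustration and are not LLVM's classes or API. It shows the fallback pattern described in the commit message: start from the "bad" placeholder and replace it only when the select's condition really is a compare.

```cpp
#include <cassert>
#include <optional>

// Toy stand-ins for illustration only; these are NOT LLVM's types.
enum class Predicate { Bad, EQ, NE, SGT, SLT };

struct Compare {
  Predicate Pred;
};

struct Select {
  // Holds a value when the select's condition is produced by a compare;
  // empty when the condition comes from anything else (e.g. an argument).
  std::optional<Compare> Condition;
};

// Mirrors the shape of the change: default to the placeholder predicate and
// overwrite it only if the condition is a compare.
Predicate predicateForCostQuery(const Select &SI) {
  Predicate Pred = Predicate::Bad;
  if (SI.Condition) {
    // In this toy model a real compare never carries the placeholder.
    assert(SI.Condition->Pred != Predicate::Bad &&
           "a compare must carry a concrete predicate");
    Pred = SI.Condition->Pred;
  }
  return Pred;
}
```

With this shape, a cost query that previously always received the placeholder now sees the concrete predicate whenever one is available, and behaves as before otherwise.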

4 years, 5 months


clang-thumbv7-full-2stage is red for 20 days

by Galina Kistanova

Dear Linaro Toolchain Working Group, clang-thumbv7-full-2stage has been red for 20 days. Could you take it to the staging area and make it green again, please? Thanks, Galina

4 years, 5 months


[TCWG CI] 433.milc slowed down by 4% after llvm: Add missing header

by ci_notify＠linaro.org

After llvm commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d
Author: David Blaikie <dblaikie(a)gmail.com>

    Add missing header

the following benchmarks slowed down by more than 2%:
- 433.milc slowed down by 4% from 12427 to 12916 perf samples

The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d
cd investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach bd4c6a476fd037fb07a1c484f75d93ee40713d3d
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 7d4da4e1ab7f79e51db0d5c2a0f5ef1711122dd7
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d
Author: David Blaikie <dblaikie(a)gmail.com>
Date:   Mon Nov 29 16:29:25 2021 -0800

    Add missing header
---
 llvm/lib/Demangle/DLangDemangle.cpp | 1 +
 1 file changed, 1 insertion(+)

diff --git a/llvm/lib/Demangle/DLangDemangle.cpp b/llvm/lib/Demangle/DLangDemangle.cpp
index faf91b239490..f380aa90035e 100644
--- a/llvm/lib/Demangle/DLangDemangle.cpp
+++ b/llvm/lib/Demangle/DLangDemangle.cpp
@@ -17,6 +17,7 @@
 #include "llvm/Demangle/StringView.h"
 #include "llvm/Demangle/Utility.h"

+#include <cctype>
 #include <cstring>
 #include <limits>
</cut>
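The percentages in these reports can be checked against the quoted perf-sample counts. The sketch below is a minimal illustration, not the CI's actual code: the function names are hypothetical, and rounding to the nearest whole percent is an assumption that happens to match the figures quoted in the reports (12427 to 12916 samples gives about 3.9%, reported as 4%; 11115 to 11846 gives about 6.6%, reported as 7%).

```cpp
#include <cassert>
#include <cmath>

// Relative slowdown, in percent, between two perf-sample counts.
double slowdownPercent(long before, long after) {
  assert(before > 0 && "baseline sample count must be positive");
  return 100.0 * static_cast<double>(after - before) /
         static_cast<double>(before);
}

// Assumed reporting rule: round to the nearest whole percent.
long reportedPercent(long before, long after) {
  return std::lround(slowdownPercent(before, after));
}

// The reports flag benchmarks that "slowed down by more than 2%"; the
// default threshold below is taken from that wording.
bool exceedsThreshold(long before, long after, double thresholdPercent = 2.0) {
  return slowdownPercent(before, after) > thresholdPercent;
}
```

For example, `reportedPercent(12427, 12916)` evaluates to 4 and `exceedsThreshold(12427, 12916)` is true, which is consistent with the 433.milc entry being listed.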

4 years, 5 months


[TCWG CI] 453.povray failed to build after llvm: [SLP]Fix reused extracts cost.

by ci_notify＠linaro.org

After llvm commit ba74bb3a226e1b4660537f274627285b1bf41ee1 Author: Alexey Bataev <a.bataev(a)outlook.com> [SLP]Fix reused extracts cost. the following benchmarks slowed down by more than 2%: - 453.povray failed to build Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O3 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. 
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1
cd investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach ba74bb3a226e1b4660537f274627285b1bf41ee1
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 78cc133c63173a4b5b7a43750cc507d4cff683cf
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit ba74bb3a226e1b4660537f274627285b1bf41ee1
Author: Alexey Bataev <a.bataev(a)outlook.com>
Date:   Thu Dec 2 04:22:55 2021 -0800

    [SLP]Fix reused extracts cost.
If the extractelement instruction is used multiple times in the different tree entries (either vectorized, or gathered), need to compensate the scalar cost of such instructions. They are completely removed if all users are part of the tree but we need to compensate the cost only once for each instruction. Differential Revision: https://reviews.llvm.org/D114958 --- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 29 +++++++++++++--------- .../X86/extractelement-multiple-uses.ll | 23 +++++++++-------- 2 files changed, 29 insertions(+), 23 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp index 95061e9053fa..335ad6c85387 100644 --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp @@ -4287,8 +4287,8 @@ bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue, bool BoUpSLP::areAllUsersVectorized(Instruction *I, ArrayRef<Value *> VectorizedVals) const { return (I->hasOneUse() && is_contained(VectorizedVals, I)) || - llvm::all_of(I->users(), [this](User *U) { - return ScalarToTreeEntry.count(U) > 0; + all_of(I->users(), [this](User *U) { + return ScalarToTreeEntry.count(U) > 0 || MustGather.contains(U); }); } @@ -4442,9 +4442,9 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // FIXME: it tries to fix a problem with MSVC buildbots. TargetTransformInfo &TTIRef = *TTI; auto &&AdjustExtractsCost = [this, &TTIRef, CostKind, VL, VecTy, - VectorizedVals](InstructionCost &Cost, - bool IsGather) { + VectorizedVals, E](InstructionCost &Cost) { DenseMap<Value *, int> ExtractVectorsTys; + SmallPtrSet<Value *, 4> CheckedExtracts; for (auto *V : VL) { if (isa<UndefValue>(V)) continue; @@ -4452,7 +4452,12 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // instruction itself is not going to be vectorized, consider this // instruction as dead and remove its cost from the final cost of the // vectorized tree. 
- if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals)) + // Also, avoid adjusting the cost for extractelements with multiple uses + // in different graph entries. + const TreeEntry *VE = getTreeEntry(V); + if (!CheckedExtracts.insert(V).second || + !areAllUsersVectorized(cast<Instruction>(V), VectorizedVals) || + (VE && VE != E)) continue; auto *EE = cast<ExtractElementInst>(V); Optional<unsigned> EEIdx = getExtractIndex(EE); @@ -4549,11 +4554,6 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, } return GatherCost; } - if (isSplat(VL)) { - // Found the broadcasting of the single scalar, calculate the cost as the - // broadcast. - return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); - } if ((E->getOpcode() == Instruction::ExtractElement || all_of(E->Scalars, [](Value *V) { @@ -4571,13 +4571,18 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // single input vector or of 2 input vectors. InstructionCost Cost = computeExtractCost(VL, VecTy, *ShuffleKind, Mask, *TTI); - AdjustExtractsCost(Cost, /*IsGather=*/true); + AdjustExtractsCost(Cost); if (NeedToShuffleReuses) Cost += TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices); return Cost; } } + if (isSplat(VL)) { + // Found the broadcasting of the single scalar, calculate the cost as the + // broadcast. 
+ return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); + } InstructionCost ReuseShuffleCost = 0; if (NeedToShuffleReuses) ReuseShuffleCost = TTI->getShuffleCost( @@ -4755,7 +4760,7 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I); } } else { - AdjustExtractsCost(CommonCost, /*IsGather=*/false); + AdjustExtractsCost(CommonCost); } return CommonCost; } diff --git a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll index c47f255f0bfe..31696752bbb3 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll @@ -2,24 +2,25 @@ ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -march=core-avx2 -pass-remarks-output=%t | FileCheck %s ; RUN: FileCheck %s --input-file=%t --check-prefix=YAML -; YAML: --- !Missed +; YAML: --- !Passed ; YAML: Pass: slp-vectorizer -; YAML: Name: NotBeneficial +; YAML: Name: VectorizedList ; YAML: Function: multi_uses ; YAML: Args: -; YAML: - String: 'List vectorization was possible but not beneficial with cost ' -; YAML: - Cost: '0' -; YAML: - String: ' >= ' -; YAML: - Treshold: '0' +; YAML: - String: 'SLP vectorized with cost ' +; YAML: - Cost: '-1' +; YAML: - String: ' and with tree size ' +; YAML: - TreeSize: '3' define float @multi_uses(<2 x float> %x, <2 x float> %y) { ; CHECK-LABEL: @multi_uses( -; CHECK-NEXT: [[X0:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0 -; CHECK-NEXT: [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1 ; CHECK-NEXT: [[Y1:%.*]] = extractelement <2 x float> [[Y:%.*]], i32 1 -; CHECK-NEXT: [[X0X0:%.*]] = fmul float [[X0]], [[Y1]] -; CHECK-NEXT: [[X1X1:%.*]] = fmul float [[X1]], [[Y1]] -; CHECK-NEXT: [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]] +; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x float> poison, float 
[[Y1]], i32 0 +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> [[TMP1]], float [[Y1]], i32 1 +; CHECK-NEXT: [[TMP3:%.*]] = fmul <2 x float> [[X:%.*]], [[TMP2]] +; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1 +; CHECK-NEXT: [[ADD:%.*]] = fadd float [[TMP4]], [[TMP5]] ; CHECK-NEXT: ret float [[ADD]] ; %x0 = extractelement <2 x float> %x, i32 0 </cut>
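
The commit message above says the scalar cost of a reused extractelement should be compensated only once per instruction. The Python sketch below illustrates just that de-duplication idea; it is not the LLVM implementation. The function name `adjust_extracts_cost`, the `scalar_cost` mapping, and the `all_users_vectorized` predicate are hypothetical stand-ins, and the extra tree-entry check from the patch (`VE && VE != E`) is left out for brevity.

```python
def adjust_extracts_cost(cost, values, scalar_cost, all_users_vectorized):
    """Subtract the scalar cost of each dead extract at most once.

    cost: starting cost of the tree entry
    values: scalars of the entry; the same extract may appear repeatedly
    scalar_cost: hypothetical mapping from a value to its scalar cost
    all_users_vectorized: hypothetical predicate, True if the value becomes dead
    """
    checked = set()  # plays the role of CheckedExtracts in the patch
    for v in values:
        if v in checked:
            continue  # already handled; do not subtract its cost twice
        checked.add(v)
        if not all_users_vectorized(v):
            continue  # still has scalar users, so its cost stays
        cost -= scalar_cost[v]
    return cost


# "x0" appears twice but is compensated once: 4 - 1 - 1
print(adjust_extracts_cost(4, ["x0", "x0", "x1"], {"x0": 1, "x1": 1}, lambda v: True))  # prints 2
```

Recording the value in the set before testing its users mirrors the order of the conditions in the patch, where the set insertion is evaluated first.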

4 years, 5 months


[TCWG CI] 464.h264ref slowed down by 3% after llvm: [SLP]Improve isFixedVectorShuffle and its use.

by ci_notify＠linaro.org

After llvm commit dce6c434ead3ccbaa67b8db2301b2a9fb4319123
Author: Alexey Bataev <a.bataev(a)outlook.com>

    [SLP]Improve isFixedVectorShuffle and its use.

the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 3% from 10824 to 11101 perf samples

The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVIDIA TX1 4x Cortex-A57

This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. We plan to add support for SPEC CPU2017 benchmarks and to provide the "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.

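
As a quick check of the figures reported above, the relative slowdown can be recomputed from the two perf sample counts. This is a minimal sketch: the helper name `slowdown_percent` is hypothetical, and rounding to the nearest whole percent is an assumption about how the report arrives at "3%", not something the notification states.

```python
def slowdown_percent(before, after):
    """Relative increase in perf samples, in percent."""
    return (after - before) / before * 100.0


THRESHOLD = 2.0  # percent, from "slowed down by more than 2%"

pct = slowdown_percent(10824, 11101)
# Assumption: the report rounds to the nearest whole percent.
print(f"{pct:.2f}% (rounds to {round(pct)}%), over threshold: {pct > THRESHOLD}")
```

Under that assumption the raw increase is about 2.56%, which is above the 2% reporting threshold and rounds to the 3% shown in the subject line.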
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-dce6c434ead3ccbaa67b8db2301b2a9fb4319123
cd investigate-llvm-dce6c434ead3ccbaa67b8db2301b2a9fb4319123

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach dce6c434ead3ccbaa67b8db2301b2a9fb4319123
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 7a7c059d867554e116244ad5639d05d75ed1a7cd
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit dce6c434ead3ccbaa67b8db2301b2a9fb4319123
Author: Alexey Bataev <a.bataev(a)outlook.com>
Date: Wed Nov 17 11:14:38 2021 -0800

    [SLP]Improve isFixedVectorShuffle and its use.

Extended support for undefined source vector/extract indices/non-fixed vector types, also no need to check for the parent of the extractelement instructions with the constant indicies. Differential Revision: https://reviews.llvm.org/D114121 --- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 67 +++++++++++++++------- .../X86/alternate-int-inseltpoison.ll | 24 ++++---- .../Transforms/SLPVectorizer/X86/alternate-int.ll | 24 ++++---- 3 files changed, 66 insertions(+), 49 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp index e3d3d8992c23..4db630fbd063 100644 --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp @@ -327,7 +327,11 @@ static bool isCommutative(Instruction *I) { /// TargetTransformInfo::getInstructionThroughput? static Optional<TargetTransformInfo::ShuffleKind> isFixedVectorShuffle(ArrayRef<Value *> VL, SmallVectorImpl<int> &Mask) { - auto *EI0 = cast<ExtractElementInst>(VL[0]); + const auto *It = + find_if(VL, [](Value *V) { return isa<ExtractElementInst>(V); }); + if (It == VL.end()) + return None; + auto *EI0 = cast<ExtractElementInst>(*It); if (isa<ScalableVectorType>(EI0->getVectorOperandType())) return None; unsigned Size = @@ -336,33 +340,41 @@ isFixedVectorShuffle(ArrayRef<Value *> VL, SmallVectorImpl<int> &Mask) { Value *Vec2 = nullptr; enum ShuffleMode { Unknown, Select, Permute }; ShuffleMode CommonShuffleMode = Unknown; + Mask.assign(VL.size(), UndefMaskElem); for (unsigned I = 0, E = VL.size(); I < E; ++I) { + // Undef can be represented as an undef element in a vector. + if (isa<UndefValue>(VL[I])) + continue; auto *EI = cast<ExtractElementInst>(VL[I]); + if (isa<ScalableVectorType>(EI->getVectorOperandType())) + return None; auto *Vec = EI->getVectorOperand(); + // We can extractelement from undef or poison vector. 
+ if (isa<UndefValue>(Vec)) + continue; // All vector operands must have the same number of vector elements. if (cast<FixedVectorType>(Vec->getType())->getNumElements() != Size) return None; + if (isa<UndefValue>(EI->getIndexOperand())) + continue; auto *Idx = dyn_cast<ConstantInt>(EI->getIndexOperand()); if (!Idx) return None; // Undefined behavior if Idx is negative or >= Size. - if (Idx->getValue().uge(Size)) { - Mask.push_back(UndefMaskElem); + if (Idx->getValue().uge(Size)) continue; - } unsigned IntIdx = Idx->getValue().getZExtValue(); - Mask.push_back(IntIdx); - // We can extractelement from undef or poison vector. - if (isa<UndefValue>(Vec)) - continue; + Mask[I] = IntIdx; // For correct shuffling we have to have at most 2 different vector operands // in all extractelement instructions. - if (!Vec1 || Vec1 == Vec) + if (!Vec1 || Vec1 == Vec) { Vec1 = Vec; - else if (!Vec2 || Vec2 == Vec) + } else if (!Vec2 || Vec2 == Vec) { Vec2 = Vec; - else + Mask[I] += Size; + } else { return None; + } if (CommonShuffleMode == Permute) continue; // If the extract index is not the same as the operation number, it is a @@ -4414,15 +4426,19 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, bool IsGather) { DenseMap<Value *, int> ExtractVectorsTys; for (auto *V : VL) { + if (isa<UndefValue>(V)) + continue; // If all users of instruction are going to be vectorized and this // instruction itself is not going to be vectorized, consider this // instruction as dead and remove its cost from the final cost of the // vectorized tree. 
- if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals) || - (IsGather && ScalarToTreeEntry.count(V))) + if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals)) continue; auto *EE = cast<ExtractElementInst>(V); - unsigned Idx = *getExtractIndex(EE); + Optional<unsigned> EEIdx = getExtractIndex(EE); + if (!EEIdx) + continue; + unsigned Idx = *EEIdx; if (TTIRef.getNumberOfParts(VecTy) != TTIRef.getNumberOfParts(EE->getVectorOperandType())) { auto It = @@ -4454,6 +4470,8 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, for (const auto &Data : ExtractVectorsTys) { auto *EEVTy = cast<FixedVectorType>(Data.first->getType()); unsigned NumElts = VecTy->getNumElements(); + if (Data.second % NumElts == 0) + continue; if (TTIRef.getNumberOfParts(EEVTy) > TTIRef.getNumberOfParts(VecTy)) { unsigned Idx = (Data.second / NumElts) * NumElts; unsigned EENumElts = EEVTy->getNumElements(); @@ -4516,10 +4534,12 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // broadcast. return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); } - if (E->getOpcode() == Instruction::ExtractElement && allSameType(VL) && - allSameBlock(VL) && - !isa<ScalableVectorType>( - cast<ExtractElementInst>(E->getMainOp())->getVectorOperandType())) { + if ((E->getOpcode() == Instruction::ExtractElement || + all_of(E->Scalars, + [](Value *V) { + return isa<ExtractElementInst, UndefValue>(V); + })) && + allSameType(VL)) { // Check that gather of extractelements can be represented as just a // shuffle of a single/two vectors the scalars are extracted from. 
SmallVector<int> Mask; @@ -5111,7 +5131,11 @@ bool BoUpSLP::isFullyVectorizableTinyTree(bool ForReduction) const { [this](Value *V) { return EphValues.contains(V); }) && (allConstant(TE->Scalars) || isSplat(TE->Scalars) || TE->Scalars.size() < Limit || - (TE->getOpcode() == Instruction::ExtractElement && + ((TE->getOpcode() == Instruction::ExtractElement || + all_of(TE->Scalars, + [](Value *V) { + return isa<ExtractElementInst, UndefValue>(V); + })) && isFixedVectorShuffle(TE->Scalars, Mask)) || (TE->State == TreeEntry::NeedToGather && TE->getOpcode() == Instruction::Load && !TE->isAltShuffle())); @@ -9183,8 +9207,9 @@ bool SLPVectorizerPass::vectorizeInsertElementInst(InsertElementInst *IEI, SmallVector<Value *, 16> BuildVectorOpds; SmallVector<int> Mask; if (!findBuildAggregate(IEI, TTI, BuildVectorOpds, BuildVectorInsts) || - (llvm::all_of(BuildVectorOpds, - [](Value *V) { return isa<ExtractElementInst>(V); }) && + (llvm::all_of( + BuildVectorOpds, + [](Value *V) { return isa<ExtractElementInst, UndefValue>(V); }) && isFixedVectorShuffle(BuildVectorOpds, Mask))) return false; diff --git a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll index 8ab137cc2d7d..9c19a32b2f41 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll @@ -230,25 +230,21 @@ define <8 x i32> @ashr_shl_v8i32_const(<8 x i32> %a) { define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) { ; SSE-LABEL: @ashr_lshr_shl_v8i32( -; SSE-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 6 -; SSE-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7 -; SSE-NEXT: [[B6:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 6 -; SSE-NEXT: [[B7:%.*]] = extractelement <8 x i32> [[B]], i32 7 -; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> 
-; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7> ; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]] ; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5> -; SSE-NEXT: [[AB6:%.*]] = shl i32 [[A6]], [[B6]] -; SSE-NEXT: [[AB7:%.*]] = shl i32 [[A7]], [[B7]] -; SSE-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[TMP9:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[R51:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> -; SSE-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R51]], i32 [[AB6]], i32 6 -; SSE-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7 -; SSE-NEXT: ret <8 x i32> [[R7]] +; SSE-NEXT: [[TMP8:%.*]] = shl <8 x i32> [[A]], [[B]] +; SSE-NEXT: [[TMP9:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> poison, <2 x i32> <i32 6, i32 7> +; SSE-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 
undef, i32 undef, i32 undef> +; SSE-NEXT: [[R52:%.*]] = shufflevector <8 x i32> [[TMP10]], <8 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> +; SSE-NEXT: [[TMP12:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[R71:%.*]] = shufflevector <8 x i32> [[R52]], <8 x i32> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 9> +; SSE-NEXT: ret <8 x i32> [[R71]] ; ; SLM-LABEL: @ashr_lshr_shl_v8i32( ; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll index 3af16bf404a3..783b50ae4b17 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll @@ -230,25 +230,21 @@ define <8 x i32> @ashr_shl_v8i32_const(<8 x i32> %a) { define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) { ; SSE-LABEL: @ashr_lshr_shl_v8i32( -; SSE-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 6 -; SSE-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7 -; SSE-NEXT: [[B6:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 6 -; SSE-NEXT: [[B7:%.*]] = extractelement <8 x i32> [[B]], i32 7 -; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> -; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x 
i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7> ; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]] ; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5> -; SSE-NEXT: [[AB6:%.*]] = shl i32 [[A6]], [[B6]] -; SSE-NEXT: [[AB7:%.*]] = shl i32 [[A7]], [[B7]] -; SSE-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[TMP9:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[R51:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> -; SSE-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R51]], i32 [[AB6]], i32 6 -; SSE-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7 -; SSE-NEXT: ret <8 x i32> [[R7]] +; SSE-NEXT: [[TMP8:%.*]] = shl <8 x i32> [[A]], [[B]] +; SSE-NEXT: [[TMP9:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> poison, <2 x i32> <i32 6, i32 7> +; SSE-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[R52:%.*]] = shufflevector <8 x i32> [[TMP10]], <8 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> +; SSE-NEXT: [[TMP12:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[R71:%.*]] = shufflevector <8 x i32> [[R52]], <8 x i32> [[TMP12]], <8 x i32> 
<i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 9> +; SSE-NEXT: ret <8 x i32> [[R71]] ; ; SLM-LABEL: @ashr_lshr_shl_v8i32( ; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> </cut>
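
The mask handling in the `isFixedVectorShuffle` hunk above can be sketched in a few lines of Python. This is a simplified illustration, not the LLVM code: each scalar is modelled as `None` (undef) or a hypothetical `(vector_name, index)` pair, `UNDEF` stands in for `UndefMaskElem`, and the checks for scalable vectors, undef vector operands, mismatched vector sizes, non-constant indices, and the select/permute classification are all omitted.

```python
UNDEF = -1  # stand-in for LLVM's UndefMaskElem in this sketch


def build_shuffle_mask(extracts, size):
    """Build a shuffle mask over at most two source vectors.

    extracts: list of None (undef scalar) or (vector_name, index) pairs;
              index may be None to model an undef index
    size: number of elements in each source vector
    Returns the mask, or None if more than two source vectors are used.
    """
    mask = [UNDEF] * len(extracts)  # every lane starts out undefined
    vec1 = vec2 = None
    for i, e in enumerate(extracts):
        if e is None:
            continue  # undef scalar: leave the lane undefined
        vec, idx = e
        if idx is None or idx >= size:
            continue  # undef or out-of-range index: lane stays undefined
        mask[i] = idx
        if vec1 is None or vec1 == vec:
            vec1 = vec
        elif vec2 is None or vec2 == vec:
            vec2 = vec
            mask[i] += size  # lanes of the second source are offset by size
        else:
            return None  # a third source vector cannot be one shuffle
    return mask


print(build_shuffle_mask([("a", 0), None, ("b", 1), ("a", 3)], 4))  # prints [0, -1, 5, 3]
```

Pre-filling the mask and writing by position, rather than appending, is what lets undef entries be skipped without shifting later lanes.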

4 years, 5 months


[TCWG CI] 453.povray slowed down by 3% after llvm: [llvm] Use range-based for loops (NFC)

by ci_notify＠linaro.org

After llvm commit f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c
Author: Kazu Hirata <kazu(a)google.com>

    [llvm] Use range-based for loops (NFC)

the following benchmarks slowed down by more than 2%:
- 453.povray slowed down by 3% from 4906 to 5047 perf samples

The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVIDIA TX1 4x Cortex-A57

This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. We plan to add support for SPEC CPU2017 benchmarks and to provide the "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.

This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c
cd investigate-llvm-f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach c572eb1ad9d8a528bcaff0160888aff31b1f4b5f
../artifacts/test.sh

cd ..
</cut> Full commit (up to 1000 lines): <cut> commit f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c Author: Kazu Hirata <kazu(a)google.com> Date: Mon Nov 29 09:04:44 2021 -0800 [llvm] Use range-based for loops (NFC) --- llvm/lib/CodeGen/MachinePipeliner.cpp | 7 ++--- .../CodeGen/SelectionDAG/ResourcePriorityQueue.cpp | 4 +-- llvm/lib/Object/ELFObjectFile.cpp | 4 +-- llvm/lib/ObjectYAML/COFFEmitter.cpp | 32 ++++++++++------------ llvm/lib/Passes/StandardInstrumentations.cpp | 4 +-- llvm/lib/ProfileData/InstrProf.cpp | 11 ++++---- llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp | 18 ++++++------ llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp | 32 ++++++++++------------ 8 files changed, 50 insertions(+), 62 deletions(-) diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp index 21be6718b7d9..8d6459a627fa 100644 --- a/llvm/lib/CodeGen/MachinePipeliner.cpp +++ b/llvm/lib/CodeGen/MachinePipeliner.cpp @@ -1519,9 +1519,8 @@ static bool pred_L(SetVector<SUnit *> &NodeOrder, SmallSetVector<SUnit *, 8> &Preds, const NodeSet *S = nullptr) { Preds.clear(); - for (SetVector<SUnit *>::iterator I = NodeOrder.begin(), E = NodeOrder.end(); - I != E; ++I) { - for (const SDep &Pred : (*I)->Preds) { + for (const SUnit *SU : NodeOrder) { + for (const SDep &Pred : SU->Preds) { if (S && S->count(Pred.getSUnit()) == 0) continue; if (ignoreDependence(Pred, true)) @@ -1530,7 +1529,7 @@ static bool pred_L(SetVector<SUnit *> &NodeOrder, Preds.insert(Pred.getSUnit()); } // Back-edges are predecessors with an anti-dependence. 
- for (const SDep &Succ : (*I)->Succs) { + for (const SDep &Succ : SU->Succs) { if (Succ.getKind() != SDep::Anti) continue; if (S && S->count(Succ.getSUnit()) == 0) diff --git a/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp b/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp index 55fe26eb64cd..2695ed36991c 100644 --- a/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp @@ -268,8 +268,8 @@ bool ResourcePriorityQueue::isResourceAvailable(SUnit *SU) { // Now see if there are no other dependencies // to instructions already in the packet. - for (unsigned i = 0, e = Packet.size(); i != e; ++i) - for (const SDep &Succ : Packet[i]->Succs) { + for (const SUnit *S : Packet) + for (const SDep &Succ : S->Succs) { // Since we do not add pseudos to packets, might as well // ignore order deps. if (Succ.isCtrl()) diff --git a/llvm/lib/Object/ELFObjectFile.cpp b/llvm/lib/Object/ELFObjectFile.cpp index 50035d6c7523..cf1f12d9a9a7 100644 --- a/llvm/lib/Object/ELFObjectFile.cpp +++ b/llvm/lib/Object/ELFObjectFile.cpp @@ -682,7 +682,7 @@ readDynsymVersionsImpl(const ELFFile<ELFT> &EF, std::vector<VersionEntry> Ret; size_t I = 0; - for (auto It = Symbols.begin(), E = Symbols.end(); It != E; ++It) { + for (const ELFSymbolRef &Sym : Symbols) { ++I; Expected<const typename ELFT::Versym *> VerEntryOrErr = EF.template getEntry<typename ELFT::Versym>(*VerSec, I); @@ -691,7 +691,7 @@ readDynsymVersionsImpl(const ELFFile<ELFT> &EF, " from " + describe(EF, *VerSec) + ": " + toString(VerEntryOrErr.takeError())); - Expected<uint32_t> FlagsOrErr = It->getFlags(); + Expected<uint32_t> FlagsOrErr = Sym.getFlags(); if (!FlagsOrErr) return createError("unable to read flags for symbol with index " + Twine(I) + ": " + toString(FlagsOrErr.takeError())); diff --git a/llvm/lib/ObjectYAML/COFFEmitter.cpp b/llvm/lib/ObjectYAML/COFFEmitter.cpp index 5f38ca13cfc2..66ad16db1ba4 100644 --- a/llvm/lib/ObjectYAML/COFFEmitter.cpp +++ 
b/llvm/lib/ObjectYAML/COFFEmitter.cpp @@ -476,29 +476,25 @@ static bool writeCOFF(COFFParser &CP, raw_ostream &OS) { assert(OS.tell() == CP.SectionTableStart); // Output section table. - for (std::vector<COFFYAML::Section>::iterator i = CP.Obj.Sections.begin(), - e = CP.Obj.Sections.end(); - i != e; ++i) { - OS.write(i->Header.Name, COFF::NameSize); - OS << binary_le(i->Header.VirtualSize) - << binary_le(i->Header.VirtualAddress) - << binary_le(i->Header.SizeOfRawData) - << binary_le(i->Header.PointerToRawData) - << binary_le(i->Header.PointerToRelocations) - << binary_le(i->Header.PointerToLineNumbers) - << binary_le(i->Header.NumberOfRelocations) - << binary_le(i->Header.NumberOfLineNumbers) - << binary_le(i->Header.Characteristics); + for (const COFFYAML::Section &S : CP.Obj.Sections) { + OS.write(S.Header.Name, COFF::NameSize); + OS << binary_le(S.Header.VirtualSize) + << binary_le(S.Header.VirtualAddress) + << binary_le(S.Header.SizeOfRawData) + << binary_le(S.Header.PointerToRawData) + << binary_le(S.Header.PointerToRelocations) + << binary_le(S.Header.PointerToLineNumbers) + << binary_le(S.Header.NumberOfRelocations) + << binary_le(S.Header.NumberOfLineNumbers) + << binary_le(S.Header.Characteristics); } assert(OS.tell() == CP.SectionTableStart + CP.SectionTableSize); unsigned CurSymbol = 0; StringMap<unsigned> SymbolTableIndexMap; - for (std::vector<COFFYAML::Symbol>::iterator I = CP.Obj.Symbols.begin(), - E = CP.Obj.Symbols.end(); - I != E; ++I) { - SymbolTableIndexMap[I->Name] = CurSymbol; - CurSymbol += 1 + I->Header.NumberOfAuxSymbols; + for (const COFFYAML::Symbol &Sym : CP.Obj.Symbols) { + SymbolTableIndexMap[Sym.Name] = CurSymbol; + CurSymbol += 1 + Sym.Header.NumberOfAuxSymbols; } // Output section data. 
diff --git a/llvm/lib/Passes/StandardInstrumentations.cpp b/llvm/lib/Passes/StandardInstrumentations.cpp index 8e6be6730ea4..27a6c519ff82 100644 --- a/llvm/lib/Passes/StandardInstrumentations.cpp +++ b/llvm/lib/Passes/StandardInstrumentations.cpp @@ -225,8 +225,8 @@ std::string doSystemDiff(StringRef Before, StringRef After, return "Unable to read result."; // Clean up. - for (unsigned I = 0; I < NumFiles; ++I) { - std::error_code EC = sys::fs::remove(FileName[I]); + for (const std::string &I : FileName) { + std::error_code EC = sys::fs::remove(I); if (EC) return "Unable to remove temporary file."; } diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp index 1168ad27fe52..ab3487ecffe8 100644 --- a/llvm/lib/ProfileData/InstrProf.cpp +++ b/llvm/lib/ProfileData/InstrProf.cpp @@ -657,19 +657,18 @@ void InstrProfValueSiteRecord::merge(InstrProfValueSiteRecord &Input, Input.sortByTargetValues(); auto I = ValueData.begin(); auto IE = ValueData.end(); - for (auto J = Input.ValueData.begin(), JE = Input.ValueData.end(); J != JE; - ++J) { - while (I != IE && I->Value < J->Value) + for (const InstrProfValueData &J : Input.ValueData) { + while (I != IE && I->Value < J.Value) ++I; - if (I != IE && I->Value == J->Value) { + if (I != IE && I->Value == J.Value) { bool Overflowed; - I->Count = SaturatingMultiplyAdd(J->Count, Weight, I->Count, &Overflowed); + I->Count = SaturatingMultiplyAdd(J.Count, Weight, I->Count, &Overflowed); if (Overflowed) Warn(instrprof_error::counter_overflow); ++I; continue; } - ValueData.insert(I, *J); + ValueData.insert(I, J); } } diff --git a/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp b/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp index 43f0758f6598..8c3b9572201e 100644 --- a/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp +++ b/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp @@ -476,10 +476,10 @@ namespace { } // end anonymous namespace static const NodeSet *node_class(GepNode *N, NodeSymRel &Rel) { - for 
(NodeSymRel::iterator I = Rel.begin(), E = Rel.end(); I != E; ++I) - if (I->count(N)) - return &*I; - return nullptr; + for (const NodeSet &S : Rel) + if (S.count(N)) + return &S; + return nullptr; } // Create an ordered pair of GepNode pointers. The pair will be used in @@ -589,9 +589,8 @@ void HexagonCommonGEP::common() { dbgs() << "{ " << I->first << ", " << I->second << " }\n"; dbgs() << "Gep equivalence classes:\n"; - for (NodeSymRel::iterator I = EqRel.begin(), E = EqRel.end(); I != E; ++I) { + for (const NodeSet &S : EqRel) { dbgs() << '{'; - const NodeSet &S = *I; for (NodeSet::const_iterator J = S.begin(), F = S.end(); J != F; ++J) { if (J != S.begin()) dbgs() << ','; @@ -604,8 +603,7 @@ void HexagonCommonGEP::common() { // Create a projection from a NodeSet to the minimal element in it. using ProjMap = std::map<const NodeSet *, GepNode *>; ProjMap PM; - for (NodeSymRel::iterator I = EqRel.begin(), E = EqRel.end(); I != E; ++I) { - const NodeSet &S = *I; + for (const NodeSet &S : EqRel) { GepNode *Min = *std::min_element(S.begin(), S.end(), NodeOrder); std::pair<ProjMap::iterator,bool> Ins = PM.insert(std::make_pair(&S, Min)); (void)Ins; @@ -1280,8 +1278,8 @@ bool HexagonCommonGEP::runOnFunction(Function &F) { return false; // For now bail out on C++ exception handling. 
- for (Function::iterator A = F.begin(), Z = F.end(); A != Z; ++A) - for (BasicBlock::iterator I = A->begin(), E = A->end(); I != E; ++I) + for (const BasicBlock &BB : F) + for (const Instruction &I : BB) if (isa<InvokeInst>(I) || isa<LandingPadInst>(I)) return false; diff --git a/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp b/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp index c2ddfd6164f4..c35e67d6726f 100644 --- a/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp +++ b/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp @@ -130,10 +130,8 @@ VisitGlobalVariableForEmission(const GlobalVariable *GV, for (unsigned i = 0, e = GV->getNumOperands(); i != e; ++i) DiscoverDependentGlobals(GV->getOperand(i), Others); - for (DenseSet<const GlobalVariable *>::iterator I = Others.begin(), - E = Others.end(); - I != E; ++I) - VisitGlobalVariableForEmission(*I, Order, Visited, Visiting); + for (const GlobalVariable *GV : Others) + VisitGlobalVariableForEmission(GV, Order, Visited, Visiting); // Now we can visit ourself Order.push_back(GV); @@ -699,35 +697,33 @@ static bool useFuncSeen(const Constant *C, void NVPTXAsmPrinter::emitDeclarations(const Module &M, raw_ostream &O) { DenseMap<const Function *, bool> seenMap; - for (Module::const_iterator FI = M.begin(), FE = M.end(); FI != FE; ++FI) { - const Function *F = &*FI; - - if (F->getAttributes().hasFnAttr("nvptx-libcall-callee")) { - emitDeclaration(F, O); + for (const Function &F : M) { + if (F.getAttributes().hasFnAttr("nvptx-libcall-callee")) { + emitDeclaration(&F, O); continue; } - if (F->isDeclaration()) { - if (F->use_empty()) + if (F.isDeclaration()) { + if (F.use_empty()) continue; - if (F->getIntrinsicID()) + if (F.getIntrinsicID()) continue; - emitDeclaration(F, O); + emitDeclaration(&F, O); continue; } - for (const User *U : F->users()) { + for (const User *U : F.users()) { if (const Constant *C = dyn_cast<Constant>(U)) { if (usedInGlobalVarDef(C)) { // The use is in the initialization of a global variable // that is a function pointer, 
so print a declaration // for the original function - emitDeclaration(F, O); + emitDeclaration(&F, O); break; } // Emit a declaration of this function if the function that // uses this constant expr has already been seen. if (useFuncSeen(C, seenMap)) { - emitDeclaration(F, O); + emitDeclaration(&F, O); break; } } @@ -746,11 +742,11 @@ void NVPTXAsmPrinter::emitDeclarations(const Module &M, raw_ostream &O) { // appearing in the module before the callee. so print out // a declaration for the callee. if (seenMap.find(caller) != seenMap.end()) { - emitDeclaration(F, O); + emitDeclaration(&F, O); break; } } - seenMap[F] = true; + seenMap[&F] = true; } } </cut>

4 years, 5 months


[TCWG CI] 458.sjeng slowed down by 5% after llvm: Reland "[LICM] Hoist LOAD without sinking the STORE"

by ci_notify＠linaro.org

After llvm commit 2cdc6f2ca62e83fec445114fbbe6276e9ab2a7d0
Author: Djordje Todorovic <djordje.todorovic(a)syrmia.com>

    Reland "[LICM] Hoist LOAD without sinking the STORE"

the following benchmarks slowed down by more than 2%:
- 458.sjeng slowed down by 5% from 13781 to 14482 perf samples

The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3 -flto
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . We plan to add support for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-2cdc6f2ca62e83fec445114fbbe6276e9ab2a7d0
cd investigate-llvm-2cdc6f2ca62e83fec445114fbbe6276e9ab2a7d0

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach 2cdc6f2ca62e83fec445114fbbe6276e9ab2a7d0
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 47616c8855fd44abcbd7cad3f7d8153d28db347b
../artifacts/test.sh

cd ..
</cut> Full commit (up to 1000 lines): <cut> commit 2cdc6f2ca62e83fec445114fbbe6276e9ab2a7d0 Author: Djordje Todorovic <djordje.todorovic(a)syrmia.com> Date: Thu Dec 2 03:40:00 2021 -0800 Reland "[LICM] Hoist LOAD without sinking the STORE" When doing load/store promotion within LICM, if we cannot prove that it is safe to sink the store we won't hoist the load, even though we can prove the load could be dereferenced and moved outside the loop. This patch implements the load promotion by moving it in the loop preheader by inserting proper PHI in the loop. The store is kept as is in the loop. By doing this, we avoid doing the load from a memory location in each iteration. Please consider this small example: loop { var = *ptr; if (var) break; *ptr= var + 1; } After this patch, it will be: var0 = *ptr; loop { var1 = phi (var0, var2); if (var1) break; var2 = var1 + 1; *ptr = var2; } This addresses some problems from [0]. [0] https://bugs.llvm.org/show_bug.cgi?id=51193 Differential revision: https://reviews.llvm.org/D113289 --- llvm/include/llvm/Transforms/Utils/SSAUpdater.h | 4 +++ llvm/lib/Transforms/Scalar/LICM.cpp | 41 +++++++++++++++++----- llvm/lib/Transforms/Utils/SSAUpdater.cpp | 3 ++ .../Transforms/InstMerge/st_sink_bugfix_22613.ll | 6 ++-- .../Transforms/LICM/hoist-load-without-store.ll | 5 +-- llvm/test/Transforms/LICM/promote-capture.ll | 8 +++-- .../Transforms/LICM/scalar-promote-memmodel.ll | 8 +++-- .../Transforms/LICM/scalar-promote-opaque-ptrs.ll | 8 +++-- llvm/test/Transforms/LICM/scalar-promote.ll | 8 +++-- 9 files changed, 65 insertions(+), 26 deletions(-) diff --git a/llvm/include/llvm/Transforms/Utils/SSAUpdater.h b/llvm/include/llvm/Transforms/Utils/SSAUpdater.h index 22b2295cc9d7..c233e3dc168e 100644 --- a/llvm/include/llvm/Transforms/Utils/SSAUpdater.h +++ b/llvm/include/llvm/Transforms/Utils/SSAUpdater.h @@ -169,6 +169,10 @@ public: /// Called to update debug info associated with the instruction. 
virtual void updateDebugInfo(Instruction *I) const {} + + /// Return false if a sub-class wants to keep one of the loads/stores + /// after the SSA construction. + virtual bool shouldDelete(Instruction *I) const { return true; } }; } // end namespace llvm diff --git a/llvm/lib/Transforms/Scalar/LICM.cpp b/llvm/lib/Transforms/Scalar/LICM.cpp index 0d52448efb2b..6f97f3e93123 100644 --- a/llvm/lib/Transforms/Scalar/LICM.cpp +++ b/llvm/lib/Transforms/Scalar/LICM.cpp @@ -1860,6 +1860,7 @@ class LoopPromoter : public LoadAndStorePromoter { bool UnorderedAtomic; AAMDNodes AATags; ICFLoopSafetyInfo &SafetyInfo; + bool CanInsertStoresInExitBlocks; // We're about to add a use of V in a loop exit block. Insert an LCSSA phi // (if legal) if doing so would add an out-of-loop use to an instruction @@ -1886,12 +1887,13 @@ public: SmallVectorImpl<MemoryAccess *> &MSSAIP, PredIteratorCache &PIC, MemorySSAUpdater *MSSAU, LoopInfo &li, DebugLoc dl, Align Alignment, bool UnorderedAtomic, const AAMDNodes &AATags, - ICFLoopSafetyInfo &SafetyInfo) + ICFLoopSafetyInfo &SafetyInfo, bool CanInsertStoresInExitBlocks) : LoadAndStorePromoter(Insts, S), SomePtr(SP), PointerMustAliases(PMA), LoopExitBlocks(LEB), LoopInsertPts(LIP), MSSAInsertPts(MSSAIP), PredCache(PIC), MSSAU(MSSAU), LI(li), DL(std::move(dl)), Alignment(Alignment), UnorderedAtomic(UnorderedAtomic), AATags(AATags), - SafetyInfo(SafetyInfo) {} + SafetyInfo(SafetyInfo), + CanInsertStoresInExitBlocks(CanInsertStoresInExitBlocks) {} bool isInstInList(Instruction *I, const SmallVectorImpl<Instruction *> &) const override { @@ -1903,7 +1905,7 @@ public: return PointerMustAliases.count(Ptr); } - void doExtraRewritesBeforeFinalDeletion() override { + void insertStoresInLoopExitBlocks() { // Insert stores after in the loop exit blocks. Each exit block gets a // store of the live-out values that feed them. 
Since we've already told // the SSA updater about the defs in the loop and the preheader @@ -1937,10 +1939,21 @@ public: } } + void doExtraRewritesBeforeFinalDeletion() override { + if (CanInsertStoresInExitBlocks) + insertStoresInLoopExitBlocks(); + } + void instructionDeleted(Instruction *I) const override { SafetyInfo.removeInstruction(I); MSSAU->removeMemoryAccess(I); } + + bool shouldDelete(Instruction *I) const override { + if (isa<StoreInst>(I)) + return CanInsertStoresInExitBlocks; + return true; + } }; bool isNotCapturedBeforeOrInLoop(const Value *V, const Loop *L, @@ -2039,6 +2052,7 @@ bool llvm::promoteLoopAccessesToScalars( bool DereferenceableInPH = false; bool SafeToInsertStore = false; + bool FoundLoadToPromote = false; SmallVector<Instruction *, 64> LoopUses; @@ -2086,6 +2100,7 @@ bool llvm::promoteLoopAccessesToScalars( SawUnorderedAtomic |= Load->isAtomic(); SawNotAtomic |= !Load->isAtomic(); + FoundLoadToPromote = true; Align InstAlignment = Load->getAlign(); @@ -2197,13 +2212,20 @@ bool llvm::promoteLoopAccessesToScalars( } } - // If we've still failed to prove we can sink the store, give up. - if (!SafeToInsertStore) + // If we've still failed to prove we can sink the store, hoist the load + // only, if possible. + if (!SafeToInsertStore && !FoundLoadToPromote) + // If we cannot hoist the load either, give up. return false; - // Otherwise, this is safe to promote, lets do it! - LLVM_DEBUG(dbgs() << "LICM: Promoting value stored to in loop: " << *SomePtr - << '\n'); + // Lets do the promotion! 
+ if (SafeToInsertStore) + LLVM_DEBUG(dbgs() << "LICM: Promoting load/store of the value: " << *SomePtr + << '\n'); + else + LLVM_DEBUG(dbgs() << "LICM: Promoting load of the value: " << *SomePtr + << '\n'); + ORE->emit([&]() { return OptimizationRemark(DEBUG_TYPE, "PromoteLoopAccessesToScalar", LoopUses[0]) @@ -2222,7 +2244,8 @@ bool llvm::promoteLoopAccessesToScalars( SSAUpdater SSA(&NewPHIs); LoopPromoter Promoter(SomePtr, LoopUses, SSA, PointerMustAliases, ExitBlocks, InsertPts, MSSAInsertPts, PIC, MSSAU, *LI, DL, - Alignment, SawUnorderedAtomic, AATags, *SafetyInfo); + Alignment, SawUnorderedAtomic, AATags, *SafetyInfo, + SafeToInsertStore); // Set up the preheader to have a definition of the value. It is the live-out // value from the preheader that uses in the loop will use. diff --git a/llvm/lib/Transforms/Utils/SSAUpdater.cpp b/llvm/lib/Transforms/Utils/SSAUpdater.cpp index 5893ce15b129..7d9992176658 100644 --- a/llvm/lib/Transforms/Utils/SSAUpdater.cpp +++ b/llvm/lib/Transforms/Utils/SSAUpdater.cpp @@ -446,6 +446,9 @@ void LoadAndStorePromoter::run(const SmallVectorImpl<Instruction *> &Insts) { // Now that everything is rewritten, delete the old instructions from the // function. They should all be dead now. for (Instruction *User : Insts) { + if (!shouldDelete(User)) + continue; + // If this is a load that still has uses, then the load must have been added // as a live value in the SSAUpdate data structure for a block (e.g. because // the loaded value was stored later). 
In this case, we need to recursively diff --git a/llvm/test/Transforms/InstMerge/st_sink_bugfix_22613.ll b/llvm/test/Transforms/InstMerge/st_sink_bugfix_22613.ll index 48882eca44cc..e5a75cca8ee7 100644 --- a/llvm/test/Transforms/InstMerge/st_sink_bugfix_22613.ll +++ b/llvm/test/Transforms/InstMerge/st_sink_bugfix_22613.ll @@ -5,12 +5,12 @@ target triple = "x86_64-unknown-linux-gnu" ; RUN: opt -O2 -S < %s | FileCheck %s ; CHECK-LABEL: main -; CHECK: if.end -; CHECK: store ; CHECK: memset ; CHECK: if.then ; CHECK: store -; CHECK: memset +; CHECK: if.end +; CHECK: store +; CHECK: store @d = common global i32 0, align 4 @b = common global i32 0, align 4 diff --git a/llvm/test/Transforms/LICM/hoist-load-without-store.ll b/llvm/test/Transforms/LICM/hoist-load-without-store.ll index b464f6b7328d..275a53172737 100644 --- a/llvm/test/Transforms/LICM/hoist-load-without-store.ll +++ b/llvm/test/Transforms/LICM/hoist-load-without-store.ll @@ -18,10 +18,11 @@ define dso_local void @f(i32* nocapture %ptr, i32 %n) { ; CHECK-NEXT: [[CMP7:%.*]] = icmp slt i32 0, [[N:%.*]] ; CHECK-NEXT: br i1 [[CMP7]], label [[FOR_BODY_LR_PH:%.*]], label [[CLEANUP1:%.*]] ; CHECK: for.body.lr.ph: +; CHECK-NEXT: [[PTR_PROMOTED:%.*]] = load i32, i32* [[PTR:%.*]], align 4 ; CHECK-NEXT: br label [[FOR_BODY:%.*]] ; CHECK: for.body: -; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ 0, [[FOR_BODY_LR_PH]] ], [ [[INC:%.*]], [[IF_END:%.*]] ] -; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[PTR:%.*]], align 4 +; CHECK-NEXT: [[TMP0:%.*]] = phi i32 [ [[PTR_PROMOTED]], [[FOR_BODY_LR_PH]] ], [ 1, [[IF_END:%.*]] ] +; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ 0, [[FOR_BODY_LR_PH]] ], [ [[INC:%.*]], [[IF_END]] ] ; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[TMP0]], 0 ; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_END]], label [[FOR_BODY_CLEANUP1_CRIT_EDGE:%.*]] ; CHECK: if.end: diff --git a/llvm/test/Transforms/LICM/promote-capture.ll b/llvm/test/Transforms/LICM/promote-capture.ll index 1a2603d1c986..945036e6e175 100644 --- 
a/llvm/test/Transforms/LICM/promote-capture.ll +++ b/llvm/test/Transforms/LICM/promote-capture.ll @@ -111,17 +111,19 @@ define void @test_captured_before_loop(i32 %len) { ; CHECK-NEXT: [[COUNT:%.*]] = alloca i32, align 4 ; CHECK-NEXT: store i32 0, i32* [[COUNT]], align 4 ; CHECK-NEXT: call void @capture(i32* [[COUNT]]) +; CHECK-NEXT: [[COUNT_PROMOTED:%.*]] = load i32, i32* [[COUNT]], align 4 ; CHECK-NEXT: br label [[LOOP:%.*]] ; CHECK: loop: -; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[LATCH:%.*]] ] +; CHECK-NEXT: [[C_INC2:%.*]] = phi i32 [ [[COUNT_PROMOTED]], [[ENTRY:%.*]] ], [ [[C_INC1:%.*]], [[LATCH:%.*]] ] +; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[I_NEXT:%.*]], [[LATCH]] ] ; CHECK-NEXT: [[COND:%.*]] = call i1 @cond(i32 [[I]]) ; CHECK-NEXT: br i1 [[COND]], label [[IF:%.*]], label [[LATCH]] ; CHECK: if: -; CHECK-NEXT: [[C:%.*]] = load i32, i32* [[COUNT]], align 4 -; CHECK-NEXT: [[C_INC:%.*]] = add i32 [[C]], 1 +; CHECK-NEXT: [[C_INC:%.*]] = add i32 [[C_INC2]], 1 ; CHECK-NEXT: store i32 [[C_INC]], i32* [[COUNT]], align 4 ; CHECK-NEXT: br label [[LATCH]] ; CHECK: latch: +; CHECK-NEXT: [[C_INC1]] = phi i32 [ [[C_INC]], [[IF]] ], [ [[C_INC2]], [[LOOP]] ] ; CHECK-NEXT: [[I_NEXT]] = add nuw i32 [[I]], 1 ; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[LEN:%.*]] ; CHECK-NEXT: br i1 [[CMP]], label [[EXIT:%.*]], label [[LOOP]] diff --git a/llvm/test/Transforms/LICM/scalar-promote-memmodel.ll b/llvm/test/Transforms/LICM/scalar-promote-memmodel.ll index c3bae731fb6b..33076b39e908 100644 --- a/llvm/test/Transforms/LICM/scalar-promote-memmodel.ll +++ b/llvm/test/Transforms/LICM/scalar-promote-memmodel.ll @@ -11,19 +11,21 @@ define void @bar(i32 %n, i32 %b) nounwind uwtable ssp { ; CHECK-LABEL: @bar( ; CHECK-NEXT: entry: ; CHECK-NEXT: [[TOBOOL:%.*]] = icmp eq i32 [[B:%.*]], 0 +; CHECK-NEXT: [[G_PROMOTED:%.*]] = load i32, i32* @g, align 4 ; CHECK-NEXT: br label [[FOR_COND:%.*]] ; CHECK: for.cond: -; CHECK-NEXT: [[I_0:%.*]] = 
phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC5:%.*]], [[FOR_INC:%.*]] ] +; CHECK-NEXT: [[INC2:%.*]] = phi i32 [ [[G_PROMOTED]], [[ENTRY:%.*]] ], [ [[INC1:%.*]], [[FOR_INC:%.*]] ] +; CHECK-NEXT: [[I_0:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[INC5:%.*]], [[FOR_INC]] ] ; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[I_0]], [[N:%.*]] ; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END:%.*]] ; CHECK: for.body: ; CHECK-NEXT: br i1 [[TOBOOL]], label [[FOR_INC]], label [[IF_THEN:%.*]] ; CHECK: if.then: -; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* @g, align 4 -; CHECK-NEXT: [[INC:%.*]] = add nsw i32 [[TMP3]], 1 +; CHECK-NEXT: [[INC:%.*]] = add nsw i32 [[INC2]], 1 ; CHECK-NEXT: store i32 [[INC]], i32* @g, align 4 ; CHECK-NEXT: br label [[FOR_INC]] ; CHECK: for.inc: +; CHECK-NEXT: [[INC1]] = phi i32 [ [[INC]], [[IF_THEN]] ], [ [[INC2]], [[FOR_BODY]] ] ; CHECK-NEXT: [[INC5]] = add nsw i32 [[I_0]], 1 ; CHECK-NEXT: br label [[FOR_COND]] ; CHECK: for.end: diff --git a/llvm/test/Transforms/LICM/scalar-promote-opaque-ptrs.ll b/llvm/test/Transforms/LICM/scalar-promote-opaque-ptrs.ll index da4bae936dc1..b239b6fb0296 100644 --- a/llvm/test/Transforms/LICM/scalar-promote-opaque-ptrs.ll +++ b/llvm/test/Transforms/LICM/scalar-promote-opaque-ptrs.ll @@ -314,17 +314,19 @@ define i32 @test7bad() { ; CHECK-NEXT: entry: ; CHECK-NEXT: [[LOCAL:%.*]] = alloca i32, align 4 ; CHECK-NEXT: call void @capture(ptr [[LOCAL]]) +; CHECK-NEXT: [[LOCAL_PROMOTED:%.*]] = load i32, ptr [[LOCAL]], align 4 ; CHECK-NEXT: br label [[LOOP:%.*]] ; CHECK: loop: -; CHECK-NEXT: [[J:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[NEXT:%.*]], [[ELSE:%.*]] ] -; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[LOCAL]], align 4 -; CHECK-NEXT: [[X2:%.*]] = call i32 @opaque(i32 [[X]]) +; CHECK-NEXT: [[X22:%.*]] = phi i32 [ [[LOCAL_PROMOTED]], [[ENTRY:%.*]] ], [ [[X21:%.*]], [[ELSE:%.*]] ] +; CHECK-NEXT: [[J:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[NEXT:%.*]], [[ELSE]] ] +; CHECK-NEXT: [[X2:%.*]] = call i32 @opaque(i32 [[X22]]) ; CHECK-NEXT: 
[[CMP:%.*]] = icmp eq i32 [[X2]], 0 ; CHECK-NEXT: br i1 [[CMP]], label [[IF:%.*]], label [[ELSE]] ; CHECK: if: ; CHECK-NEXT: store i32 [[X2]], ptr [[LOCAL]], align 4 ; CHECK-NEXT: br label [[ELSE]] ; CHECK: else: +; CHECK-NEXT: [[X21]] = phi i32 [ [[X2]], [[IF]] ], [ [[X22]], [[LOOP]] ] ; CHECK-NEXT: [[NEXT]] = add i32 [[J]], 1 ; CHECK-NEXT: [[COND:%.*]] = icmp eq i32 [[NEXT]], 0 ; CHECK-NEXT: br i1 [[COND]], label [[EXIT:%.*]], label [[LOOP]] diff --git a/llvm/test/Transforms/LICM/scalar-promote.ll b/llvm/test/Transforms/LICM/scalar-promote.ll index 290e990f8513..c064edb8cd93 100644 --- a/llvm/test/Transforms/LICM/scalar-promote.ll +++ b/llvm/test/Transforms/LICM/scalar-promote.ll @@ -315,17 +315,19 @@ define i32 @test7bad() { ; CHECK-NEXT: entry: ; CHECK-NEXT: [[LOCAL:%.*]] = alloca i32, align 4 ; CHECK-NEXT: call void @capture(i32* [[LOCAL]]) +; CHECK-NEXT: [[LOCAL_PROMOTED:%.*]] = load i32, i32* [[LOCAL]], align 4 ; CHECK-NEXT: br label [[LOOP:%.*]] ; CHECK: loop: -; CHECK-NEXT: [[J:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[NEXT:%.*]], [[ELSE:%.*]] ] -; CHECK-NEXT: [[X:%.*]] = load i32, i32* [[LOCAL]], align 4 -; CHECK-NEXT: [[X2:%.*]] = call i32 @opaque(i32 [[X]]) +; CHECK-NEXT: [[X22:%.*]] = phi i32 [ [[LOCAL_PROMOTED]], [[ENTRY:%.*]] ], [ [[X21:%.*]], [[ELSE:%.*]] ] +; CHECK-NEXT: [[J:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[NEXT:%.*]], [[ELSE]] ] +; CHECK-NEXT: [[X2:%.*]] = call i32 @opaque(i32 [[X22]]) ; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[X2]], 0 ; CHECK-NEXT: br i1 [[CMP]], label [[IF:%.*]], label [[ELSE]] ; CHECK: if: ; CHECK-NEXT: store i32 [[X2]], i32* [[LOCAL]], align 4 ; CHECK-NEXT: br label [[ELSE]] ; CHECK: else: +; CHECK-NEXT: [[X21]] = phi i32 [ [[X2]], [[IF]] ], [ [[X22]], [[LOOP]] ] ; CHECK-NEXT: [[NEXT]] = add i32 [[J]], 1 ; CHECK-NEXT: [[COND:%.*]] = icmp eq i32 [[NEXT]], 0 ; CHECK-NEXT: br i1 [[COND]], label [[EXIT:%.*]], label [[LOOP]] </cut>
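
The load-only promotion described in the commit message above can be illustrated with a source-level analogy in C++. This is only a sketch, not LLVM code: `loop_before` and `loop_after` are hypothetical names, the iteration counter is added purely so the two versions can be compared, and the real pass works on LLVM IR and must first prove that the load is safe to hoist. The sketch assumes nothing else writes to `*ptr` while the loop runs.

```cpp
// Before the transformation: the value is reloaded from memory on
// every iteration, and the store stays inside the loop.
int loop_before(int *ptr) {
  int iterations = 0;
  for (;;) {
    ++iterations;
    int var = *ptr;   // load executed in each iteration
    if (var)
      break;
    *ptr = var + 1;   // store inside the loop
  }
  return iterations;
}

// After the transformation: a single load happens before the loop
// (the "preheader"), and the local variable plays the role of the PHI
// node that carries the value between iterations. The store is kept
// inside the loop, as in the commit message.
int loop_after(int *ptr) {
  int iterations = 0;
  int var = *ptr;     // var0: hoisted load
  for (;;) {
    ++iterations;     // var1 = phi(var0, var2) is the current 'var'
    if (var)
      break;
    var = var + 1;    // var2
    *ptr = var;       // store is not sunk out of the loop
  }
  return iterations;
}
```

Calling both functions on separate integers that start with the same value should leave the same value in memory and take the same number of iterations; the second version simply avoids one memory read per iteration.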

4 years, 5 months


[ACTIVITY] week ending Dec. 5 2021

by Alex Bennée

VirtIO Initiative ([STR-9])
===========================

- synced up on [AF_XDP task with Akashi-san]
- synced on rust-vmm

[AF_XDP task with Akashi-san] <https://linaro.atlassian.net/browse/STR-68>

vhost-device maintainer effort ([UM-196])

- started looking at https://github.com/rust-vmm/vhost-device/pull/4

QEMU Upstream Work ([UM-2])
===========================

- posted [PULL for 6.2 0/8] more tcg, plugin, test and build fixes
  Message-Id: <20211129171449.4176301-1-alex.bennee(a)linaro.org>
- commented on Re: Follow-up on the CXL discussion at OFTC
  Message-Id: <20211119015207.62fhk5mjmvaj5nz4(a)intel.com>
  to see if I can unblock
- posted [RFC PATCH] blog post: how to get your new feature up-streamed
  Message-Id: <20211126203319.3298089-1-alex.bennee(a)linaro.org>
- posted [PATCH for 6.2?] Revert "vga: don't abort when adding a duplicate isa-vga device"
  Message-Id: <20211202164929.1119036-1-alex.bennee(a)linaro.org>

Upstream MTTCG tests ([QEMU-52])

- posted [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
  Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>

Other
=====

- wrote [RFC PATCH 0/2] insn plugin tweaks for measuring frequency
  Message-Id: <20211203144421.1445232-1-alex.bennee(a)linaro.org>
  - might make a good basis for a TCG plugins blog post

Completed Reviews [2/2]
=======================

[PATCH] tests/plugin/syscall.c: fix compiler warnings
Message-Id: <20211128011551.2115468-1-juro.bystricky(a)intel.com>

[PATCH for-6.2? 0/2] arm_gicv3: Fix handling of LPIs in list registers
Message-Id: <20211126163915.1048353-2-peter.maydell(a)linaro.org>

Current Review Queue
====================

TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64 Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
===============================================================================================================

TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================

TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================

TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================

--
Alex Bennée

4 years, 5 months


[ACTIVITY] report week ending 3 Dec

by Peter Maydell

Progress:

* UM-2 [QEMU upstream maintainership]
  - Code review: worked through some of the backlog and accumulated a list of series to take once the tree reopens for 7.0
  - Wrote and sent some cleanup patches relating to the qemu-common.h header file
  - Fixed a bug where we miscalculated the length for TLB range invalidations
* QEMU-420 [GICv4 emulation]
  - Found the problem with PCI passthrough in my nested test setup: apparently virtio PCI devices need an extra command line argument to get them to honour the presence of an IOMMU. Everything is now working and I've put some notes about the setup into https://linaro.atlassian.net/browse/QEMU-447
  - started to implement the GICv4 redistributor changes

-- PMM

4 years, 5 months


[ACTIVITY] week ending Nov. 28 2021

by Alex Bennée

VirtIO Initiative ([STR-9])
===========================

- [this week's sync], topics on AF_XDP, virtio-video and virtio-watchdog

[upstream rust-vmm sync meeting] <https://etherpad.opendev.org/p/rust-vmm-sync-2021&sa=D&source=calendar&ust=…>

QEMU Upstream Work ([UM-2])
===========================

- posted [PATCH for 6.2 v2 0/7] more tcg, plugin, test and build fixes
  Message-Id: <20211125154144.2904741-1-alex.bennee(a)linaro.org>

Upstream MTTCG tests ([QEMU-52])

- posted [kvm-unit-tests PATCH v8 00/10] MTTCG sanity tests for ARM
  Message-Id: <20211118184650.661575-1-alex.bennee(a)linaro.org>

[mttcg tests to current state and fixed up] <https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>

Other
=====

- renewal feedback

Completed Reviews [2/2]
=======================

[PATCH v2 0/3] KVM: qemu patches for few KVM features I developed
Message-Id: <20211101132300.192584-1-mlevitsk(a)redhat.com>

[PATCH v2] hw/intc/arm_gicv3: Update cached state after LPI state changes
Message-Id: <20211124202005.989935-1-peter.maydell(a)linaro.org>

Absences
========

- off 2 days sick

Current Review Queue
====================

TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64 Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
===============================================================================================================

TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================

TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================

TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================

--
Alex Bennée

4 years, 5 months


[ACTIVITY] report week ending 26 Nov

by Peter Maydell

Progress:

* QEMU-420 [GICv4 emulation]
  - Tracked down and fixed a bug in our ITS emulation which would (intermittently?) result in a Linux guest reporting "irq 54: nobody cared" and hanging, because we were not correctly recalculating the highest priority pending interrupt when the guest acknowledged a pending LPI. This fix will go into 6.2.
  - Set up a test environment for GICv4 work -- because the major feature of GICv4 is support for directly injecting interrupts into a VM, the test setup needs to be nested virtualization, where an outer L1 guest runs on pure emulated QEMU, the inner L2 guest uses KVM (as provided by L1), and we pass a PCI device (emulated by QEMU) through from L1 to L2. I think I have this correctly set up now, but...
  - ...the L2 guest hangs because it apparently never sees an interrupt from the passed-through PCI device. This implies a bug in our current GICv3 emulation somewhere: need to track this down before starting in on GICv4 work.
  - Separately, I found through code inspection a bug where we do the wrong thing in the non-passthrough case when the L1 guest sets a virtual interrupt for the L2 guest in the GIC list registers and that interrupt has an ID > 1023 (ie it is an LPI). We got this wrong both for acknowledging and ending an interrupt, so the two bugs cancel each other out except that we don't set the vCPU priority and so the L2 guest might get an unexpected interrupt while it was servicing the LPI. Patches sent.

-- PMM

4 years, 5 months


[TCWG CI] 433.milc slowed down by 5% after llvm: [AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32

by ci_notify＠linaro.org

After llvm commit d7e03df719464354b20a845b7853be57da863924
Author: Jay Foad <jay.foad(a)amd.com>

    [AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32

the following benchmarks slowed down by more than 2%:
- 433.milc slowed down by 5% from 12335 to 12997 perf samples

The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . We plan to add support for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-d7e03df719464354b20a845b7853be57da863924
cd investigate-llvm-d7e03df719464354b20a845b7853be57da863924

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach d7e03df719464354b20a845b7853be57da863924
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 8a52bd82e36855b3ad842f2535d0c78a97db55dc
../artifacts/test.sh

cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit d7e03df719464354b20a845b7853be57da863924
Author: Jay Foad <jay.foad(a)amd.com>
Date:   Fri Nov 12 18:02:58 2021 +0000

    [AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32

    Select SelectionDAG ops smul_lohi/umul_lohi to
    v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0.
    v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions
    so it is better to use one instruction than two.

    Further improvements are possible to make better use of the addend
    operand, but this is already a strict improvement over what we have
    now.

    Differential Revision: https://reviews.llvm.org/D113986
---
 llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp     |   29 +
 llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h       |    1 +
 llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp     |   49 +
 llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h       |    1 +
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp         |   23 +
 llvm/lib/Target/AMDGPU/SIISelLowering.h           |    1 +
 .../AMDGPU/atomic_optimizations_global_pointer.ll |  104 +-
 .../AMDGPU/atomic_optimizations_local_pointer.ll  |  108 +-
 llvm/test/CodeGen/AMDGPU/bypass-div.ll            | 1064 +++++++++-----------
 llvm/test/CodeGen/AMDGPU/llvm.mulo.ll             |  178 ++--
 llvm/test/CodeGen/AMDGPU/mad_64_32.ll             |  110 +-
 llvm/test/CodeGen/AMDGPU/mul.ll                   |   55 +-
 llvm/test/CodeGen/AMDGPU/mul_int24.ll             |    9 +-
 llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll     |   24 +-
 llvm/test/CodeGen/AMDGPU/udiv.ll                  |  358 +++----
 llvm/test/CodeGen/AMDGPU/wwm-reserved-spill.ll    |  126 +--
 llvm/test/CodeGen/AMDGPU/wwm-reserved.ll          |   16 +-
 17 files changed, 1126 insertions(+), 1130 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 2e571ad01c1c..8236e6672247 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -654,6 +654,9 @@ void AMDGPUDAGToDAGISel::Select(SDNode *N) {
     SelectMAD_64_32(N);
     return;
   }
+  case ISD::SMUL_LOHI:
+  case ISD::UMUL_LOHI:
+    return SelectMUL_LOHI(N);
   case ISD::CopyToReg: {
     const SITargetLowering& Lowering =
       *static_cast<const SITargetLowering*>(getTargetLowering());
@@ -1013,6 +1016,32 @@ void AMDGPUDAGToDAGISel::SelectMAD_64_32(SDNode *N) {
   CurDAG->SelectNodeTo(N, Opc, N->getVTList(), Ops);
 }

+// We need to handle this here because tablegen doesn't support matching
+// instructions with multiple outputs.
+void AMDGPUDAGToDAGISel::SelectMUL_LOHI(SDNode *N) {
+  SDLoc SL(N);
+  bool Signed = N->getOpcode() == ISD::SMUL_LOHI;
+  unsigned Opc = Signed ? AMDGPU::V_MAD_I64_I32_e64 : AMDGPU::V_MAD_U64_U32_e64;
+
+  SDValue Zero = CurDAG->getTargetConstant(0, SL, MVT::i64);
+  SDValue Clamp = CurDAG->getTargetConstant(0, SL, MVT::i1);
+  SDValue Ops[] = {N->getOperand(0), N->getOperand(1), Zero, Clamp};
+  SDNode *Mad = CurDAG->getMachineNode(Opc, SL, N->getVTList(), Ops);
+  if (!SDValue(N, 0).use_empty()) {
+    SDValue Sub0 = CurDAG->getTargetConstant(AMDGPU::sub0, SL, MVT::i32);
+    SDNode *Lo = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG, SL,
+                                        MVT::i32, SDValue(Mad, 0), Sub0);
+    ReplaceUses(SDValue(N, 0), SDValue(Lo, 0));
+  }
+  if (!SDValue(N, 1).use_empty()) {
+    SDValue Sub1 = CurDAG->getTargetConstant(AMDGPU::sub1, SL, MVT::i32);
+    SDNode *Hi = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG, SL,
+                                        MVT::i32, SDValue(Mad, 0), Sub1);
+    ReplaceUses(SDValue(N, 1), SDValue(Hi, 0));
+  }
+  CurDAG->RemoveDeadNode(N);
+}
+
 bool AMDGPUDAGToDAGISel::isDSOffsetLegal(SDValue Base, unsigned Offset) const {
   if (!isUInt<16>(Offset))
     return false;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
index 74aff9e406c9..d638d9877a9b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
@@ -235,6 +235,7 @@ private:
   void SelectUADDO_USUBO(SDNode *N);
   void SelectDIV_SCALE(SDNode *N);
   void SelectMAD_64_32(SDNode *N);
+  void SelectMUL_LOHI(SDNode *N);
   void SelectFMA_W_CHAIN(SDNode *N);
   void SelectFMUL_W_CHAIN(SDNode *N);
   SDNode *getBFE32(bool IsSigned, const SDLoc &DL, SDValue Val, uint32_t Offset,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 523fa2d3724b..54177564afbc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -594,6 +594,8 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   setTargetDAGCombine(ISD::SRL);
   setTargetDAGCombine(ISD::TRUNCATE);
   setTargetDAGCombine(ISD::MUL);
+  setTargetDAGCombine(ISD::SMUL_LOHI);
+  setTargetDAGCombine(ISD::UMUL_LOHI);
   setTargetDAGCombine(ISD::MULHU);
   setTargetDAGCombine(ISD::MULHS);
   setTargetDAGCombine(ISD::SELECT);
@@ -3462,6 +3464,50 @@ SDValue AMDGPUTargetLowering::performMulCombine(SDNode *N,
   return DAG.getSExtOrTrunc(Mul, DL, VT);
 }

+SDValue
+AMDGPUTargetLowering::performMulLoHiCombine(SDNode *N,
+                                            DAGCombinerInfo &DCI) const {
+  if (N->getValueType(0) != MVT::i32)
+    return SDValue();
+
+  SelectionDAG &DAG = DCI.DAG;
+  SDLoc DL(N);
+
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+
+  // SimplifyDemandedBits has the annoying habit of turning useful zero_extends
+  // in the source into any_extends if the result of the mul is truncated. Since
+  // we can assume the high bits are whatever we want, use the underlying value
+  // to avoid the unknown high bits from interfering.
+  if (N0.getOpcode() == ISD::ANY_EXTEND)
+    N0 = N0.getOperand(0);
+  if (N1.getOpcode() == ISD::ANY_EXTEND)
+    N1 = N1.getOperand(0);
+
+  // Try to use two fast 24-bit multiplies (one for each half of the result)
+  // instead of one slow extending multiply.
+  unsigned LoOpcode, HiOpcode;
+  if (Subtarget->hasMulU24() && isU24(N0, DAG) && isU24(N1, DAG)) {
+    N0 = DAG.getZExtOrTrunc(N0, DL, MVT::i32);
+    N1 = DAG.getZExtOrTrunc(N1, DL, MVT::i32);
+    LoOpcode = AMDGPUISD::MUL_U24;
+    HiOpcode = AMDGPUISD::MULHI_U24;
+  } else if (Subtarget->hasMulI24() && isI24(N0, DAG) && isI24(N1, DAG)) {
+    N0 = DAG.getSExtOrTrunc(N0, DL, MVT::i32);
+    N1 = DAG.getSExtOrTrunc(N1, DL, MVT::i32);
+    LoOpcode = AMDGPUISD::MUL_I24;
+    HiOpcode = AMDGPUISD::MULHI_I24;
+  } else {
+    return SDValue();
+  }
+
+  SDValue Lo = DAG.getNode(LoOpcode, DL, MVT::i32, N0, N1);
+  SDValue Hi = DAG.getNode(HiOpcode, DL, MVT::i32, N0, N1);
+  DCI.CombineTo(N, Lo, Hi);
+  return SDValue(N, 0);
+}
+
 SDValue AMDGPUTargetLowering::performMulhsCombine(SDNode *N,
                                                   DAGCombinerInfo &DCI) const {
   EVT VT = N->getValueType(0);
@@ -4103,6 +4149,9 @@ SDValue AMDGPUTargetLowering::PerformDAGCombine(SDNode *N,
     return performTruncateCombine(N, DCI);
   case ISD::MUL:
     return performMulCombine(N, DCI);
+  case ISD::SMUL_LOHI:
+  case ISD::UMUL_LOHI:
+    return performMulLoHiCombine(N, DCI);
   case ISD::MULHS:
     return performMulhsCombine(N, DCI);
   case ISD::MULHU:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index 03632ac18598..daaca8737c5d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -91,6 +91,7 @@ protected:
   SDValue performSrlCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performTruncateCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performMulCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+  SDValue performMulLoHiCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performMulhsCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performMulhuCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performCtlz_CttzCombine(const SDLoc &SL, SDValue Cond, SDValue LHS,
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 519c5b936536..02440044d6e2 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -809,6 +809,11 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   setOperationAction(ISD::SMULO, MVT::i64, Custom);
   setOperationAction(ISD::UMULO, MVT::i64, Custom);

+  if (Subtarget->hasMad64_32()) {
+    setOperationAction(ISD::SMUL_LOHI, MVT::i32, Custom);
+    setOperationAction(ISD::UMUL_LOHI, MVT::i32, Custom);
+  }
+
   setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::Other, Custom);
   setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::f32, Custom);
   setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::v4f32, Custom);
@@ -4691,6 +4696,9 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::SMULO:
   case ISD::UMULO:
     return lowerXMULO(Op, DAG);
+  case ISD::SMUL_LOHI:
+  case ISD::UMUL_LOHI:
+    return lowerXMUL_LOHI(Op, DAG);
   case ISD::DYNAMIC_STACKALLOC:
     return LowerDYNAMIC_STACKALLOC(Op, DAG);
   }
@@ -5304,6 +5312,21 @@ SDValue SITargetLowering::lowerXMULO(SDValue Op, SelectionDAG &DAG) const {
   return DAG.getMergeValues({ Result, Overflow }, SL);
 }

+SDValue SITargetLowering::lowerXMUL_LOHI(SDValue Op, SelectionDAG &DAG) const {
+  if (Op->isDivergent()) {
+    // Select to V_MAD_[IU]64_[IU]32.
+    return Op;
+  }
+  if (Subtarget->hasSMulHi()) {
+    // Expand to S_MUL_I32 + S_MUL_HI_[IU]32.
+    return SDValue();
+  }
+  // The multiply is uniform but we would have to use V_MUL_HI_[IU]32 to
+  // calculate the high part, so we might as well do the whole thing with
+  // V_MAD_[IU]64_[IU]32.
+ return Op; +} + SDValue SITargetLowering::lowerTRAP(SDValue Op, SelectionDAG &DAG) const { if (!Subtarget->isTrapHandlerEnabled() || Subtarget->getTrapHandlerAbi() != GCNSubtarget::TrapHandlerAbi::AMDHSA) diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h index 1e48c96ad3c8..ea6ca3f48827 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.h +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h @@ -135,6 +135,7 @@ private: SDValue lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const; SDValue lowerFMINNUM_FMAXNUM(SDValue Op, SelectionDAG &DAG) const; SDValue lowerXMULO(SDValue Op, SelectionDAG &DAG) const; + SDValue lowerXMUL_LOHI(SDValue Op, SelectionDAG &DAG) const; SDValue getSegmentAperture(unsigned AS, const SDLoc &DL, SelectionDAG &DAG) const; diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll index 49f05fceb8ed..4ad774db6686 100644 --- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll +++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll @@ -818,32 +818,29 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX8-NEXT: s_mov_b32 s12, s6 ; GFX8-NEXT: s_bcnt1_i32_b64 s6, s[8:9] ; GFX8-NEXT: v_mov_b32_e32 v0, s6 -; GFX8-NEXT: v_mul_hi_u32 v0, s0, v0 -; GFX8-NEXT: s_mov_b32 s13, s7 -; GFX8-NEXT: s_mul_i32 s7, s1, s6 -; GFX8-NEXT: s_mul_i32 s6, s0, s6 +; GFX8-NEXT: v_mad_u64_u32 v[0:1], s[8:9], s0, v0, 0 +; GFX8-NEXT: s_mul_i32 s6, s1, s6 ; GFX8-NEXT: s_mov_b32 s15, 0xf000 ; GFX8-NEXT: s_mov_b32 s14, -1 -; GFX8-NEXT: v_add_u32_e32 v1, vcc, s7, v0 -; GFX8-NEXT: v_mov_b32_e32 v0, s6 +; GFX8-NEXT: s_mov_b32 s13, s7 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, s6, v1 ; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0) ; GFX8-NEXT: buffer_atomic_add_x2 v[0:1], off, s[12:15], 0 glc ; GFX8-NEXT: s_waitcnt vmcnt(0) ; GFX8-NEXT: buffer_wbinvl1_vol ; GFX8-NEXT: .LBB4_2: ; GFX8-NEXT: s_or_b64 exec, exec, 
s[2:3] -; GFX8-NEXT: v_readfirstlane_b32 s2, v0 ; GFX8-NEXT: s_waitcnt lgkmcnt(0) -; GFX8-NEXT: v_mul_lo_u32 v0, s1, v2 -; GFX8-NEXT: v_mul_hi_u32 v3, s0, v2 +; GFX8-NEXT: v_mul_lo_u32 v4, s1, v2 +; GFX8-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s0, v2, 0 +; GFX8-NEXT: v_readfirstlane_b32 s0, v0 ; GFX8-NEXT: v_readfirstlane_b32 s1, v1 -; GFX8-NEXT: v_mul_lo_u32 v1, s0, v2 -; GFX8-NEXT: s_mov_b32 s7, 0xf000 -; GFX8-NEXT: v_add_u32_e32 v2, vcc, v3, v0 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, v3, v4 ; GFX8-NEXT: v_mov_b32_e32 v3, s1 -; GFX8-NEXT: v_add_u32_e32 v0, vcc, s2, v1 +; GFX8-NEXT: v_add_u32_e32 v0, vcc, s0, v2 +; GFX8-NEXT: s_mov_b32 s7, 0xf000 ; GFX8-NEXT: s_mov_b32 s6, -1 -; GFX8-NEXT: v_addc_u32_e32 v1, vcc, v3, v2, vcc +; GFX8-NEXT: v_addc_u32_e32 v1, vcc, v3, v1, vcc ; GFX8-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX8-NEXT: s_endpgm ; @@ -878,17 +875,16 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX9-NEXT: .LBB4_2: ; GFX9-NEXT: s_or_b64 exec, exec, s[0:1] ; GFX9-NEXT: s_waitcnt lgkmcnt(0) -; GFX9-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX9-NEXT: v_mul_hi_u32 v4, s2, v2 +; GFX9-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s2, v2, 0 ; GFX9-NEXT: v_readfirstlane_b32 s0, v0 -; GFX9-NEXT: v_mul_lo_u32 v0, s2, v2 ; GFX9-NEXT: v_readfirstlane_b32 s1, v1 -; GFX9-NEXT: v_add_u32_e32 v1, v4, v3 -; GFX9-NEXT: v_mov_b32_e32 v2, s1 -; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, s0, v0 +; GFX9-NEXT: v_add_u32_e32 v1, v3, v4 +; GFX9-NEXT: v_mov_b32_e32 v3, s1 +; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, s0, v2 ; GFX9-NEXT: s_mov_b32 s7, 0xf000 ; GFX9-NEXT: s_mov_b32 s6, -1 -; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v2, v1, vcc +; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v3, v1, vcc ; GFX9-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX9-NEXT: s_endpgm ; @@ -927,14 +923,13 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX1064-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1064-NEXT: 
s_or_b64 exec, exec, s[0:1] ; GFX1064-NEXT: s_waitcnt lgkmcnt(0) -; GFX1064-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1064-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1064-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1064-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1064-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s2, v2, 0 ; GFX1064-NEXT: v_readfirstlane_b32 s0, v0 ; GFX1064-NEXT: v_readfirstlane_b32 s1, v1 ; GFX1064-NEXT: s_mov_b32 s7, 0x31016000 ; GFX1064-NEXT: s_mov_b32 s6, -1 -; GFX1064-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1064-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1064-NEXT: v_add_co_u32 v0, vcc, s0, v2 ; GFX1064-NEXT: v_add_co_ci_u32_e32 v1, vcc, s1, v1, vcc ; GFX1064-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 @@ -974,14 +969,13 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX1032-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1032-NEXT: s_or_b32 exec_lo, exec_lo, s0 ; GFX1032-NEXT: s_waitcnt lgkmcnt(0) -; GFX1032-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1032-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1032-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1032-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1032-NEXT: v_mad_u64_u32 v[2:3], s0, s2, v2, 0 ; GFX1032-NEXT: v_readfirstlane_b32 s0, v0 ; GFX1032-NEXT: v_readfirstlane_b32 s1, v1 ; GFX1032-NEXT: s_mov_b32 s7, 0x31016000 ; GFX1032-NEXT: s_mov_b32 s6, -1 -; GFX1032-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1032-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1032-NEXT: v_add_co_u32 v0, vcc_lo, s0, v2 ; GFX1032-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, s1, v1, vcc_lo ; GFX1032-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 @@ -1955,32 +1949,29 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX8-NEXT: s_mov_b32 s12, s6 ; GFX8-NEXT: s_bcnt1_i32_b64 s6, s[8:9] ; GFX8-NEXT: v_mov_b32_e32 v0, s6 -; GFX8-NEXT: v_mul_hi_u32 v0, s0, v0 -; GFX8-NEXT: s_mov_b32 s13, s7 -; GFX8-NEXT: s_mul_i32 s7, s1, s6 -; GFX8-NEXT: s_mul_i32 s6, s0, s6 +; GFX8-NEXT: v_mad_u64_u32 v[0:1], s[8:9], s0, v0, 0 +; GFX8-NEXT: s_mul_i32 s6, s1, 
s6 ; GFX8-NEXT: s_mov_b32 s15, 0xf000 ; GFX8-NEXT: s_mov_b32 s14, -1 -; GFX8-NEXT: v_add_u32_e32 v1, vcc, s7, v0 -; GFX8-NEXT: v_mov_b32_e32 v0, s6 +; GFX8-NEXT: s_mov_b32 s13, s7 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, s6, v1 ; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0) ; GFX8-NEXT: buffer_atomic_sub_x2 v[0:1], off, s[12:15], 0 glc ; GFX8-NEXT: s_waitcnt vmcnt(0) ; GFX8-NEXT: buffer_wbinvl1_vol ; GFX8-NEXT: .LBB10_2: ; GFX8-NEXT: s_or_b64 exec, exec, s[2:3] -; GFX8-NEXT: v_readfirstlane_b32 s2, v0 ; GFX8-NEXT: s_waitcnt lgkmcnt(0) -; GFX8-NEXT: v_mul_lo_u32 v0, s1, v2 -; GFX8-NEXT: v_mul_hi_u32 v3, s0, v2 +; GFX8-NEXT: v_mul_lo_u32 v4, s1, v2 +; GFX8-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s0, v2, 0 +; GFX8-NEXT: v_readfirstlane_b32 s0, v0 ; GFX8-NEXT: v_readfirstlane_b32 s1, v1 -; GFX8-NEXT: v_mul_lo_u32 v1, s0, v2 -; GFX8-NEXT: s_mov_b32 s7, 0xf000 -; GFX8-NEXT: v_add_u32_e32 v2, vcc, v3, v0 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, v3, v4 ; GFX8-NEXT: v_mov_b32_e32 v3, s1 -; GFX8-NEXT: v_sub_u32_e32 v0, vcc, s2, v1 +; GFX8-NEXT: v_sub_u32_e32 v0, vcc, s0, v2 +; GFX8-NEXT: s_mov_b32 s7, 0xf000 ; GFX8-NEXT: s_mov_b32 s6, -1 -; GFX8-NEXT: v_subb_u32_e32 v1, vcc, v3, v2, vcc +; GFX8-NEXT: v_subb_u32_e32 v1, vcc, v3, v1, vcc ; GFX8-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX8-NEXT: s_endpgm ; @@ -2015,17 +2006,16 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX9-NEXT: .LBB10_2: ; GFX9-NEXT: s_or_b64 exec, exec, s[0:1] ; GFX9-NEXT: s_waitcnt lgkmcnt(0) -; GFX9-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX9-NEXT: v_mul_hi_u32 v4, s2, v2 +; GFX9-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s2, v2, 0 ; GFX9-NEXT: v_readfirstlane_b32 s0, v0 -; GFX9-NEXT: v_mul_lo_u32 v0, s2, v2 ; GFX9-NEXT: v_readfirstlane_b32 s1, v1 -; GFX9-NEXT: v_add_u32_e32 v1, v4, v3 -; GFX9-NEXT: v_mov_b32_e32 v2, s1 -; GFX9-NEXT: v_sub_co_u32_e32 v0, vcc, s0, v0 +; GFX9-NEXT: v_add_u32_e32 v1, v3, v4 +; GFX9-NEXT: v_mov_b32_e32 v3, s1 +; 
GFX9-NEXT: v_sub_co_u32_e32 v0, vcc, s0, v2 ; GFX9-NEXT: s_mov_b32 s7, 0xf000 ; GFX9-NEXT: s_mov_b32 s6, -1 -; GFX9-NEXT: v_subb_co_u32_e32 v1, vcc, v2, v1, vcc +; GFX9-NEXT: v_subb_co_u32_e32 v1, vcc, v3, v1, vcc ; GFX9-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX9-NEXT: s_endpgm ; @@ -2064,14 +2054,13 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX1064-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1064-NEXT: s_or_b64 exec, exec, s[0:1] ; GFX1064-NEXT: s_waitcnt lgkmcnt(0) -; GFX1064-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1064-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1064-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1064-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1064-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s2, v2, 0 ; GFX1064-NEXT: v_readfirstlane_b32 s0, v0 ; GFX1064-NEXT: v_readfirstlane_b32 s1, v1 ; GFX1064-NEXT: s_mov_b32 s7, 0x31016000 ; GFX1064-NEXT: s_mov_b32 s6, -1 -; GFX1064-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1064-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1064-NEXT: v_sub_co_u32 v0, vcc, s0, v2 ; GFX1064-NEXT: v_sub_co_ci_u32_e32 v1, vcc, s1, v1, vcc ; GFX1064-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 @@ -2111,14 +2100,13 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 addrspace ; GFX1032-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1032-NEXT: s_or_b32 exec_lo, exec_lo, s0 ; GFX1032-NEXT: s_waitcnt lgkmcnt(0) -; GFX1032-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1032-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1032-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1032-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1032-NEXT: v_mad_u64_u32 v[2:3], s0, s2, v2, 0 ; GFX1032-NEXT: v_readfirstlane_b32 s0, v0 ; GFX1032-NEXT: v_readfirstlane_b32 s1, v1 ; GFX1032-NEXT: s_mov_b32 s7, 0x31016000 ; GFX1032-NEXT: s_mov_b32 s6, -1 -; GFX1032-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1032-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1032-NEXT: v_sub_co_u32 v0, vcc_lo, s0, v2 ; GFX1032-NEXT: v_sub_co_ci_u32_e32 v1, vcc_lo, s1, v1, vcc_lo ; GFX1032-NEXT: buffer_store_dwordx2 
v[0:1], off, s[4:7], 0 diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll index 455f9de836ba..bf91960537a4 100644 --- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll +++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll @@ -954,15 +954,13 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 %additive ; GFX8-NEXT: s_and_saveexec_b64 s[4:5], vcc ; GFX8-NEXT: s_cbranch_execz .LBB5_2 ; GFX8-NEXT: ; %bb.1: -; GFX8-NEXT: s_bcnt1_i32_b64 s6, s[6:7] -; GFX8-NEXT: v_mov_b32_e32 v0, s6 +; GFX8-NEXT: s_bcnt1_i32_b64 s8, s[6:7] +; GFX8-NEXT: v_mov_b32_e32 v0, s8 ; GFX8-NEXT: s_waitcnt lgkmcnt(0) -; GFX8-NEXT: v_mul_hi_u32 v0, s2, v0 -; GFX8-NEXT: s_mul_i32 s7, s3, s6 -; GFX8-NEXT: s_mul_i32 s6, s2, s6 +; GFX8-NEXT: v_mad_u64_u32 v[0:1], s[6:7], s2, v0, 0 +; GFX8-NEXT: s_mul_i32 s6, s3, s8 ; GFX8-NEXT: v_mov_b32_e32 v3, 0 -; GFX8-NEXT: v_add_u32_e32 v1, vcc, s7, v0 -; GFX8-NEXT: v_mov_b32_e32 v0, s6 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, s6, v1 ; GFX8-NEXT: s_mov_b32 m0, -1 ; GFX8-NEXT: s_waitcnt lgkmcnt(0) ; GFX8-NEXT: ds_add_rtn_u64 v[0:1], v3, v[0:1] @@ -971,18 +969,17 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 %additive ; GFX8-NEXT: s_or_b64 exec, exec, s[4:5] ; GFX8-NEXT: s_waitcnt lgkmcnt(0) ; GFX8-NEXT: s_mov_b32 s4, s0 -; GFX8-NEXT: v_readfirstlane_b32 s0, v0 -; GFX8-NEXT: v_mul_lo_u32 v0, s3, v2 -; GFX8-NEXT: v_mul_hi_u32 v3, s2, v2 ; GFX8-NEXT: s_mov_b32 s5, s1 +; GFX8-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX8-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s2, v2, 0 +; GFX8-NEXT: v_readfirstlane_b32 s0, v0 ; GFX8-NEXT: v_readfirstlane_b32 s1, v1 -; GFX8-NEXT: v_mul_lo_u32 v1, s2, v2 -; GFX8-NEXT: v_add_u32_e32 v2, vcc, v3, v0 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, v3, v4 ; GFX8-NEXT: v_mov_b32_e32 v3, s1 -; GFX8-NEXT: v_add_u32_e32 v0, vcc, s0, v1 +; GFX8-NEXT: v_add_u32_e32 v0, vcc, s0, v2 ; GFX8-NEXT: s_mov_b32 
s7, 0xf000 ; GFX8-NEXT: s_mov_b32 s6, -1 -; GFX8-NEXT: v_addc_u32_e32 v1, vcc, v3, v2, vcc +; GFX8-NEXT: v_addc_u32_e32 v1, vcc, v3, v1, vcc ; GFX8-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX8-NEXT: s_endpgm ; @@ -1012,19 +1009,18 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 %additive ; GFX9-NEXT: .LBB5_2: ; GFX9-NEXT: s_or_b64 exec, exec, s[4:5] ; GFX9-NEXT: s_waitcnt lgkmcnt(0) +; GFX9-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[2:3], s2, v2, 0 ; GFX9-NEXT: s_mov_b32 s4, s0 -; GFX9-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX9-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX9-NEXT: v_readfirstlane_b32 s0, v0 -; GFX9-NEXT: v_mul_lo_u32 v0, s2, v2 ; GFX9-NEXT: s_mov_b32 s5, s1 +; GFX9-NEXT: v_readfirstlane_b32 s0, v0 ; GFX9-NEXT: v_readfirstlane_b32 s1, v1 -; GFX9-NEXT: v_add_u32_e32 v1, v4, v3 -; GFX9-NEXT: v_mov_b32_e32 v2, s1 -; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, s0, v0 +; GFX9-NEXT: v_add_u32_e32 v1, v3, v4 +; GFX9-NEXT: v_mov_b32_e32 v3, s1 +; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, s0, v2 ; GFX9-NEXT: s_mov_b32 s7, 0xf000 ; GFX9-NEXT: s_mov_b32 s6, -1 -; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v2, v1, vcc +; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v3, v1, vcc ; GFX9-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX9-NEXT: s_endpgm ; @@ -1057,13 +1053,12 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 %additive ; GFX1064-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1064-NEXT: s_or_b64 exec, exec, s[4:5] ; GFX1064-NEXT: s_waitcnt lgkmcnt(0) -; GFX1064-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1064-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1064-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1064-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1064-NEXT: v_mad_u64_u32 v[2:3], s[2:3], s2, v2, 0 ; GFX1064-NEXT: v_readfirstlane_b32 s2, v0 ; GFX1064-NEXT: v_readfirstlane_b32 s4, v1 ; GFX1064-NEXT: s_mov_b32 s3, 0x31016000 -; GFX1064-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1064-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1064-NEXT: v_add_co_u32 
v0, vcc, s2, v2 ; GFX1064-NEXT: s_mov_b32 s2, -1 ; GFX1064-NEXT: v_add_co_ci_u32_e32 v1, vcc, s4, v1, vcc @@ -1098,13 +1093,12 @@ define amdgpu_kernel void @add_i64_uniform(i64 addrspace(1)* %out, i64 %additive ; GFX1032-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1032-NEXT: s_or_b32 exec_lo, exec_lo, s4 ; GFX1032-NEXT: s_waitcnt lgkmcnt(0) -; GFX1032-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1032-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1032-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1032-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1032-NEXT: v_mad_u64_u32 v[2:3], s2, s2, v2, 0 ; GFX1032-NEXT: v_readfirstlane_b32 s2, v0 ; GFX1032-NEXT: v_readfirstlane_b32 s4, v1 ; GFX1032-NEXT: s_mov_b32 s3, 0x31016000 -; GFX1032-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1032-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1032-NEXT: v_add_co_u32 v0, vcc_lo, s2, v2 ; GFX1032-NEXT: s_mov_b32 s2, -1 ; GFX1032-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, s4, v1, vcc_lo @@ -2133,15 +2127,13 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 %subitive ; GFX8-NEXT: s_and_saveexec_b64 s[4:5], vcc ; GFX8-NEXT: s_cbranch_execz .LBB12_2 ; GFX8-NEXT: ; %bb.1: -; GFX8-NEXT: s_bcnt1_i32_b64 s6, s[6:7] -; GFX8-NEXT: v_mov_b32_e32 v0, s6 +; GFX8-NEXT: s_bcnt1_i32_b64 s8, s[6:7] +; GFX8-NEXT: v_mov_b32_e32 v0, s8 ; GFX8-NEXT: s_waitcnt lgkmcnt(0) -; GFX8-NEXT: v_mul_hi_u32 v0, s2, v0 -; GFX8-NEXT: s_mul_i32 s7, s3, s6 -; GFX8-NEXT: s_mul_i32 s6, s2, s6 +; GFX8-NEXT: v_mad_u64_u32 v[0:1], s[6:7], s2, v0, 0 +; GFX8-NEXT: s_mul_i32 s6, s3, s8 ; GFX8-NEXT: v_mov_b32_e32 v3, 0 -; GFX8-NEXT: v_add_u32_e32 v1, vcc, s7, v0 -; GFX8-NEXT: v_mov_b32_e32 v0, s6 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, s6, v1 ; GFX8-NEXT: s_mov_b32 m0, -1 ; GFX8-NEXT: s_waitcnt lgkmcnt(0) ; GFX8-NEXT: ds_sub_rtn_u64 v[0:1], v3, v[0:1] @@ -2150,18 +2142,17 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 %subitive ; GFX8-NEXT: s_or_b64 exec, exec, s[4:5] ; GFX8-NEXT: s_waitcnt lgkmcnt(0) ; GFX8-NEXT: s_mov_b32 s4, s0 -; GFX8-NEXT: 
v_readfirstlane_b32 s0, v0 -; GFX8-NEXT: v_mul_lo_u32 v0, s3, v2 -; GFX8-NEXT: v_mul_hi_u32 v3, s2, v2 ; GFX8-NEXT: s_mov_b32 s5, s1 +; GFX8-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX8-NEXT: v_mad_u64_u32 v[2:3], s[0:1], s2, v2, 0 +; GFX8-NEXT: v_readfirstlane_b32 s0, v0 ; GFX8-NEXT: v_readfirstlane_b32 s1, v1 -; GFX8-NEXT: v_mul_lo_u32 v1, s2, v2 -; GFX8-NEXT: v_add_u32_e32 v2, vcc, v3, v0 +; GFX8-NEXT: v_add_u32_e32 v1, vcc, v3, v4 ; GFX8-NEXT: v_mov_b32_e32 v3, s1 -; GFX8-NEXT: v_sub_u32_e32 v0, vcc, s0, v1 +; GFX8-NEXT: v_sub_u32_e32 v0, vcc, s0, v2 ; GFX8-NEXT: s_mov_b32 s7, 0xf000 ; GFX8-NEXT: s_mov_b32 s6, -1 -; GFX8-NEXT: v_subb_u32_e32 v1, vcc, v3, v2, vcc +; GFX8-NEXT: v_subb_u32_e32 v1, vcc, v3, v1, vcc ; GFX8-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX8-NEXT: s_endpgm ; @@ -2191,19 +2182,18 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 %subitive ; GFX9-NEXT: .LBB12_2: ; GFX9-NEXT: s_or_b64 exec, exec, s[4:5] ; GFX9-NEXT: s_waitcnt lgkmcnt(0) +; GFX9-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[2:3], s2, v2, 0 ; GFX9-NEXT: s_mov_b32 s4, s0 -; GFX9-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX9-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX9-NEXT: v_readfirstlane_b32 s0, v0 -; GFX9-NEXT: v_mul_lo_u32 v0, s2, v2 ; GFX9-NEXT: s_mov_b32 s5, s1 +; GFX9-NEXT: v_readfirstlane_b32 s0, v0 ; GFX9-NEXT: v_readfirstlane_b32 s1, v1 -; GFX9-NEXT: v_add_u32_e32 v1, v4, v3 -; GFX9-NEXT: v_mov_b32_e32 v2, s1 -; GFX9-NEXT: v_sub_co_u32_e32 v0, vcc, s0, v0 +; GFX9-NEXT: v_add_u32_e32 v1, v3, v4 +; GFX9-NEXT: v_mov_b32_e32 v3, s1 +; GFX9-NEXT: v_sub_co_u32_e32 v0, vcc, s0, v2 ; GFX9-NEXT: s_mov_b32 s7, 0xf000 ; GFX9-NEXT: s_mov_b32 s6, -1 -; GFX9-NEXT: v_subb_co_u32_e32 v1, vcc, v2, v1, vcc +; GFX9-NEXT: v_subb_co_u32_e32 v1, vcc, v3, v1, vcc ; GFX9-NEXT: buffer_store_dwordx2 v[0:1], off, s[4:7], 0 ; GFX9-NEXT: s_endpgm ; @@ -2236,13 +2226,12 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 %subitive ; 
GFX1064-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1064-NEXT: s_or_b64 exec, exec, s[4:5] ; GFX1064-NEXT: s_waitcnt lgkmcnt(0) -; GFX1064-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1064-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1064-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1064-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1064-NEXT: v_mad_u64_u32 v[2:3], s[2:3], s2, v2, 0 ; GFX1064-NEXT: v_readfirstlane_b32 s2, v0 ; GFX1064-NEXT: v_readfirstlane_b32 s4, v1 ; GFX1064-NEXT: s_mov_b32 s3, 0x31016000 -; GFX1064-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1064-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1064-NEXT: v_sub_co_u32 v0, vcc, s2, v2 ; GFX1064-NEXT: s_mov_b32 s2, -1 ; GFX1064-NEXT: v_sub_co_ci_u32_e32 v1, vcc, s4, v1, vcc @@ -2277,13 +2266,12 @@ define amdgpu_kernel void @sub_i64_uniform(i64 addrspace(1)* %out, i64 %subitive ; GFX1032-NEXT: s_waitcnt_depctr 0xffe3 ; GFX1032-NEXT: s_or_b32 exec_lo, exec_lo, s4 ; GFX1032-NEXT: s_waitcnt lgkmcnt(0) -; GFX1032-NEXT: v_mul_lo_u32 v3, s3, v2 -; GFX1032-NEXT: v_mul_hi_u32 v4, s2, v2 -; GFX1032-NEXT: v_mul_lo_u32 v2, s2, v2 +; GFX1032-NEXT: v_mul_lo_u32 v4, s3, v2 +; GFX1032-NEXT: v_mad_u64_u32 v[2:3], s2, s2, v2, 0 ; GFX1032-NEXT: v_readfirstlane_b32 s2, v0 ; GFX1032-NEXT: v_readfirstlane_b32 s4, v1 ; GFX1032-NEXT: s_mov_b32 s3, 0x31016000 -; GFX1032-NEXT: v_add_nc_u32_e32 v1, v4, v3 +; GFX1032-NEXT: v_add_nc_u32_e32 v1, v3, v4 ; GFX1032-NEXT: v_sub_co_u32 v0, vcc_lo, s2, v2 ; GFX1032-NEXT: s_mov_b32 s2, -1 ; GFX1032-NEXT: v_sub_co_ci_u32_e32 v1, vcc_lo, s4, v1, vcc_lo diff --git a/llvm/test/CodeGen/AMDGPU/bypass-div.ll b/llvm/test/CodeGen/AMDGPU/bypass-div.ll index 4ff9f6159cae..907ba8dd3086 100644 --- a/llvm/test/CodeGen/AMDGPU/bypass-div.ll +++ b/llvm/test/CodeGen/AMDGPU/bypass-div.ll @@ -16,119 +16,107 @@ define i64 @sdiv64(i64 %a, i64 %b) { ; GFX9-NEXT: s_xor_b64 s[6:7], exec, s[4:5] ; GFX9-NEXT: s_cbranch_execz .LBB0_2 ; GFX9-NEXT: ; %bb.1: -; GFX9-NEXT: v_ashrrev_i32_e32 v4, 31, v3 -; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v2, v4 -; GFX9-NEXT: 
v_addc_co_u32_e32 v3, vcc, v3, v4, vcc -; GFX9-NEXT: v_xor_b32_e32 v3, v3, v4 -; GFX9-NEXT: v_xor_b32_e32 v2, v2, v4 -; GFX9-NEXT: v_cvt_f32_u32_e32 v5, v2 -; GFX9-NEXT: v_cvt_f32_u32_e32 v6, v3 -; GFX9-NEXT: v_sub_co_u32_e32 v7, vcc, 0, v2 -; GFX9-NEXT: v_subb_co_u32_e32 v8, vcc, 0, v3, vcc -; GFX9-NEXT: v_mac_f32_e32 v5, 0x4f800000, v6 -; GFX9-NEXT: v_rcp_f32_e32 v5, v5 +; GFX9-NEXT: v_ashrrev_i32_e32 v9, 31, v3 +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v2, v9 +; GFX9-NEXT: v_addc_co_u32_e32 v3, vcc, v3, v9, vcc +; GFX9-NEXT: v_xor_b32_e32 v10, v3, v9 +; GFX9-NEXT: v_xor_b32_e32 v11, v2, v9 +; GFX9-NEXT: v_cvt_f32_u32_e32 v2, v11 +; GFX9-NEXT: v_cvt_f32_u32_e32 v3, v10 +; GFX9-NEXT: v_sub_co_u32_e32 v7, vcc, 0, v11 +; GFX9-NEXT: v_subb_co_u32_e32 v8, vcc, 0, v10, vcc +; GFX9-NEXT: v_mac_f32_e32 v2, 0x4f800000, v3 +; GFX9-NEXT: v_rcp_f32_e32 v2, v2 ; GFX9-NEXT: v_mov_b32_e32 v14, 0 -; GFX9-NEXT: v_mul_f32_e32 v5, 0x5f7ffffc, v5 -; GFX9-NEXT: v_mul_f32_e32 v6, 0x2f800000, v5 -; GFX9-NEXT: v_trunc_f32_e32 v6, v6 -; GFX9-NEXT: v_mac_f32_e32 v5, 0xcf800000, v6 -; GFX9-NEXT: v_cvt_u32_f32_e32 v6, v6 -; GFX9-NEXT: v_cvt_u32_f32_e32 v5, v5 -; GFX9-NEXT: v_mul_lo_u32 v11, v7, v6 -; GFX9-NEXT: v_mul_lo_u32 v9, v8, v5 -; GFX9-NEXT: v_mul_hi_u32 v10, v7, v5 -; GFX9-NEXT: v_mul_lo_u32 v12, v7, v5 -; GFX9-NEXT: v_add3_u32 v9, v10, v11, v9 -; GFX9-NEXT: v_mul_lo_u32 v10, v5, v9 -; GFX9-NEXT: v_mul_hi_u32 v11, v5, v12 -; GFX9-NEXT: v_mul_hi_u32 v13, v5, v9 -; GFX9-NEXT: v_mul_hi_u32 v15, v6, v9 -; GFX9-NEXT: v_mul_lo_u32 v9, v6, v9 -; GFX9-NEXT: v_add_co_u32_e32 v10, vcc, v11, v10 -; GFX9-NEXT: v_addc_co_u32_e32 v11, vcc, 0, v13, vcc -; GFX9-NEXT: v_mul_lo_u32 v13, v6, v12 -; GFX9-NEXT: v_mul_hi_u32 v12, v6, v12 -; GFX9-NEXT: v_add_co_u32_e32 v10, vcc, v10, v13 -; GFX9-NEXT: v_addc_co_u32_e32 v10, vcc, v11, v12, vcc -; GFX9-NEXT: v_addc_co_u32_e32 v11, vcc, v15, v14, vcc -; GFX9-NEXT: v_add_co_u32_e32 v9, vcc, v10, v9 -; GFX9-NEXT: v_addc_co_u32_e32 v10, vcc, 0, v11, vcc -; 
GFX9-NEXT: v_add_co_u32_e32 v5, vcc, v5, v9 -; GFX9-NEXT: v_addc_co_u32_e32 v6, vcc, v6, v10, vcc -; GFX9-NEXT: v_mul_lo_u32 v9, v7, v6 -; GFX9-NEXT: v_mul_lo_u32 v8, v8, v5 -; GFX9-NEXT: v_mul_hi_u32 v10, v7, v5 -; GFX9-NEXT: v_mul_lo_u32 v7, v7, v5 -; GFX9-NEXT: v_add3_u32 v8, v10, v9, v8 -; GFX9-NEXT: v_mul_lo_u32 v11, v5, v8 -; GFX9-NEXT: v_mul_hi_u32 v12, v5, v7 -; GFX9-NEXT: v_mul_hi_u32 v13, v5, v8 -; GFX9-NEXT: v_mul_hi_u32 v10, v6, v7 -; GFX9-NEXT: v_mul_lo_u32 v7, v6, v7 -; GFX9-NEXT: v_mul_hi_u32 v9, v6, v8 -; GFX9-NEXT: v_add_co_u32_e32 v11, vcc, v12, v11 -; GFX9-NEXT: v_addc_co_u32_e32 v12, vcc, 0, v13, vcc -; GFX9-NEXT: v_mul_lo_u32 v8, v6, v8 -; GFX9-NEXT: v_add_co_u32_e32 v7, vcc, v11, v7 -; GFX9-NEXT: v_addc_co_u32_e32 v7, vcc, v12, v10, vcc -; GFX9-NEXT: v_addc_co_u32_e32 v9, vcc, v9, v14, vcc -; GFX9-NEXT: v_add_co_u32_e32 v7, vcc, v7, v8 -; GFX9-NEXT: v_addc_co_u32_e32 v8, vcc, 0, v9, vcc -; GFX9-NEXT: v_add_co_u32_e32 v5, vcc, v5, v7 -; GFX9-NEXT: v_addc_co_u32_e32 v6, vcc, v6, v8, vcc -; GFX9-NEXT: v_ashrrev_i32_e32 v7, 31, v1 -; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v0, v7 -; GFX9-NEXT: v_xor_b32_e32 v0, v0, v7 -; GFX9-NEXT: v_mul_lo_u32 v8, v0, v6 -; GFX9-NEXT: v_mul_hi_u32 v9, v0, v5 -; GFX9-NEXT: v_mul_hi_u32 v10, v0, v6 -; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v7, vcc -; GFX9-NEXT: v_xor_b32_e32 v1, v1, v7 -; GFX9-NEXT: v_add_co_u32_e32 v8, vcc, v9, v8 -; GFX9-NEXT: v_addc_co_u32_e32 v9, vcc, 0, v10, vcc -; GFX9-NEXT: v_mul_lo_u32 v10, v1, v5 -; GFX9-NEXT: v_mul_hi_u32 v5, v1, v5 -; GFX9-NEXT: v_mul_hi_u32 v11, v1, v6 -; GFX9-NEXT: v_mul_lo_u32 v6, v1, v6 -; GFX9-NEXT: v_add_co_u32_e32 v8, vcc, v8, v10 -; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, v9, v5, vcc -; GFX9-NEXT: v_addc_co_u32_e32 v8, vcc, v11, v14, vcc -; GFX9-NEXT: v_add_co_u32_e32 v5, vcc, v5, v6 -; GFX9-NEXT: v_addc_co_u32_e32 v6, vcc, 0, v8, vcc -; GFX9-NEXT: v_mul_lo_u32 v8, v3, v5 -; GFX9-NEXT: v_mul_lo_u32 v9, v2, v6 -; GFX9-NEXT: v_mul_hi_u32 v10, v2, v5 -; GFX9-NEXT: 
v_mul_lo_u32 v11, v2, v5 -; GFX9-NEXT: v_add3_u32 v8, v10, v9, v8 -; GFX9-NEXT: v_sub_u32_e32 v9, v1, v8 -; GFX9-NEXT: v_sub_co_u32_e32 v0, vcc, v0, v11 -; GFX9-NEXT: v_subb_co_u32_e64 v9, s[4:5], v9, v3, vcc -; GFX9-NEXT: v_sub_co_u32_e64 v10, s[4:5], v0, v2 -; GFX9-NEXT: v_subbrev_co_u32_e64 v9, s[4:5], 0, v9, s[4:5] -; GFX9-NEXT: v_cmp_ge_u32_e64 s[4:5], v9, v3 -; GFX9-NEXT: v_cndmask_b32_e64 v11, 0, -1, s[4:5] -; GFX9-NEXT: v_cmp_ge_u32_e64 s[4:5], v10, v2 -; GFX9-NEXT: v_cndmask_b32_e64 v10, 0, -1, s[4:5] -; GFX9-NEXT: v_cmp_eq_u32_e64 s[4:5], v9, v3 -; GFX9-NEXT: v_cndmask_b32_e64 v9, v11, v10, s[4:5] -; GFX9-NEXT: v_add_co_u32_e64 v10, s[4:5], 2, v5 -; GFX9-NEXT: v_subb_co_u32_e32 v1, vcc, v1, v8, vcc -; GFX9-NEXT: v_addc_co_u32_e64 v11, s[4:5], 0, v6, s[4:5] -; GFX9-NEXT: v_cmp_ge_u32_e32 vcc, v1, v3 -; GFX9-NEXT: v_add_co_u32_e64 v12, s[4:5], 1, v5 -; GFX9-NEXT: v_cndmask_b32_e64 v8, 0, -1, vcc -; GFX9-NEXT: v_cmp_ge_u32_e32 vcc, v0, v2 -; GFX9-NEXT: v_addc_co_u32_e64 v13, s[4:5], 0, v6, s[4:5] +; GFX9-NEXT: v_mul_f32_e32 v2, 0x5f7ffffc, v2 +; GFX9-NEXT: v_mul_f32_e32 v3, 0x2f800000, v2 +; GFX9-NEXT: v_trunc_f32_e32 v3, v3 +; GFX9-NEXT: v_mac_f32_e32 v2, 0xcf800000, v3 +; GFX9-NEXT: v_cvt_u32_f32_e32 v6, v2 +; GFX9-NEXT: v_cvt_u32_f32_e32 v12, v3 +; GFX9-NEXT: v_mul_lo_u32 v4, v8, v6 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v7, v6, 0 +; GFX9-NEXT: v_mul_lo_u32 v5, v7, v12 +; GFX9-NEXT: v_mul_hi_u32 v13, v6, v2 +; GFX9-NEXT: v_add3_u32 v5, v3, v5, v4 +; GFX9-NEXT: v_mad_u64_u32 v[3:4], s[4:5], v6, v5, 0 +; GFX9-NEXT: v_add_co_u32_e32 v13, vcc, v13, v3 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v12, v2, 0 +; GFX9-NEXT: v_addc_co_u32_e32 v15, vcc, 0, v4, vcc +; GFX9-NEXT: v_mad_u64_u32 v[4:5], s[4:5], v12, v5, 0 +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v13, v2 +; GFX9-NEXT: v_addc_co_u32_e32 v2, vcc, v15, v3, vcc +; GFX9-NEXT: v_addc_co_u32_e32 v3, vcc, v5, v14, vcc +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v2, v4 +; GFX9-NEXT: v_addc_co_u32_e32 v3, vcc, 0, 
v3, vcc +; GFX9-NEXT: v_add_co_u32_e32 v13, vcc, v6, v2 +; GFX9-NEXT: v_addc_co_u32_e32 v12, vcc, v12, v3, vcc +; GFX9-NEXT: v_mul_lo_u32 v4, v7, v12 +; GFX9-NEXT: v_mul_lo_u32 v5, v8, v13 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v7, v13, 0 +; GFX9-NEXT: v_add3_u32 v5, v3, v4, v5 +; GFX9-NEXT: v_mad_u64_u32 v[3:4], s[4:5], v12, v5, 0 +; GFX9-NEXT: v_mad_u64_u32 v[5:6], s[4:5], v13, v5, 0 +; GFX9-NEXT: v_mul_hi_u32 v15, v13, v2 +; GFX9-NEXT: v_mad_u64_u32 v[7:8], s[4:5], v12, v2, 0 +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v15, v5 +; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v6, vcc +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v2, v7 +; GFX9-NEXT: v_addc_co_u32_e32 v2, vcc, v5, v8, vcc +; GFX9-NEXT: v_addc_co_u32_e32 v4, vcc, v4, v14, vcc +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v2, v3 +; GFX9-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v4, vcc +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v13, v2 +; GFX9-NEXT: v_addc_co_u32_e32 v3, vcc, v12, v3, vcc +; GFX9-NEXT: v_ashrrev_i32_e32 v4, 31, v1 +; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v0, v4 +; GFX9-NEXT: v_xor_b32_e32 v6, v0, v4 +; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, v1, v4, vcc +; GFX9-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v6, v3, 0 +; GFX9-NEXT: v_mul_hi_u32 v7, v6, v2 +; GFX9-NEXT: v_xor_b32_e32 v5, v5, v4 +; GFX9-NEXT: v_add_co_u32_e32 v7, vcc, v7, v0 +; GFX9-NEXT: v_addc_co_u32_e32 v8, vcc, 0, v1, vcc +; GFX9-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v5, v2, 0 +; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v5, v3, 0 +; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v7, v0 +; GFX9-NEXT: v_addc_co_u32_e32 v0, vcc, v8, v1, vcc +; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v3, v14, vcc +; GFX9-NEXT: v_add_co_u32_e32 v2, vcc, v0, v2 +; GFX9-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v1, vcc +; GFX9-NEXT: v_mul_lo_u32 v7, v10, v2 +; GFX9-NEXT: v_mul_lo_u32 v8, v11, v3 +; GFX9-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v11, v2, 0 +; GFX9-NEXT: v_add3_u32 v1, v1, v8, v7 +; GFX9-NEXT: v_sub_u32_e32 v7, v5, v1 +; GFX9-NEXT: v_sub_co_u32_e32 v0, vcc, v6, v0 +; GFX9-NEXT: 
v_subb_co_u32_e64 v6, s[4:5], v7, v10, vcc +; GFX9-NEXT: v_sub_co_u32_e64 v7, s[4:5], v0, v11 +; GFX9-NEXT: v_subbrev_co_u32_e64 v6, s[4:5], 0, v6, s[4:5] +; GFX9-NEXT: v_cmp_ge_u32_e64 s[4:5], v6, v10 +; GFX9-NEXT: v_cndmask_b32_e64 v8, 0, -1, s[4:5] +; GFX9-NEXT: v_cmp_ge_u32_e64 s[4:5], v7, v11 +; GFX9-NEXT: v_cndmask_b32_e64 v7, 0, -1, s[4:5] +; GFX9-NEXT: v_cmp_eq_u32_e64 s[4:5], v6, v10 +; GFX9-NEXT: v_cndmask_b32_e64 v6, v8, v7, s[4:5] +; GFX9-NEXT: v_add_co_u32_e64 v7, s[4:5], 2, v2 +; GFX9-NEXT: v_subb_co_u32_e32 v1, vcc, v5, v1, vcc +; GFX9-NEXT: v_addc_co_u32_e64 v8, s[4:5], 0, v3, s[4:5] +; GFX9-NEXT: v_cmp_ge_u32_e32 vcc, v1, v10 +; GFX9-NEXT: v_add_co_u32_e64 v12, s[4:5], 1, v2 +; GFX9-NEXT: v_cndmask_b32_e64 v5, 0, -1, vcc +; GFX9-NEXT: v_cmp_ge_u32_e32 vcc, v0, v11 +; GFX9-NEXT: v_addc_co_u32_e64 v13, s[4:5], 0, v3, s[4:5] ; GFX9-NEXT: v_cndmask_b32_e64 v0, 0, -1, vcc -; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, v1, v3 -; GFX9-NEXT: v_cmp_ne_u32_e64 s[4:5], 0, v9 -; GFX9-NEXT: v_cndmask_b32_e32 v0, v8, v0, vcc +; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, v1, v10 +; GFX9-NEXT: v_cmp_ne_u32_e64 s[4:5], 0, v6 +; GFX9-NEXT: v_cndmask_b32_e32 v0, v5, v0, vcc ; GFX9-NEXT: v_cmp_ne_u32_e32 vcc, 0, v0 -; GFX9-NEXT: v_cndmask_b32_e64 v1, v12, v10, s[4:5] -; GFX9-NEXT: v_cndmask_b32_e64 v9, v13, v11, s[4:5] -; GFX9-NEXT: v_cndmask_b32_e32 v1, v5, v1, vcc -; GFX9-NEXT: v_xor_b32_e32 v2, v7, v4 -; GFX9-NEXT: v_cndmask_b32_e32 v0, v6, v9, vcc +; GFX9-NEXT: v_cndmask_b32_e64 v1, v12, v7, s[4:5] +; GFX9-NEXT: v_cndmask_b32_e64 v6, v13, v8, s[4:5] +; GFX9-NEXT: v_cndmask_b32_e32 v1, v2, v1, vcc +; GFX9-NEXT: v_xor_b32_e32 v2, v4, v9 +; GFX9-NEXT: v_cndmask_b32_e32 v0, v3, v6, vcc ; GFX9-NEXT: v_xor_b32_e32 v1, v1, v2 ; GFX9-NEXT: v_xor_b32_e32 v0, v0, v2 ; GFX9-NEXT: v_sub_co_u32_e32 v4, vcc, v1, v2 @@ -183,106 +171,94 @@ define i64 @udiv64(i64 %a, i64 %b) { ; GFX9-NEXT: ; %bb.1: ; GFX9-NEXT: v_cvt_f32_u32_e32 v4, v2 ; GFX9-NEXT: v_cvt_f32_u32_e32 v5, v3 -; GFX9-NEXT: 
v_sub_co_u32_e32 v6, vcc, 0, v2 -; GFX9-NEXT: v_subb_co_u32_e32 v7, vcc, 0, v3, vcc +; GFX9-NEXT: v_sub_co_u32_e32 v10, vcc, 0, v2 +; GFX9-NEXT: v_subb_co_u32_e32 v11, vcc, 0, v3, vcc ; GFX9-NEXT: v_mac_f32_e32 v4, 0x4f800000, v5 ; GFX9-NEXT: v_rcp_f32_e32 v4, v4 -; GFX9-NEXT: v_mov_b32_e32 v12, 0 +; GFX9-NEXT: v_mov_b32_e32 v13, 0 ; GFX9-NEXT: v_mul_f32_e32 v4, 0x5f7ffffc, v4 ; GFX9-NEXT: v_mul_f32_e32 v5, 0x2f800000, v4 ; GFX9-NEXT: v_trunc_f32_e32 v5, v5 ; GFX9-NEXT: v_mac_f32_e32 v4, 0xcf800000, v5 -; GFX9-NEXT: v_cvt_u32_f32_e32 v5, v5 -; GFX9-NEXT: v_cvt_u32_f32_e32 v4, v4 -; GFX9-NEXT: v_mul_lo_u32 v8, v6, v5 -; GFX9-NEXT: v_mul_lo_u32 v9, v7, v4 -; GFX9-NEXT: v_mul_hi_u32 v10, v6, v4 -; GFX9-NEXT: v_mul_lo_u32 v11, v6, v4 -; GFX9-NEXT: v_add3_u32 v8, v10, v8, v9 -; GFX9-NEXT: v_mul_hi_u32 v9, v4, v11 -; GFX9-NEXT: v_mul_lo_u32 v10, v4, v8 -; GFX9-NEXT: v_mul_hi_u32 v13, v4, v8 -; GFX9-NEXT: v_mul_hi_u32 v14, v5, v8 -; GFX9-NEXT: v_mul_lo_u32 v8, v5, v8 -; GFX9-NEXT: v_add_co_u32_e32 v9, vcc, v9, v10 -; GFX9-NEXT: v_addc_co_u32_e32 v10, vcc, 0, v13, vcc -; GFX9-NEXT: v_mul_lo_u32 v13, v5, v11 -; GFX9-NEXT: v_mul_hi_u32 v11, v5, v11 -; GFX9-NEXT: v_add_co_u32_e32 v9, vcc, v9, v13 -; GFX9-NEXT: v_addc_co_u32_e32 v9, vcc, v10, v11, vcc -; GFX9-NEXT: v_addc_co_u32_e32 v10, vcc, v14, v12, vcc -; GFX9-NEXT: v_add_co_u32_e32 v8, vcc, v9, v8 -; GFX9-NEXT: v_addc_co_u32_e32 v9, vcc, 0, v10, vcc -; GFX9-NEXT: v_add_co_u32_e32 v4, vcc, v4, v8 -; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, v5, v9, vcc -; GFX9-NEXT: v_mul_lo_u32 v8, v6, v5 -; GFX9-NEXT: v_mul_lo_u32 v7, v7, v4 -; GFX9-NEXT: v_mul_hi_u32 v9, v6, v4 -; GFX9-NEXT: v_mul_lo_u32 v6, v6, v4 -; GFX9-NEXT: v_add3_u32 v7, v9, v8, v7 -; GFX9-NEXT: v_mul_lo_u32 v10, v4, v7 -; GFX9-NEXT: v_mul_hi_u32 v11, v4, v6 -; GFX9-NEXT: v_mul_hi_u32 v13, v4, v7 -; GFX9-NEXT: v_mul_hi_u32 v9, v5, v6 -; GFX9-NEXT: v_mul_lo_u32 v6, v5, v6 -; GFX9-NEXT: v_mul_hi_u32 v8, v5, v7 -; GFX9-NEXT: v_add_co_u32_e32 v10, vcc, v11, v10 -; 
GFX9-NEXT: v_addc_co_u32_e32 v11, vcc, 0, v13, vcc -; GFX9-NEXT: v_mul_lo_u32 v7, v5, v7 -; GFX9-NEXT: v_add_co_u32_e32 v6, vcc, v10, v6 -; GFX9-NEXT: v_addc_co_u32_e32 v6, vcc, v11, v9, vcc -; GFX9-NEXT: v_addc_co_u32_e32 v8, vcc, v8, v12, vcc -; GFX9-NEXT: v_add_co_u32_e32 v6, vcc, v6, v7 -; GFX9-NEXT: v_addc_co_u32_e32 v7, vcc, 0, v8, vcc +; GFX9-NEXT: v_cvt_u32_f32_e32 v8, v5 +; GFX9-NEXT: v_cvt_u32_f32_e32 v9, v4 +; GFX9-NEXT: v_mul_lo_u32 v6, v10, v8 +; GFX9-NEXT: v_mul_lo_u32 v7, v11, v9 +; GFX9-NEXT: v_mad_u64_u32 v[4:5], s[4:5], v10, v9, 0 +; GFX9-NEXT: v_add3_u32 v7, v5, v6, v7 +; GFX9-NEXT: v_mul_hi_u32 v12, v9, v4 +; GFX9-NEXT: v_mad_u64_u32 v[5:6], s[4:5], v9, v7, 0 +; GFX9-NEXT: v_add_co_u32_e32 v12, vcc, v12, v5 +; GFX9-NEXT: v_mad_u64_u32 v[4:5], s[4:5], v8, v4, 0 +; GFX9-NEXT: v_addc_co_u32_e32 v14, vcc, 0, v6, vcc +; GFX9-NEXT: v_mad_u64_u32 v[6:7], s[4:5], v8, v7, 0 +; GFX9-NEXT: v_add_co_u32_e32 v4, vcc, v12, v4 +; GFX9-NEXT: v_addc_co_u32_e32 v4, vcc, v14, v5, vcc +; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, v7, v13, vcc ; GFX9-NEXT: v_add_co_u32_e32 v4, vcc, v4, v6 -; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, v5, v7, vcc -; GFX9-NEXT: v_mul_lo_u32 v6, v0, v5 -; GFX9-NEXT: v_mul_hi_u32 v7, v0, v4 -; GFX9-NEXT: v_mul_hi_u32 v8, v0, v5 -; GFX9-NEXT: v_mul_hi_u32 v9, v1, v5 -; GFX9-NEXT: v_mul_lo_u32 v5, v1, v5 -; GFX9-NEXT: v_add_co_u32_e32 v6, vcc, v7, v6 +; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v5, vcc +; GFX9-NEXT: v_add_co_u32_e32 v12, vcc, v9, v4 +; GFX9-NEXT: v_addc_co_u32_e32 v14, vcc, v8, v5, vcc +; GFX9-NEXT: v_mul_lo_u32 v6, v10, v14 +; GFX9-NEXT: v_mul_lo_u32 v7, v11, v12 +; GFX9-NEXT: v_mad_u64_u32 v[4:5], s[4:5], v10, v12, 0 +; GFX9-NEXT: v_add3_u32 v7, v5, v6, v7 +; GFX9-NEXT: v_mad_u64_u32 v[5:6], s[4:5], v14, v7, 0 +; GFX9-NEXT: v_mad_u64_u32 v[7:8], s[4:5], v12, v7, 0 +; GFX9-NEXT: v_mul_hi_u32 v11, v12, v4 +; GFX9-NEXT: v_mad_u64_u32 v[9:10], s[4:5], v14, v4, 0 +; GFX9-NEXT: v_add_co_u32_e32 v4, vcc, v11, v7 ; GFX9-NEXT: 
v_addc_co_u32_e32 v7, vcc, 0, v8, vcc -; GFX9-NEXT: v_mul_lo_u32 v8, v1, v4 -; GFX9-NEXT: v_mul_hi_u32 v4, v1, v4 -; GFX9-NEXT: v_add_co_u32_e32 v6, vcc, v6, v8 -; GFX9-NEXT: v_addc_co_u32_e32 v4, vcc, v7, v4, vcc -; GFX9-NEXT: v_addc_co_u32_e32 v6, vcc, v9, v12, vcc +; GFX9-NEXT: v_add_co_u32_e32 v4, vcc, v4, v9 +; GFX9-NEXT: v_addc_co_u32_e32 v4, vcc, v7, v10, vcc +; GFX9-NEXT: v_addc_co_u32_e32 v6, vcc, v6, v13, vcc ; GFX9-NEXT: v_add_co_u32_e32 v4, vcc, v4, v5 ; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v6, vcc -; GFX9-NEXT: v_mul_lo_u32 v6, v3, v4 -; GFX9-NEXT: v_mul_lo_u32 v7, v2, v5 -; GFX9-NEXT: v_mul_hi_u32 v8, v2, v4 </cut>

4 years, 5 months

[ACTIVITY] week ending 21 Nov 2021

by Richard Henderson

[UM-2]
* release work
* revived some 6-month-old ppc fpu fixes
* reviews: loongarch, riscv, watchpoints, gdbstub.

r~

4 years, 5 months

[ACTIVITY] week ending Nov. 21 2021

by Alex Bennée

VirtIO Initiative ([STR-9])
===========================

  - posted Initial thoughts for test scenarios for AF_XDP epic
    Message-Id: <87k0h5v6ju.fsf(a)linaro.org>

vhost-device maintainer effort ([UM-196])

  - more review
  - [did some more noodling with rust] to get comfortable with generics

[UM-196] <https://linaro.atlassian.net/browse/UM-196>
[vhost-device crate] <https://github.com/rust-vmm/vhost-device>
[did some more noodling with rust] <https://gitlab.com/stsquad/softfloat.rs>

QEMU Upstream Work ([UM-2])
===========================

  - posted [PULL for 6.2 0/7] misc build and test fixes
    Message-Id: <20211116162515.4100231-1-alex.bennee(a)linaro.org>
  - posted [RFC PATCH] tests/avocado: fix tcg_plugin mem access count test
    Message-Id: <20211117095448.136558-1-alex.bennee(a)linaro.org>
  - posted Re: [RFC PATCH] plugins/meson.build: fix linker issue with weird paths (for v6.2?)
    Message-Id: <20211117111924.179776-1-alex.bennee(a)linaro.org>
  - posted Re: [PATCH v2 1/3] icount: preserve cflags when custom tb is about to execute
    Message-Id: <87h7cbw1tx.fsf(a)linaro.org>
  - posted [RFC PATCH] gdbstub: handle a potentially racing TaskState
    Message-Id: <20211119145124.942390-1-alex.bennee(a)linaro.org>

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Upstream MTTCG tests ([QEMU-52])

  - posted [kvm-unit-tests PATCH v3 0/3] GIC ITS tests
    Message-Id: <20211112114734.3058678-1-alex.bennee(a)linaro.org>
  - posted [kvm-unit-tests PATCH v8 00/10] MTTCG sanity tests for ARM
    Message-Id: <20211118184650.661575-1-alex.bennee(a)linaro.org>

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
[mttcg tests to current state and fixed up] <https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>

Completed Reviews [2/2]
=======================

[PATCH v2 0/3] Some watchpoint-related patches
  Message-Id: <163662450348.125458.5494710452733592356.stgit@pasha-ThinkPad-X280>
[PATCH 0/5] Update linux-headers + NOIRQ support for KVM gdbstub
  Message-Id: <20211111110604.207376-1-pbonzini(a)redhat.com>

Absences
========

  - none

Current Review Queue
====================

TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64 Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
===============================================================================================================

TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================

TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================

TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================

--
Alex Bennée

4 years, 5 months

[ACTIVITY] report week ending 19 Nov

by Peter Maydell

Progress (short week, 3 days)

* UM-2 [QEMU upstream maintainership]
  - Still trying to sort out the regression of booting EL3 guest code
    on the imx7 board. I got most of the way through prototyping a
    cleanup which would fix this, but then spotted that the highbank
    board has a more awkward-to-fix similar problem. We're going to
    revert the PSCI emulation change for 6.2 so we can take the time
    to get the cleanup right and land it in 7.0.
  - Usual patch accumulation, review, etc during release cycle

-- PMM

4 years, 5 months

[TCWG CI] Regression caused by gcc: tree-optimization/102880 - make PHI-OPT recognize more CFGs

by ci_notify＠linaro.org

[TCWG CI] Regression caused by gcc: tree-optimization/102880 - make PHI-OPT recognize more CFGs:
commit f98f373dd822b35c52356b753d528924e9f89678
Author: Richard Biener <rguenther(a)suse.de>

    tree-optimization/102880 - make PHI-OPT recognize more CFGs

Results regressed to
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1: -5
# build_abe qemu: -2
# linux_n_obj: 21059
# First few build errors in logs:
# 00:20:13 drivers/net/wireless/realtek/rtlwifi/rtl8192se/hw.c:2521:1: error: definition in block 22 does not dominate use in block 21
# 00:20:13 drivers/net/wireless/realtek/rtlwifi/rtl8192se/hw.c:2521:1: internal compiler error: verify_ssa failed
# 00:20:13 make[6]: *** [scripts/Makefile.build:280: drivers/net/wireless/realtek/rtlwifi/rtl8192se/hw.o] Error 1
# 00:20:20 make[5]: *** [scripts/Makefile.build:497: drivers/net/wireless/realtek/rtlwifi/rtl8192se] Error 2
# 00:24:02 make[4]: *** [scripts/Makefile.build:497: drivers/net/wireless/realtek/rtlwifi] Error 2
# 00:24:02 make[3]: *** [scripts/Makefile.build:497: drivers/net/wireless/realtek] Error 2
# 00:24:02 make[2]: *** [scripts/Makefile.build:497: drivers/net/wireless] Error 2
# 00:25:21 drivers/staging/comedi/drivers/addi_apci_3120.c:1117:1: error: definition in block 10 does not dominate use in block 11
# 00:25:21 drivers/staging/comedi/drivers/addi_apci_3120.c:1117:1: internal compiler error: verify_ssa failed
# 00:25:22 make[4]: *** [scripts/Makefile.build:280: drivers/staging/comedi/drivers/addi_apci_3120.o] Error 1

from
# reset_artifacts: -10
# build_abe binutils: -9
# build_abe stage1: -5
# build_abe qemu: -2
# linux_n_obj: 28893
# linux build successful: all

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.

This commit has regressed these CI configurations:
 - tcwg_kernel/gnu-master-arm-lts-allmodconfig

First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-lts-allmodc…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-lts-allmodc…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-lts-allmodc…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-lts-allmodc…

Reproduce builds:
<cut>
mkdir investigate-gcc-f98f373dd822b35c52356b753d528924e9f89678
cd investigate-gcc-f98f373dd822b35c52356b753d528924e9f89678

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-lts-allmodc… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-lts-allmodc… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-lts-allmodc… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/

cd gcc

# Reproduce first_bad build
git checkout --detach f98f373dd822b35c52356b753d528924e9f89678
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach d699f03720fce57b319276226ac4a463a8538e9f
../artifacts/test.sh

cd ..
</cut> Full commit (up to 1000 lines): <cut> commit f98f373dd822b35c52356b753d528924e9f89678 Author: Richard Biener <rguenther(a)suse.de> Date: Mon Nov 15 15:19:36 2021 +0100 tree-optimization/102880 - make PHI-OPT recognize more CFGs This allows extra edges into the middle BB for the PHI-OPT transforms using replace_phi_edge_with_variable that do not end up moving stmts from that middle BB. This avoids regressing gcc.dg/tree-ssa/ssa-hoist-4.c with the actual fix for PR102880 where CFG cleanup has the choice to remove two forwarders and picks "the wrong" leading to if (a > b) / /\ / / <BB> / | # PHI <a, b> rather than if (a > b) | /\ | <BB> \ | / \ | # PHI <a, b, b> but it's relatively straight-forward to support extra edges into the middle-BB in paths ending in replace_phi_edge_with_variable and that do not require moving stmts. That's because we really only want to remove the edge from the condition to the middle BB. Of course actually doing that means updating dominators in non-trival ways which is why I kept the original code for the single edge case and simply defer to CFG cleanup by adjusting the condition for the complicated case. The testcase needs to be a GIMPLE one since it's quite unreliable to produce the desired CFG. 2021-11-15 Richard Biener <rguenther(a)suse.de> PR tree-optimization/102880 * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Push single_pred (bb1) condition to places that really need it. (match_simplify_replacement): Likewise. (value_replacement): Likewise. (replace_phi_edge_with_variable): Deal with extra edges into the middle BB. * gcc.dg/tree-ssa/phi-opt-26.c: New testcase. 
--- gcc/testsuite/gcc.dg/tree-ssa/phi-opt-26.c | 31 +++++++++++++ gcc/tree-ssa-phiopt.c | 71 +++++++++++++++++------------- 2 files changed, 72 insertions(+), 30 deletions(-) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-26.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-26.c new file mode 100644 index 00000000000..21aa66e38b8 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-26.c @@ -0,0 +1,31 @@ +/* { dg-do compile } */ +/* { dg-options "-O -fgimple -fdump-tree-phiopt1" } */ + +int __GIMPLE (ssa,startwith("phiopt")) +foo (int a, int b, int flag) +{ + int res; + + __BB(2): + if (flag_2(D) != 0) + goto __BB6; + else + goto __BB4; + + __BB(4): + if (a_3(D) > b_4(D)) + goto __BB7; + else + goto __BB6; + + __BB(6): + goto __BB7; + + __BB(7): + res_1 = __PHI (__BB4: a_3(D), __BB6: b_4(D)); + return res_1; +} + +/* We should be able to detect MAX despite the extra edge into + the middle BB. */ +/* { dg-final { scan-tree-dump "MAX" "phiopt1" } } */ diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c index 173ac835ca6..6b22f6bedd4 100644 --- a/gcc/tree-ssa-phiopt.c +++ b/gcc/tree-ssa-phiopt.c @@ -220,7 +220,6 @@ tree_ssa_phiopt_worker (bool do_store_elim, bool do_hoist_loads, bool early_p) /* If either bb1's succ or bb2 or bb2's succ is non NULL. */ if (EDGE_COUNT (bb1->succs) == 0 - || bb2 == NULL || EDGE_COUNT (bb2->succs) == 0) continue; @@ -276,14 +275,14 @@ tree_ssa_phiopt_worker (bool do_store_elim, bool do_hoist_loads, bool early_p) || (e1->flags & EDGE_FALLTHRU) == 0) continue; - /* Also make sure that bb1 only have one predecessor and that it - is bb. */ - if (!single_pred_p (bb1) - || single_pred (bb1) != bb) - continue; - if (do_store_elim) { + /* Also make sure that bb1 only have one predecessor and that it + is bb. */ + if (!single_pred_p (bb1) + || single_pred (bb1) != bb) + continue; + /* bb1 is the middle block, bb2 the join block, bb the split block, e1 the fallthrough edge from bb1 to bb2. 
We can't do the optimization if the join block has more than two predecessors. */ @@ -328,10 +327,11 @@ tree_ssa_phiopt_worker (bool do_store_elim, bool do_hoist_loads, bool early_p) node. */ gcc_assert (arg0 != NULL_TREE && arg1 != NULL_TREE); - gphi *newphi = factor_out_conditional_conversion (e1, e2, phi, - arg0, arg1, - cond_stmt); - if (newphi != NULL) + gphi *newphi; + if (single_pred_p (bb1) + && (newphi = factor_out_conditional_conversion (e1, e2, phi, + arg0, arg1, + cond_stmt))) { phi = newphi; /* factor_out_conditional_conversion may create a new PHI in @@ -350,12 +350,14 @@ tree_ssa_phiopt_worker (bool do_store_elim, bool do_hoist_loads, bool early_p) early_p)) cfgchanged = true; else if (!early_p + && single_pred_p (bb1) && cond_removal_in_builtin_zero_pattern (bb, bb1, e1, e2, phi, arg0, arg1)) cfgchanged = true; else if (minmax_replacement (bb, bb1, e1, e2, phi, arg0, arg1)) cfgchanged = true; - else if (spaceship_replacement (bb, bb1, e1, e2, phi, arg0, arg1)) + else if (single_pred_p (bb1) + && spaceship_replacement (bb, bb1, e1, e2, phi, arg0, arg1)) cfgchanged = true; } } @@ -386,7 +388,6 @@ replace_phi_edge_with_variable (basic_block cond_block, edge e, gphi *phi, tree new_tree) { basic_block bb = gimple_bb (phi); - basic_block block_to_remove; gimple_stmt_iterator gsi; tree phi_result = PHI_RESULT (phi); @@ -422,28 +423,33 @@ replace_phi_edge_with_variable (basic_block cond_block, SET_USE (PHI_ARG_DEF_PTR (phi, e->dest_idx), new_tree); /* Remove the empty basic block. 
*/ + edge edge_to_remove; if (EDGE_SUCC (cond_block, 0)->dest == bb) + edge_to_remove = EDGE_SUCC (cond_block, 1); + else + edge_to_remove = EDGE_SUCC (cond_block, 0); + if (EDGE_COUNT (edge_to_remove->dest->preds) == 1) { - EDGE_SUCC (cond_block, 0)->flags |= EDGE_FALLTHRU; - EDGE_SUCC (cond_block, 0)->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); - EDGE_SUCC (cond_block, 0)->probability = profile_probability::always (); + e->flags |= EDGE_FALLTHRU; + e->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); + e->probability = profile_probability::always (); + delete_basic_block (edge_to_remove->dest); - block_to_remove = EDGE_SUCC (cond_block, 1)->dest; + /* Eliminate the COND_EXPR at the end of COND_BLOCK. */ + gsi = gsi_last_bb (cond_block); + gsi_remove (&gsi, true); } else { - EDGE_SUCC (cond_block, 1)->flags |= EDGE_FALLTHRU; - EDGE_SUCC (cond_block, 1)->flags - &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); - EDGE_SUCC (cond_block, 1)->probability = profile_probability::always (); - - block_to_remove = EDGE_SUCC (cond_block, 0)->dest; + /* If there are other edges into the middle block make + CFG cleanup deal with the edge removal to avoid + updating dominators here in a non-trivial way. */ + gcond *cond = as_a <gcond *> (last_stmt (cond_block)); + if (edge_to_remove->flags & EDGE_TRUE_VALUE) + gimple_cond_make_false (cond); + else + gimple_cond_make_true (cond); } - delete_basic_block (block_to_remove); - - /* Eliminate the COND_EXPR at the end of COND_BLOCK. */ - gsi = gsi_last_bb (cond_block); - gsi_remove (&gsi, true); statistics_counter_event (cfun, "Replace PHI with variable", 1); @@ -959,6 +965,9 @@ match_simplify_replacement (basic_block cond_bb, basic_block middle_bb, allow it and move it once the transformation is done. 
*/ if (!empty_block_p (middle_bb)) { + if (!single_pred_p (middle_bb)) + return false; + stmt_to_move = last_and_only_stmt (middle_bb); if (!stmt_to_move) return false; @@ -1351,7 +1360,10 @@ value_replacement (basic_block cond_bb, basic_block middle_bb, } else { - statistics_counter_event (cfun, "Replace PHI with variable/value_replacement", 1); + if (!single_pred_p (middle_bb)) + return 0; + statistics_counter_event (cfun, "Replace PHI with " + "variable/value_replacement", 1); /* Replace the PHI arguments with arg. */ SET_PHI_ARG_DEF (phi, e0->dest_idx, arg); @@ -1367,7 +1379,6 @@ value_replacement (basic_block cond_bb, basic_block middle_bb, } return 1; } - } /* Now optimize (x != 0) ? x + y : y to just x + y. */ </cut>
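
To make the control flow in the new phi-opt-26.c testcase easier to follow, here is a small Python model of it. This is only an illustration and not GCC code: `foo_cfg`, `foo_phiopt` and `models_agree` are hypothetical names, and `foo_phiopt` reflects one reading of what the pass is expected to produce (a MAX on the path out of the `a > b` test, with the extra edge from the `flag` test into the middle block left alone).

```python
from itertools import product


def foo_cfg(a: int, b: int, flag: int) -> int:
    """Follow the blocks of the GIMPLE testcase one by one."""
    # __BB(2): flag != 0 jumps straight to the middle block __BB6.
    if flag != 0:
        return b          # __BB6 -> __BB7, the PHI picks b
    # __BB(4): a > b goes directly to the join block __BB7.
    if a > b:
        return a          # __BB4 -> __BB7, the PHI picks a
    return b              # __BB4 -> __BB6 -> __BB7, the PHI picks b


def foo_phiopt(a: int, b: int, flag: int) -> int:
    """Assumed shape after PHI-OPT: the a > b test becomes a MAX."""
    if flag != 0:
        return b          # extra edge into the middle block is untouched
    return max(a, b)      # replaces the conditional on the __BB4 path


def models_agree(values=range(-3, 4)) -> bool:
    """Check that both models return the same result on a small grid."""
    return all(
        foo_cfg(a, b, f) == foo_phiopt(a, b, f)
        for a, b, f in product(values, values, (0, 1))
    )
```

Calling `models_agree()` compares the two functions for every combination of the small inputs tried, which is the property the transformation has to preserve. It says nothing about the `verify_ssa` failures reported above, which concern where the definitions end up in the real CFG.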

4 years, 5 months

[TCWG CI] 464.h264ref slowed down by 6% after llvm: [unroll] Keep unrolled iterations with initial iteration

by ci_notify＠linaro.org

After llvm commit de2fed61528a5584dc54c47f6754408597be24de
Author: Philip Reames <listmail(a)philipreames.com>

    [unroll] Keep unrolled iterations with initial iteration

the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 6% from 10902 to 11518 perf samples
  - 464.h264ref:[.] FastFullPelBlockMotionSearch slowed down by 43% from 1494 to 2141 perf samples

The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.

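
As a quick sanity check on the figures in this report, the quoted percentages can be recomputed from the raw perf sample counts. The sketch below assumes the slowdown is simply the relative increase in samples, rounded to a whole percent; the report does not spell out the formula the CI uses, and `slowdown_percent` is a hypothetical helper name.

```python
def slowdown_percent(before: int, after: int) -> float:
    """Relative increase in perf samples, expressed in percent."""
    if before <= 0:
        raise ValueError("baseline sample count must be positive")
    return (after - before) / before * 100.0


# Sample counts quoted in the report.
benchmark = slowdown_percent(10902, 11518)   # whole 464.h264ref run
hot_symbol = slowdown_percent(1494, 2141)    # FastFullPelBlockMotionSearch

print(f"464.h264ref: {benchmark:.2f}%")
print(f"FastFullPelBlockMotionSearch: {hot_symbol:.2f}%")
```

Under that assumption the two values round to the 6% and 43% given in the report. Note also that FastFullPelBlockMotionSearch alone gained 647 samples, slightly more than the 616 gained by the benchmark as a whole, so the regression appears to be concentrated in that one function.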
This commit has regressed these CI configurations:
 - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-de2fed61528a5584dc54c47f6754408597be24de
cd investigate-llvm-de2fed61528a5584dc54c47f6754408597be24de

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach de2fed61528a5584dc54c47f6754408597be24de
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach da25f968a90ad4560fc920a6d18fc2a0221d2750
../artifacts/test.sh

cd ..
</cut> Full commit (up to 1000 lines): <cut> commit de2fed61528a5584dc54c47f6754408597be24de Author: Philip Reames <listmail(a)philipreames.com> Date: Fri Nov 12 11:35:28 2021 -0800 [unroll] Keep unrolled iterations with initial iteration The unrolling code was previously inserting new cloned blocks at the end of the function. The result of this with typical loop structures is that the new iterations are placed far from the initial iteration. With unrolling, the general assumption is that the a) the loop is reasonable hot, and b) the first Count-1 copies of the loop are rarely (if ever) loop exiting. As such, placing Count-1 copies out of line is a fairly poor code placement choice. We'd much rather fall through into the hot (non-exiting) path. For code with branch profiles, later layout would fix this, but this may have a positive impact on non-PGO compiled code. However, the real motivation for this change isn't performance. Its readability and human understanding. Having to jump around long distances in an IR file to trace an unrolled loop structure is error prone and tedious. 
--- llvm/lib/Transforms/Utils/LoopUnroll.cpp | 6 +- llvm/test/DebugInfo/unrolled-loop-remainder.ll | 86 +- .../Transforms/LoopUnroll/2011-08-08-PhiUpdate.ll | 66 +- .../Transforms/LoopUnroll/2011-08-09-PhiUpdate.ll | 24 +- .../LoopUnroll/AArch64/runtime-unroll-generic.ll | 4 +- .../LoopUnroll/AArch64/thresholdO3-cost-model.ll | 8 +- .../LoopUnroll/AArch64/unroll-upperbound.ll | 4 +- .../Transforms/LoopUnroll/ARM/loop-unrolling.ll | 4 +- .../test/Transforms/LoopUnroll/ARM/multi-blocks.ll | 230 +- llvm/test/Transforms/LoopUnroll/ARM/upperbound.ll | 10 +- .../LoopUnroll/full-unroll-keep-first-exit.ll | 16 +- .../full-unroll-one-unpredictable-exit.ll | 16 +- llvm/test/Transforms/LoopUnroll/multiple-exits.ll | 8 +- llvm/test/Transforms/LoopUnroll/nonlatchcondbr.ll | 20 +- .../LoopUnroll/partial-unroll-non-latch-exit.ll | 14 +- .../partially-unroll-unconditional-latch.ll | 4 +- .../LoopUnroll/runtime-loop-at-most-two-exits.ll | 120 +- .../runtime-loop-multiexit-dom-verify.ll | 206 +- .../LoopUnroll/runtime-loop-multiple-exits.ll | 2560 ++++++++++---------- llvm/test/Transforms/LoopUnroll/runtime-loop5.ll | 34 +- .../LoopUnroll/runtime-multiexit-heuristic.ll | 122 +- .../LoopUnroll/runtime-small-upperbound.ll | 8 +- .../LoopUnroll/runtime-unroll-remainder.ll | 62 +- llvm/test/Transforms/LoopUnroll/scevunroll.ll | 48 +- .../Transforms/LoopUnroll/shifted-tripcount.ll | 4 +- ...er-exiting-with-phis-multiple-exiting-blocks.ll | 20 +- .../LoopUnroll/unroll-unconditional-latch.ll | 12 +- .../Transforms/LoopUnrollAndJam/unroll-and-jam.ll | 68 +- .../PhaseOrdering/AArch64/matrix-extract-insert.ll | 4 +- 29 files changed, 1896 insertions(+), 1892 deletions(-) diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp index ce463927fd50..b0c622b98d5e 100644 --- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp +++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp @@ -514,6 +514,10 @@ LoopUnrollResult llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo 
*LI, SmallVector<MDNode *, 6> LoopLocalNoAliasDeclScopes; identifyNoAliasScopesToClone(L->getBlocks(), LoopLocalNoAliasDeclScopes); + // We place the unrolled iterations immediately after the original loop + // latch. This is a reasonable default placement if we don't have block + // frequencies, and if we do, well the layout will be adjusted later. + auto BlockInsertPt = std::next(LatchBlock->getIterator()); for (unsigned It = 1; It != ULO.Count; ++It) { SmallVector<BasicBlock *, 8> NewBlocks; SmallDenseMap<const Loop *, Loop *, 4> NewLoops; @@ -522,7 +526,7 @@ LoopUnrollResult llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo *LI, for (LoopBlocksDFS::RPOIterator BB = BlockBegin; BB != BlockEnd; ++BB) { ValueToValueMapTy VMap; BasicBlock *New = CloneBasicBlock(*BB, VMap, "." + Twine(It)); - Header->getParent()->getBasicBlockList().push_back(New); + Header->getParent()->getBasicBlockList().insert(BlockInsertPt, New); assert((*BB != Header || LI->getLoopFor(*BB) == L) && "Header should not be in a sub-loop"); diff --git a/llvm/test/DebugInfo/unrolled-loop-remainder.ll b/llvm/test/DebugInfo/unrolled-loop-remainder.ll index 83c30dec780d..ba4ce1f409f6 100644 --- a/llvm/test/DebugInfo/unrolled-loop-remainder.ll +++ b/llvm/test/DebugInfo/unrolled-loop-remainder.ll @@ -38,71 +38,71 @@ define i32 @func_c() local_unnamed_addr #0 !dbg !14 { ; CHECK-NEXT: [[PROL_ITER_SUB:%.*]] = sub i32 [[XTRAITER]], 1, !dbg [[DBG24]] ; CHECK-NEXT: [[PROL_ITER_CMP:%.*]] = icmp ne i32 [[PROL_ITER_SUB]], 0, !dbg [[DBG24]] ; CHECK-NEXT: br i1 [[PROL_ITER_CMP]], label [[FOR_BODY_PROL_1:%.*]], label [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA:%.*]], !dbg [[DBG24]] +; CHECK: for.body.prol.1: +; CHECK-NEXT: [[ARRAYIDX_PROL_1:%.*]] = getelementptr inbounds i32, i32* [[TMP6]], i64 1, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX_PROL_1]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] +; CHECK-NEXT: [[CONV_PROL_1:%.*]] = sext i32 [[TMP7]] to i64, !dbg [[DBG28]] +; CHECK-NEXT: 
[[TMP8:%.*]] = inttoptr i64 [[CONV_PROL_1]] to i32*, !dbg [[DBG28]] +; CHECK-NEXT: [[ADD_PROL_1:%.*]] = add nsw i32 [[ADD_PROL]], 2, !dbg [[DBG29]] +; CHECK-NEXT: [[PROL_ITER_SUB_1:%.*]] = sub i32 [[PROL_ITER_SUB]], 1, !dbg [[DBG24]] +; CHECK-NEXT: [[PROL_ITER_CMP_1:%.*]] = icmp ne i32 [[PROL_ITER_SUB_1]], 0, !dbg [[DBG24]] +; CHECK-NEXT: br i1 [[PROL_ITER_CMP_1]], label [[FOR_BODY_PROL_2:%.*]], label [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA]], !dbg [[DBG24]] +; CHECK: for.body.prol.2: +; CHECK-NEXT: [[ARRAYIDX_PROL_2:%.*]] = getelementptr inbounds i32, i32* [[TMP8]], i64 1, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[ARRAYIDX_PROL_2]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] +; CHECK-NEXT: [[CONV_PROL_2:%.*]] = sext i32 [[TMP9]] to i64, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[CONV_PROL_2]] to i32*, !dbg [[DBG28]] +; CHECK-NEXT: [[ADD_PROL_2:%.*]] = add nsw i32 [[ADD_PROL_1]], 2, !dbg [[DBG29]] +; CHECK-NEXT: br label [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA]] ; CHECK: for.body.prol.loopexit.unr-lcssa: -; CHECK-NEXT: [[DOTLCSSA_UNR_PH:%.*]] = phi i32* [ [[TMP6]], [[FOR_BODY_PROL]] ], [ [[TMP20:%.*]], [[FOR_BODY_PROL_1]] ], [ [[TMP22:%.*]], [[FOR_BODY_PROL_2:%.*]] ] -; CHECK-NEXT: [[DOTUNR_PH:%.*]] = phi i32* [ [[TMP6]], [[FOR_BODY_PROL]] ], [ [[TMP20]], [[FOR_BODY_PROL_1]] ], [ [[TMP22]], [[FOR_BODY_PROL_2]] ] -; CHECK-NEXT: [[DOTUNR1_PH:%.*]] = phi i32 [ [[ADD_PROL]], [[FOR_BODY_PROL]] ], [ [[ADD_PROL_1:%.*]], [[FOR_BODY_PROL_1]] ], [ [[ADD_PROL_2:%.*]], [[FOR_BODY_PROL_2]] ] +; CHECK-NEXT: [[DOTLCSSA_UNR_PH:%.*]] = phi i32* [ [[TMP6]], [[FOR_BODY_PROL]] ], [ [[TMP8]], [[FOR_BODY_PROL_1]] ], [ [[TMP10]], [[FOR_BODY_PROL_2]] ] +; CHECK-NEXT: [[DOTUNR_PH:%.*]] = phi i32* [ [[TMP6]], [[FOR_BODY_PROL]] ], [ [[TMP8]], [[FOR_BODY_PROL_1]] ], [ [[TMP10]], [[FOR_BODY_PROL_2]] ] +; CHECK-NEXT: [[DOTUNR1_PH:%.*]] = phi i32 [ [[ADD_PROL]], [[FOR_BODY_PROL]] ], [ [[ADD_PROL_1]], [[FOR_BODY_PROL_1]] ], [ [[ADD_PROL_2]], [[FOR_BODY_PROL_2]] ] ; 
CHECK-NEXT: br label [[FOR_BODY_PROL_LOOPEXIT]], !dbg [[DBG24]] ; CHECK: for.body.prol.loopexit: ; CHECK-NEXT: [[DOTLCSSA_UNR:%.*]] = phi i32* [ undef, [[FOR_BODY_LR_PH]] ], [ [[DOTLCSSA_UNR_PH]], [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA]] ] ; CHECK-NEXT: [[DOTUNR:%.*]] = phi i32* [ [[A_PROMOTED]], [[FOR_BODY_LR_PH]] ], [ [[DOTUNR_PH]], [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA]] ] ; CHECK-NEXT: [[DOTUNR1:%.*]] = phi i32 [ [[DOTPR]], [[FOR_BODY_LR_PH]] ], [ [[DOTUNR1_PH]], [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA]] ] -; CHECK-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP3]], 3, !dbg [[DBG24]] -; CHECK-NEXT: br i1 [[TMP7]], label [[FOR_COND_FOR_END_CRIT_EDGE:%.*]], label [[FOR_BODY_LR_PH_NEW:%.*]], !dbg [[DBG24]] +; CHECK-NEXT: [[TMP11:%.*]] = icmp ult i32 [[TMP3]], 3, !dbg [[DBG24]] +; CHECK-NEXT: br i1 [[TMP11]], label [[FOR_COND_FOR_END_CRIT_EDGE:%.*]], label [[FOR_BODY_LR_PH_NEW:%.*]], !dbg [[DBG24]] ; CHECK: for.body.lr.ph.new: ; CHECK-NEXT: br label [[FOR_BODY:%.*]], !dbg [[DBG24]] ; CHECK: for.body: -; CHECK-NEXT: [[TMP8:%.*]] = phi i32* [ [[DOTUNR]], [[FOR_BODY_LR_PH_NEW]] ], [ [[TMP17:%.*]], [[FOR_BODY]] ], !dbg [[DBG28]] -; CHECK-NEXT: [[TMP9:%.*]] = phi i32 [ [[DOTUNR1]], [[FOR_BODY_LR_PH_NEW]] ], [ [[ADD_3:%.*]], [[FOR_BODY]] ] -; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[TMP8]], i64 1, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[ARRAYIDX]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] -; CHECK-NEXT: [[CONV:%.*]] = sext i32 [[TMP10]] to i64, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP11:%.*]] = inttoptr i64 [[CONV]] to i32*, !dbg [[DBG28]] -; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP9]], 2, !dbg [[DBG29]] -; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[TMP11]], i64 1, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] -; CHECK-NEXT: [[CONV_1:%.*]] = sext i32 [[TMP12]] to i64, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP13:%.*]] = inttoptr i64 [[CONV_1]] to i32*, 
!dbg [[DBG28]] +; CHECK-NEXT: [[TMP12:%.*]] = phi i32* [ [[DOTUNR]], [[FOR_BODY_LR_PH_NEW]] ], [ [[TMP21:%.*]], [[FOR_BODY]] ], !dbg [[DBG28]] +; CHECK-NEXT: [[TMP13:%.*]] = phi i32 [ [[DOTUNR1]], [[FOR_BODY_LR_PH_NEW]] ], [ [[ADD_3:%.*]], [[FOR_BODY]] ] +; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[TMP12]], i64 1, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] +; CHECK-NEXT: [[CONV:%.*]] = sext i32 [[TMP14]] to i64, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP15:%.*]] = inttoptr i64 [[CONV]] to i32*, !dbg [[DBG28]] +; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP13]], 2, !dbg [[DBG29]] +; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[TMP15]], i64 1, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] +; CHECK-NEXT: [[CONV_1:%.*]] = sext i32 [[TMP16]] to i64, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP17:%.*]] = inttoptr i64 [[CONV_1]] to i32*, !dbg [[DBG28]] ; CHECK-NEXT: [[ADD_1:%.*]] = add nsw i32 [[ADD]], 2, !dbg [[DBG29]] -; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* [[TMP13]], i64 1, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX_2]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] -; CHECK-NEXT: [[CONV_2:%.*]] = sext i32 [[TMP14]] to i64, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP15:%.*]] = inttoptr i64 [[CONV_2]] to i32*, !dbg [[DBG28]] +; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* [[TMP17]], i64 1, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP18:%.*]] = load i32, i32* [[ARRAYIDX_2]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] +; CHECK-NEXT: [[CONV_2:%.*]] = sext i32 [[TMP18]] to i64, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP19:%.*]] = inttoptr i64 [[CONV_2]] to i32*, !dbg [[DBG28]] ; CHECK-NEXT: [[ADD_2:%.*]] = add nsw i32 [[ADD_1]], 2, !dbg [[DBG29]] -; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* [[TMP15]], i64 1, !dbg [[DBG28]] -; 
CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[ARRAYIDX_3]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] -; CHECK-NEXT: [[CONV_3:%.*]] = sext i32 [[TMP16]] to i64, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP17]] = inttoptr i64 [[CONV_3]] to i32*, !dbg [[DBG28]] +; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* [[TMP19]], i64 1, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX_3]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] +; CHECK-NEXT: [[CONV_3:%.*]] = sext i32 [[TMP20]] to i64, !dbg [[DBG28]] +; CHECK-NEXT: [[TMP21]] = inttoptr i64 [[CONV_3]] to i32*, !dbg [[DBG28]] ; CHECK-NEXT: [[ADD_3]] = add nsw i32 [[ADD_2]], 2, !dbg [[DBG29]] ; CHECK-NEXT: [[TOBOOL_3:%.*]] = icmp eq i32 [[ADD_3]], 0, !dbg [[DBG24]] ; CHECK-NEXT: br i1 [[TOBOOL_3]], label [[FOR_COND_FOR_END_CRIT_EDGE_UNR_LCSSA:%.*]], label [[FOR_BODY]], !dbg [[DBG24]], !llvm.loop [[LOOP30:![0-9]+]] ; CHECK: for.cond.for.end_crit_edge.unr-lcssa: -; CHECK-NEXT: [[DOTLCSSA_PH:%.*]] = phi i32* [ [[TMP17]], [[FOR_BODY]] ] +; CHECK-NEXT: [[DOTLCSSA_PH:%.*]] = phi i32* [ [[TMP21]], [[FOR_BODY]] ] ; CHECK-NEXT: br label [[FOR_COND_FOR_END_CRIT_EDGE]], !dbg [[DBG24]] ; CHECK: for.cond.for.end_crit_edge: ; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi i32* [ [[DOTLCSSA_UNR]], [[FOR_BODY_PROL_LOOPEXIT]] ], [ [[DOTLCSSA_PH]], [[FOR_COND_FOR_END_CRIT_EDGE_UNR_LCSSA]] ], !dbg [[DBG28]] -; CHECK-NEXT: [[TMP18:%.*]] = add i32 [[TMP2]], 2, !dbg [[DBG24]] +; CHECK-NEXT: [[TMP22:%.*]] = add i32 [[TMP2]], 2, !dbg [[DBG24]] ; CHECK-NEXT: store i32* [[DOTLCSSA]], i32** @a, align 8, !dbg [[DBG25]], !tbaa [[TBAA26]] -; CHECK-NEXT: store i32 [[TMP18]], i32* @b, align 4, !dbg [[DBG33:![0-9]+]], !tbaa [[TBAA20]] +; CHECK-NEXT: store i32 [[TMP22]], i32* @b, align 4, !dbg [[DBG33:![0-9]+]], !tbaa [[TBAA20]] ; CHECK-NEXT: br label [[FOR_END]], !dbg [[DBG24]] ; CHECK: for.end: ; CHECK-NEXT: ret i32 undef, !dbg [[DBG34:![0-9]+]] -; CHECK: for.body.prol.1: -; CHECK-NEXT: [[ARRAYIDX_PROL_1:%.*]] = getelementptr inbounds 
i32, i32* [[TMP6]], i64 1, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP19:%.*]] = load i32, i32* [[ARRAYIDX_PROL_1]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] -; CHECK-NEXT: [[CONV_PROL_1:%.*]] = sext i32 [[TMP19]] to i64, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP20]] = inttoptr i64 [[CONV_PROL_1]] to i32*, !dbg [[DBG28]] -; CHECK-NEXT: [[ADD_PROL_1]] = add nsw i32 [[ADD_PROL]], 2, !dbg [[DBG29]] -; CHECK-NEXT: [[PROL_ITER_SUB_1:%.*]] = sub i32 [[PROL_ITER_SUB]], 1, !dbg [[DBG24]] -; CHECK-NEXT: [[PROL_ITER_CMP_1:%.*]] = icmp ne i32 [[PROL_ITER_SUB_1]], 0, !dbg [[DBG24]] -; CHECK-NEXT: br i1 [[PROL_ITER_CMP_1]], label [[FOR_BODY_PROL_2]], label [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA]], !dbg [[DBG24]] -; CHECK: for.body.prol.2: -; CHECK-NEXT: [[ARRAYIDX_PROL_2:%.*]] = getelementptr inbounds i32, i32* [[TMP20]], i64 1, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP21:%.*]] = load i32, i32* [[ARRAYIDX_PROL_2]], align 4, !dbg [[DBG28]], !tbaa [[TBAA20]] -; CHECK-NEXT: [[CONV_PROL_2:%.*]] = sext i32 [[TMP21]] to i64, !dbg [[DBG28]] -; CHECK-NEXT: [[TMP22]] = inttoptr i64 [[CONV_PROL_2]] to i32*, !dbg [[DBG28]] -; CHECK-NEXT: [[ADD_PROL_2]] = add nsw i32 [[ADD_PROL_1]], 2, !dbg [[DBG29]] -; CHECK-NEXT: br label [[FOR_BODY_PROL_LOOPEXIT_UNR_LCSSA]] ; entry: %.pr = load i32, i32* @b, align 4, !dbg !17, !tbaa !20 diff --git a/llvm/test/Transforms/LoopUnroll/2011-08-08-PhiUpdate.ll b/llvm/test/Transforms/LoopUnroll/2011-08-08-PhiUpdate.ll index 3e611430d69e..7bb2d732195a 100644 --- a/llvm/test/Transforms/LoopUnroll/2011-08-08-PhiUpdate.ll +++ b/llvm/test/Transforms/LoopUnroll/2011-08-08-PhiUpdate.ll @@ -17,24 +17,24 @@ define void @test1(i32 %i, i32 %j) nounwind uwtable ssp { ; CHECK-NEXT: [[SUB5:%.*]] = sub i32 [[SUB]], [[J:%.*]] ; CHECK-NEXT: [[COND2:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND2]], label [[IF_THEN_LOOPEXIT:%.*]], label [[IF_ELSE_1:%.*]] -; CHECK: if.then.loopexit: -; CHECK-NEXT: [[SUB5_LCSSA:%.*]] = phi i32 [ [[SUB5]], [[IF_ELSE]] ], [ [[SUB5_1:%.*]], [[IF_ELSE_1]] ], [ 
[[SUB5_2:%.*]], [[IF_ELSE_2:%.*]] ], [ [[SUB5_3]], [[IF_ELSE_3]] ] -; CHECK-NEXT: br label [[IF_THEN]] -; CHECK: if.then: -; CHECK-NEXT: [[I_TR:%.*]] = phi i32 [ [[I]], [[ENTRY:%.*]] ], [ [[SUB5_LCSSA]], [[IF_THEN_LOOPEXIT]] ] -; CHECK-NEXT: ret void ; CHECK: if.else.1: -; CHECK-NEXT: [[SUB5_1]] = sub i32 [[SUB5]], [[J]] +; CHECK-NEXT: [[SUB5_1:%.*]] = sub i32 [[SUB5]], [[J]] ; CHECK-NEXT: [[COND2_1:%.*]] = call zeroext i1 @check() -; CHECK-NEXT: br i1 [[COND2_1]], label [[IF_THEN_LOOPEXIT]], label [[IF_ELSE_2]] +; CHECK-NEXT: br i1 [[COND2_1]], label [[IF_THEN_LOOPEXIT]], label [[IF_ELSE_2:%.*]] ; CHECK: if.else.2: -; CHECK-NEXT: [[SUB5_2]] = sub i32 [[SUB5_1]], [[J]] +; CHECK-NEXT: [[SUB5_2:%.*]] = sub i32 [[SUB5_1]], [[J]] ; CHECK-NEXT: [[COND2_2:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND2_2]], label [[IF_THEN_LOOPEXIT]], label [[IF_ELSE_3]] ; CHECK: if.else.3: ; CHECK-NEXT: [[SUB5_3]] = sub i32 [[SUB5_2]], [[J]] ; CHECK-NEXT: [[COND2_3:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND2_3]], label [[IF_THEN_LOOPEXIT]], label [[IF_ELSE]], !llvm.loop [[LOOP0:![0-9]+]] +; CHECK: if.then.loopexit: +; CHECK-NEXT: [[SUB5_LCSSA:%.*]] = phi i32 [ [[SUB5]], [[IF_ELSE]] ], [ [[SUB5_1]], [[IF_ELSE_1]] ], [ [[SUB5_2]], [[IF_ELSE_2]] ], [ [[SUB5_3]], [[IF_ELSE_3]] ] +; CHECK-NEXT: br label [[IF_THEN]] +; CHECK: if.then: +; CHECK-NEXT: [[I_TR:%.*]] = phi i32 [ [[I]], [[ENTRY:%.*]] ], [ [[SUB5_LCSSA]], [[IF_THEN_LOOPEXIT]] ] +; CHECK-NEXT: ret void ; entry: %cond1 = call zeroext i1 @check() @@ -77,17 +77,11 @@ define i32 @test2(i32* nocapture %p, i32 %n) nounwind readonly { ; CHECK-NEXT: [[INDVAR_NEXT:%.*]] = add nuw nsw i64 [[INDVAR]], 1 ; CHECK-NEXT: [[EXITCOND:%.*]] = icmp ne i64 [[INDVAR_NEXT]], [[TMP]] ; CHECK-NEXT: br i1 [[EXITCOND]], label [[BB_1:%.*]], label [[BB1_BB2_CRIT_EDGE:%.*]] -; CHECK: bb1.bb2_crit_edge: -; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi i32 [ [[TMP2]], [[BB1]] ], [ [[TMP4:%.*]], [[BB1_1:%.*]] ], [ [[TMP6:%.*]], [[BB1_2:%.*]] ], [ 
[[TMP8]], [[BB1_3]] ] -; CHECK-NEXT: br label [[BB2]] -; CHECK: bb2: -; CHECK-NEXT: [[S_0_LCSSA:%.*]] = phi i32 [ [[DOTLCSSA]], [[BB1_BB2_CRIT_EDGE]] ], [ 0, [[ENTRY:%.*]] ] -; CHECK-NEXT: ret i32 [[S_0_LCSSA]] ; CHECK: bb.1: ; CHECK-NEXT: [[SCEVGEP_1:%.*]] = getelementptr i32, i32* [[P]], i64 [[INDVAR_NEXT]] ; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* [[SCEVGEP_1]], align 1 -; CHECK-NEXT: [[TMP4]] = add nsw i32 [[TMP3]], [[TMP2]] -; CHECK-NEXT: br label [[BB1_1]] +; CHECK-NEXT: [[TMP4:%.*]] = add nsw i32 [[TMP3]], [[TMP2]] +; CHECK-NEXT: br label [[BB1_1:%.*]] ; CHECK: bb1.1: ; CHECK-NEXT: [[INDVAR_NEXT_1:%.*]] = add nuw nsw i64 [[INDVAR_NEXT]], 1 ; CHECK-NEXT: [[EXITCOND_1:%.*]] = icmp ne i64 [[INDVAR_NEXT_1]], [[TMP]] @@ -95,8 +89,8 @@ define i32 @test2(i32* nocapture %p, i32 %n) nounwind readonly { ; CHECK: bb.2: ; CHECK-NEXT: [[SCEVGEP_2:%.*]] = getelementptr i32, i32* [[P]], i64 [[INDVAR_NEXT_1]] ; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[SCEVGEP_2]], align 1 -; CHECK-NEXT: [[TMP6]] = add nsw i32 [[TMP5]], [[TMP4]] -; CHECK-NEXT: br label [[BB1_2]] +; CHECK-NEXT: [[TMP6:%.*]] = add nsw i32 [[TMP5]], [[TMP4]] +; CHECK-NEXT: br label [[BB1_2:%.*]] ; CHECK: bb1.2: ; CHECK-NEXT: [[INDVAR_NEXT_2:%.*]] = add nuw nsw i64 [[INDVAR_NEXT_1]], 1 ; CHECK-NEXT: [[EXITCOND_2:%.*]] = icmp ne i64 [[INDVAR_NEXT_2]], [[TMP]] @@ -110,6 +104,12 @@ define i32 @test2(i32* nocapture %p, i32 %n) nounwind readonly { ; CHECK-NEXT: [[INDVAR_NEXT_3]] = add i64 [[INDVAR_NEXT_2]], 1 ; CHECK-NEXT: [[EXITCOND_3:%.*]] = icmp ne i64 [[INDVAR_NEXT_3]], [[TMP]] ; CHECK-NEXT: br i1 [[EXITCOND_3]], label [[BB]], label [[BB1_BB2_CRIT_EDGE]], !llvm.loop [[LOOP2:![0-9]+]] +; CHECK: bb1.bb2_crit_edge: +; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi i32 [ [[TMP2]], [[BB1]] ], [ [[TMP4]], [[BB1_1]] ], [ [[TMP6]], [[BB1_2]] ], [ [[TMP8]], [[BB1_3]] ] +; CHECK-NEXT: br label [[BB2]] +; CHECK: bb2: +; CHECK-NEXT: [[S_0_LCSSA:%.*]] = phi i32 [ [[DOTLCSSA]], [[BB1_BB2_CRIT_EDGE]] ], [ 0, [[ENTRY:%.*]] ] +; 
CHECK-NEXT: ret i32 [[S_0_LCSSA]] ; entry: %0 = icmp sgt i32 %n, 0 ; <i1> [#uses=1] @@ -162,20 +162,12 @@ define i32 @test3() nounwind uwtable ssp align 2 { ; CHECK: do.cond: ; CHECK-NEXT: [[COND3:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND3]], label [[DO_END:%.*]], label [[DO_BODY_1:%.*]] -; CHECK: do.end: -; CHECK-NEXT: br label [[RETURN]] -; CHECK: return.loopexit: -; CHECK-NEXT: [[TMP7_I_LCSSA:%.*]] = phi i32 [ [[TMP7_I]], [[LAND_LHS_TRUE]] ], [ [[TMP7_I_1:%.*]], [[LAND_LHS_TRUE_1:%.*]] ], [ [[TMP7_I_2:%.*]], [[LAND_LHS_TRUE_2:%.*]] ], [ [[TMP7_I_3:%.*]], [[LAND_LHS_TRUE_3:%.*]] ] -; CHECK-NEXT: br label [[RETURN]] -; CHECK: return: -; CHECK-NEXT: [[RETVAL_0:%.*]] = phi i32 [ 0, [[DO_END]] ], [ 0, [[ENTRY:%.*]] ], [ [[TMP7_I_LCSSA]], [[RETURN_LOOPEXIT]] ] -; CHECK-NEXT: ret i32 [[RETVAL_0]] ; CHECK: do.body.1: ; CHECK-NEXT: [[COND2_1:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND2_1]], label [[EXIT_1:%.*]], label [[DO_COND_1:%.*]] ; CHECK: exit.1: -; CHECK-NEXT: [[TMP7_I_1]] = load i32, i32* undef, align 8 -; CHECK-NEXT: br i1 undef, label [[DO_COND_1]], label [[LAND_LHS_TRUE_1]] +; CHECK-NEXT: [[TMP7_I_1:%.*]] = load i32, i32* undef, align 8 +; CHECK-NEXT: br i1 undef, label [[DO_COND_1]], label [[LAND_LHS_TRUE_1:%.*]] ; CHECK: land.lhs.true.1: ; CHECK-NEXT: br i1 true, label [[RETURN_LOOPEXIT]], label [[DO_COND_1]] ; CHECK: do.cond.1: @@ -185,8 +177,8 @@ define i32 @test3() nounwind uwtable ssp align 2 { ; CHECK-NEXT: [[COND2_2:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND2_2]], label [[EXIT_2:%.*]], label [[DO_COND_2:%.*]] ; CHECK: exit.2: -; CHECK-NEXT: [[TMP7_I_2]] = load i32, i32* undef, align 8 -; CHECK-NEXT: br i1 undef, label [[DO_COND_2]], label [[LAND_LHS_TRUE_2]] +; CHECK-NEXT: [[TMP7_I_2:%.*]] = load i32, i32* undef, align 8 +; CHECK-NEXT: br i1 undef, label [[DO_COND_2]], label [[LAND_LHS_TRUE_2:%.*]] ; CHECK: land.lhs.true.2: ; CHECK-NEXT: br i1 true, label [[RETURN_LOOPEXIT]], label [[DO_COND_2]] ; 
CHECK: do.cond.2: @@ -196,13 +188,21 @@ define i32 @test3() nounwind uwtable ssp align 2 { ; CHECK-NEXT: [[COND2_3:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND2_3]], label [[EXIT_3:%.*]], label [[DO_COND_3:%.*]] ; CHECK: exit.3: -; CHECK-NEXT: [[TMP7_I_3]] = load i32, i32* undef, align 8 -; CHECK-NEXT: br i1 undef, label [[DO_COND_3]], label [[LAND_LHS_TRUE_3]] +; CHECK-NEXT: [[TMP7_I_3:%.*]] = load i32, i32* undef, align 8 +; CHECK-NEXT: br i1 undef, label [[DO_COND_3]], label [[LAND_LHS_TRUE_3:%.*]] ; CHECK: land.lhs.true.3: ; CHECK-NEXT: br i1 true, label [[RETURN_LOOPEXIT]], label [[DO_COND_3]] ; CHECK: do.cond.3: ; CHECK-NEXT: [[COND3_3:%.*]] = call zeroext i1 @check() ; CHECK-NEXT: br i1 [[COND3_3]], label [[DO_END]], label [[DO_BODY]], !llvm.loop [[LOOP3:![0-9]+]] +; CHECK: do.end: +; CHECK-NEXT: br label [[RETURN]] +; CHECK: return.loopexit: +; CHECK-NEXT: [[TMP7_I_LCSSA:%.*]] = phi i32 [ [[TMP7_I]], [[LAND_LHS_TRUE]] ], [ [[TMP7_I_1]], [[LAND_LHS_TRUE_1]] ], [ [[TMP7_I_2]], [[LAND_LHS_TRUE_2]] ], [ [[TMP7_I_3]], [[LAND_LHS_TRUE_3]] ] +; CHECK-NEXT: br label [[RETURN]] +; CHECK: return: +; CHECK-NEXT: [[RETVAL_0:%.*]] = phi i32 [ 0, [[DO_END]] ], [ 0, [[ENTRY:%.*]] ], [ [[TMP7_I_LCSSA]], [[RETURN_LOOPEXIT]] ] +; CHECK-NEXT: ret i32 [[RETVAL_0]] ; entry: %cond1 = call zeroext i1 @check() diff --git a/llvm/test/Transforms/LoopUnroll/2011-08-09-PhiUpdate.ll b/llvm/test/Transforms/LoopUnroll/2011-08-09-PhiUpdate.ll index be4b6ff64fdd..af648bae8642 100644 --- a/llvm/test/Transforms/LoopUnroll/2011-08-09-PhiUpdate.ll +++ b/llvm/test/Transforms/LoopUnroll/2011-08-09-PhiUpdate.ll @@ -33,16 +33,13 @@ define i32 @foo() uwtable ssp align 2 { ; CHECK: do.cond: ; CHECK-NEXT: [[CMP18:%.*]] = icmp sgt i32 [[CALL2]], -1 ; CHECK-NEXT: br i1 [[CMP18]], label [[LAND_LHS_TRUE_I_1:%.*]], label [[RETURN]] -; CHECK: return: -; CHECK-NEXT: [[RETVAL_0:%.*]] = phi i32 [ [[TMP7_I]], [[LAND_LHS_TRUE]] ], [ 0, [[DO_COND]] ], [ [[TMP7_I_1:%.*]], [[LAND_LHS_TRUE_1:%.*]] ], 
[ 0, [[DO_COND_1:%.*]] ], [ [[TMP7_I_2:%.*]], [[LAND_LHS_TRUE_2:%.*]] ], [ 0, [[DO_COND_2:%.*]] ], [ [[TMP7_I_3:%.*]], [[LAND_LHS_TRUE_3:%.*]] ], [ 0, [[DO_COND_3:%.*]] ] -; CHECK-NEXT: ret i32 [[RETVAL_0]] ; CHECK: land.lhs.true.i.1: ; CHECK-NEXT: [[CMP4_I_1:%.*]] = call zeroext i1 @check() #[[ATTR0]] -; CHECK-NEXT: br i1 [[CMP4_I_1]], label [[BAR_EXIT_1:%.*]], label [[DO_COND_1]] +; CHECK-NEXT: br i1 [[CMP4_I_1]], label [[BAR_EXIT_1:%.*]], label [[DO_COND_1:%.*]] ; CHECK: bar.exit.1: -; CHECK-NEXT: [[TMP7_I_1]] = call i32 @getval() #[[ATTR0]] +; CHECK-NEXT: [[TMP7_I_1:%.*]] = call i32 @getval() #[[ATTR0]] ; CHECK-NEXT: [[CMP_NOT_1:%.*]] = icmp eq i32 [[TMP7_I_1]], 0 -; CHECK-NEXT: br i1 [[CMP_NOT_1]], label [[DO_COND_1]], label [[LAND_LHS_TRUE_1]] +; CHECK-NEXT: br i1 [[CMP_NOT_1]], label [[DO_COND_1]], label [[LAND_LHS_TRUE_1:%.*]] ; CHECK: land.lhs.true.1: ; CHECK-NEXT: [[CALL10_1:%.*]] = call i32 @getval() ; CHECK-NEXT: [[CMP11_1:%.*]] = icmp eq i32 [[CALL10_1]], 0 @@ -52,11 +49,11 @@ define i32 @foo() uwtable ssp align 2 { ; CHECK-NEXT: br i1 [[CMP18_1]], label [[LAND_LHS_TRUE_I_2:%.*]], label [[RETURN]] ; CHECK: land.lhs.true.i.2: ; CHECK-NEXT: [[CMP4_I_2:%.*]] = call zeroext i1 @check() #[[ATTR0]] -; CHECK-NEXT: br i1 [[CMP4_I_2]], label [[BAR_EXIT_2:%.*]], label [[DO_COND_2]] +; CHECK-NEXT: br i1 [[CMP4_I_2]], label [[BAR_EXIT_2:%.*]], label [[DO_COND_2:%.*]] ; CHECK: bar.exit.2: -; CHECK-NEXT: [[TMP7_I_2]] = call i32 @getval() #[[ATTR0]] +; CHECK-NEXT: [[TMP7_I_2:%.*]] = call i32 @getval() #[[ATTR0]] ; CHECK-NEXT: [[CMP_NOT_2:%.*]] = icmp eq i32 [[TMP7_I_2]], 0 -; CHECK-NEXT: br i1 [[CMP_NOT_2]], label [[DO_COND_2]], label [[LAND_LHS_TRUE_2]] +; CHECK-NEXT: br i1 [[CMP_NOT_2]], label [[DO_COND_2]], label [[LAND_LHS_TRUE_2:%.*]] ; CHECK: land.lhs.true.2: ; CHECK-NEXT: [[CALL10_2:%.*]] = call i32 @getval() ; CHECK-NEXT: [[CMP11_2:%.*]] = icmp eq i32 [[CALL10_2]], 0 @@ -66,11 +63,11 @@ define i32 @foo() uwtable ssp align 2 { ; CHECK-NEXT: br i1 [[CMP18_2]], 
label [[LAND_LHS_TRUE_I_3:%.*]], label [[RETURN]] ; CHECK: land.lhs.true.i.3: ; CHECK-NEXT: [[CMP4_I_3:%.*]] = call zeroext i1 @check() #[[ATTR0]] -; CHECK-NEXT: br i1 [[CMP4_I_3]], label [[BAR_EXIT_3:%.*]], label [[DO_COND_3]] +; CHECK-NEXT: br i1 [[CMP4_I_3]], label [[BAR_EXIT_3:%.*]], label [[DO_COND_3:%.*]] ; CHECK: bar.exit.3: -; CHECK-NEXT: [[TMP7_I_3]] = call i32 @getval() #[[ATTR0]] +; CHECK-NEXT: [[TMP7_I_3:%.*]] = call i32 @getval() #[[ATTR0]] ; CHECK-NEXT: [[CMP_NOT_3:%.*]] = icmp eq i32 [[TMP7_I_3]], 0 -; CHECK-NEXT: br i1 [[CMP_NOT_3]], label [[DO_COND_3]], label [[LAND_LHS_TRUE_3]] +; CHECK-NEXT: br i1 [[CMP_NOT_3]], label [[DO_COND_3]], label [[LAND_LHS_TRUE_3:%.*]] ; CHECK: land.lhs.true.3: ; CHECK-NEXT: [[CALL10_3:%.*]] = call i32 @getval() ; CHECK-NEXT: [[CMP11_3:%.*]] = icmp eq i32 [[CALL10_3]], 0 @@ -78,6 +75,9 @@ define i32 @foo() uwtable ssp align 2 { ; CHECK: do.cond.3: ; CHECK-NEXT: [[CMP18_3:%.*]] = icmp sgt i32 [[CALL2]], -1 ; CHECK-NEXT: br i1 [[CMP18_3]], label [[LAND_LHS_TRUE_I]], label [[RETURN]], !llvm.loop [[LOOP0:![0-9]+]] +; CHECK: return: +; CHECK-NEXT: [[RETVAL_0:%.*]] = phi i32 [ [[TMP7_I]], [[LAND_LHS_TRUE]] ], [ 0, [[DO_COND]] ], [ [[TMP7_I_1]], [[LAND_LHS_TRUE_1]] ], [ 0, [[DO_COND_1]] ], [ [[TMP7_I_2]], [[LAND_LHS_TRUE_2]] ], [ 0, [[DO_COND_2]] ], [ [[TMP7_I_3]], [[LAND_LHS_TRUE_3]] ], [ 0, [[DO_COND_3]] ] +; CHECK-NEXT: ret i32 [[RETVAL_0]] ; entry: br i1 undef, label %return, label %if.end diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/runtime-unroll-generic.ll b/llvm/test/Transforms/LoopUnroll/AArch64/runtime-unroll-generic.ll index 5bbab929c936..5c8f9ca01679 100644 --- a/llvm/test/Transforms/LoopUnroll/AArch64/runtime-unroll-generic.ll +++ b/llvm/test/Transforms/LoopUnroll/AArch64/runtime-unroll-generic.ll @@ -67,8 +67,6 @@ define void @runtime_unroll_generic(i32 %arg_0, i32* %arg_1, i16* %arg_2, i16* % ; CHECK-A55-NEXT: store i32 [[ADD21_EPIL]], i32* [[ARRAYIDX20]], align 4 ; CHECK-A55-NEXT: 
[[EPIL_ITER_CMP_NOT:%.*]] = icmp eq i32 [[XTRAITER]], 1 ; CHECK-A55-NEXT: br i1 [[EPIL_ITER_CMP_NOT]], label [[FOR_END]], label [[FOR_BODY6_EPIL_1:%.*]] -; CHECK-A55: for.end: -; CHECK-A55-NEXT: ret void ; CHECK-A55: for.body6.epil.1: ; CHECK-A55-NEXT: [[TMP14:%.*]] = load i16, i16* [[ARRAYIDX10]], align 2 ; CHECK-A55-NEXT: [[CONV_EPIL_1:%.*]] = sext i16 [[TMP14]] to i32 @@ -90,6 +88,8 @@ define void @runtime_unroll_generic(i32 %arg_0, i32* %arg_1, i16* %arg_2, i16* % ; CHECK-A55-NEXT: [[ADD21_EPIL_2:%.*]] = add nsw i32 [[MUL16_EPIL_2]], [[TMP19]] ; CHECK-A55-NEXT: store i32 [[ADD21_EPIL_2]], i32* [[ARRAYIDX20]], align 4 ; CHECK-A55-NEXT: br label [[FOR_END]] +; CHECK-A55: for.end: +; CHECK-A55-NEXT: ret void ; ; CHECK-GENERIC-LABEL: @runtime_unroll_generic( ; CHECK-GENERIC-NEXT: entry: diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/thresholdO3-cost-model.ll b/llvm/test/Transforms/LoopUnroll/AArch64/thresholdO3-cost-model.ll index ee07518f8cac..5c6ac690c0ca 100644 --- a/llvm/test/Transforms/LoopUnroll/AArch64/thresholdO3-cost-model.ll +++ b/llvm/test/Transforms/LoopUnroll/AArch64/thresholdO3-cost-model.ll @@ -21,10 +21,6 @@ define i32 @tripcount_11() { ; CHECK-NEXT: br label [[DO_BODY6:%.*]] ; CHECK: for.cond: ; CHECK-NEXT: br i1 true, label [[FOR_COND_1:%.*]], label [[IF_THEN11:%.*]] -; CHECK: do.body6: -; CHECK-NEXT: br i1 true, label [[FOR_COND:%.*]], label [[IF_THEN11]] -; CHECK: if.then11: -; CHECK-NEXT: unreachable ; CHECK: for.cond.1: ; CHECK-NEXT: br i1 true, label [[FOR_COND_2:%.*]], label [[IF_THEN11]] ; CHECK: for.cond.2: @@ -45,6 +41,10 @@ define i32 @tripcount_11() { ; CHECK-NEXT: br i1 true, label [[FOR_COND_10:%.*]], label [[IF_THEN11]] ; CHECK: for.cond.10: ; CHECK-NEXT: ret i32 0 +; CHECK: do.body6: +; CHECK-NEXT: br i1 true, label [[FOR_COND:%.*]], label [[IF_THEN11]] +; CHECK: if.then11: +; CHECK-NEXT: unreachable ; do.body6.preheader: br label %do.body6 diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll 
b/llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll index 3b82365d1a6e..ee905e5b10fe 100644 --- a/llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll +++ b/llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll @@ -18,8 +18,6 @@ define void @test(i1 %cond) { ; CHECK-NEXT: br label [[LATCH]] ; CHECK: latch: ; CHECK-NEXT: br i1 false, label [[FOR_END:%.*]], label [[FOR_BODY_1:%.*]] -; CHECK: for.end: -; CHECK-NEXT: ret void ; CHECK: for.body.1: ; CHECK-NEXT: switch i32 1, label [[SW_DEFAULT_1:%.*]] [ ; CHECK-NEXT: i32 2, label [[LATCH_1:%.*]] @@ -38,6 +36,8 @@ define void @test(i1 %cond) { ; CHECK-NEXT: br label [[LATCH_2]] ; CHECK: latch.2: ; CHECK-NEXT: br label [[FOR_END]] +; CHECK: for.end: +; CHECK-NEXT: ret void ; entry: %0 = select i1 %cond, i32 2, i32 3 diff --git a/llvm/test/Transforms/LoopUnroll/ARM/loop-unrolling.ll b/llvm/test/Transforms/LoopUnroll/ARM/loop-unrolling.ll index f2e748ade0a2..e12dbf031b3b 100644 --- a/llvm/test/Transforms/LoopUnroll/ARM/loop-unrolling.ll +++ b/llvm/test/Transforms/LoopUnroll/ARM/loop-unrolling.ll @@ -121,14 +121,14 @@ for.body4: ; CHECK-NOUNROLL: br ; CHECK-UNROLL: for.body4.epil: +; CHECK-UNROLL: for.body4.epil.1: +; CHECK-UNROLL: for.body4.epil.2: ; CHECK-UNROLL: [[IV0:%[a-z.0-9]+]] = phi i32 [ 0, [[PRE:%[a-z0-9.]+]] ], [ [[IV4:%[a-z.0-9]+]], %for.body4 ] ; CHECK-UNROLL: [[IV1:%[a-z.0-9]+]] = add nuw nsw i32 [[IV0]], 1 ; CHECK-UNROLL: [[IV2:%[a-z.0-9]+]] = add nuw nsw i32 [[IV1]], 1 ; CHECK-UNROLL: [[IV3:%[a-z.0-9]+]] = add nuw nsw i32 [[IV2]], 1 ; CHECK-UNROLL: [[IV4]] = add nuw i32 [[IV3]], 1 ; CHECK-UNROLL: br -; CHECK-UNROLL: for.body4.epil.1: -; CHECK-UNROLL: for.body4.epil.2: %w.024 = phi i32 [ 0, %for.body4.lr.ph ], [ %inc, %for.body4 ] %add = add i32 %w.024, %mul diff --git a/llvm/test/Transforms/LoopUnroll/ARM/multi-blocks.ll b/llvm/test/Transforms/LoopUnroll/ARM/multi-blocks.ll index 156c0ab10658..8c4257698ab7 100644 --- a/llvm/test/Transforms/LoopUnroll/ARM/multi-blocks.ll +++ 
b/llvm/test/Transforms/LoopUnroll/ARM/multi-blocks.ll @@ -45,8 +45,37 @@ define void @test_three_blocks(i32* nocapture %Output, ; CHECK-NEXT: [[EPIL_ITER_SUB:%.*]] = sub i32 [[XTRAITER]], 1 ; CHECK-NEXT: [[EPIL_ITER_CMP:%.*]] = icmp ne i32 [[EPIL_ITER_SUB]], 0 ; CHECK-NEXT: br i1 [[EPIL_ITER_CMP]], label [[FOR_BODY_EPIL_1:%.*]], label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA:%.*]] +; CHECK: for.body.epil.1: +; CHECK-NEXT: [[ARRAYIDX_EPIL_1:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC_EPIL]] +; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_1]], align 4 +; CHECK-NEXT: [[TOBOOL_EPIL_1:%.*]] = icmp eq i32 [[TMP4]], 0 +; CHECK-NEXT: br i1 [[TOBOOL_EPIL_1]], label [[FOR_INC_EPIL_1:%.*]], label [[IF_THEN_EPIL_1:%.*]] +; CHECK: if.then.epil.1: +; CHECK-NEXT: [[ARRAYIDX1_EPIL_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_EPIL]] +; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX1_EPIL_1]], align 4 +; CHECK-NEXT: [[ADD_EPIL_1:%.*]] = add i32 [[TMP5]], [[TEMP_1_EPIL]] +; CHECK-NEXT: br label [[FOR_INC_EPIL_1]] +; CHECK: for.inc.epil.1: +; CHECK-NEXT: [[TEMP_1_EPIL_1:%.*]] = phi i32 [ [[ADD_EPIL_1]], [[IF_THEN_EPIL_1]] ], [ [[TEMP_1_EPIL]], [[FOR_BODY_EPIL_1]] ] +; CHECK-NEXT: [[INC_EPIL_1:%.*]] = add nuw i32 [[INC_EPIL]], 1 +; CHECK-NEXT: [[EPIL_ITER_SUB_1:%.*]] = sub i32 [[EPIL_ITER_SUB]], 1 +; CHECK-NEXT: [[EPIL_ITER_CMP_1:%.*]] = icmp ne i32 [[EPIL_ITER_SUB_1]], 0 +; CHECK-NEXT: br i1 [[EPIL_ITER_CMP_1]], label [[FOR_BODY_EPIL_2:%.*]], label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] +; CHECK: for.body.epil.2: +; CHECK-NEXT: [[ARRAYIDX_EPIL_2:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC_EPIL_1]] +; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_2]], align 4 +; CHECK-NEXT: [[TOBOOL_EPIL_2:%.*]] = icmp eq i32 [[TMP6]], 0 +; CHECK-NEXT: br i1 [[TOBOOL_EPIL_2]], label [[FOR_INC_EPIL_2:%.*]], label [[IF_THEN_EPIL_2:%.*]] +; CHECK: if.then.epil.2: +; CHECK-NEXT: [[ARRAYIDX1_EPIL_2:%.*]] = 
getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_EPIL_1]] +; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX1_EPIL_2]], align 4 +; CHECK-NEXT: [[ADD_EPIL_2:%.*]] = add i32 [[TMP7]], [[TEMP_1_EPIL_1]] +; CHECK-NEXT: br label [[FOR_INC_EPIL_2]] +; CHECK: for.inc.epil.2: +; CHECK-NEXT: [[TEMP_1_EPIL_2:%.*]] = phi i32 [ [[ADD_EPIL_2]], [[IF_THEN_EPIL_2]] ], [ [[TEMP_1_EPIL_1]], [[FOR_BODY_EPIL_2]] ] +; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] ; CHECK: for.cond.cleanup.loopexit.epilog-lcssa: -; CHECK-NEXT: [[TEMP_1_LCSSA_PH1:%.*]] = phi i32 [ [[TEMP_1_EPIL]], [[FOR_INC_EPIL]] ], [ [[TEMP_1_EPIL_1:%.*]], [[FOR_INC_EPIL_1:%.*]] ], [ [[TEMP_1_EPIL_2:%.*]], [[FOR_INC_EPIL_2:%.*]] ] +; CHECK-NEXT: [[TEMP_1_LCSSA_PH1:%.*]] = phi i32 [ [[TEMP_1_EPIL]], [[FOR_INC_EPIL]] ], [ [[TEMP_1_EPIL_1]], [[FOR_INC_EPIL_1]] ], [ [[TEMP_1_EPIL_2]], [[FOR_INC_EPIL_2]] ] ; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT]] ; CHECK: for.cond.cleanup.loopexit: ; CHECK-NEXT: [[TEMP_1_LCSSA:%.*]] = phi i32 [ [[TEMP_1_LCSSA_PH]], [[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA]] ], [ [[TEMP_1_LCSSA_PH1]], [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] ] @@ -60,51 +89,22 @@ define void @test_three_blocks(i32* nocapture %Output, ; CHECK-NEXT: [[TEMP_09:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER_NEW]] ], [ [[TEMP_1_3]], [[FOR_INC_3]] ] ; CHECK-NEXT: [[NITER:%.*]] = phi i32 [ [[UNROLL_ITER]], [[FOR_BODY_PREHEADER_NEW]] ], [ [[NITER_NSUB_3:%.*]], [[FOR_INC_3]] ] ; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[J_010]] -; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX]], align 4 -; CHECK-NEXT: [[TOBOOL:%.*]] = icmp eq i32 [[TMP4]], 0 +; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX]], align 4 +; CHECK-NEXT: [[TOBOOL:%.*]] = icmp eq i32 [[TMP8]], 0 ; CHECK-NEXT: br i1 [[TOBOOL]], label [[FOR_INC:%.*]], label [[IF_THEN:%.*]] ; CHECK: if.then: ; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[J_010]] -; 
CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX1]], align 4 -; CHECK-NEXT: [[ADD:%.*]] = add i32 [[TMP5]], [[TEMP_09]] +; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[ARRAYIDX1]], align 4 +; CHECK-NEXT: [[ADD:%.*]] = add i32 [[TMP9]], [[TEMP_09]] ; CHECK-NEXT: br label [[FOR_INC]] ; CHECK: for.inc: ; CHECK-NEXT: [[TEMP_1:%.*]] = phi i32 [ [[ADD]], [[IF_THEN]] ], [ [[TEMP_09]], [[FOR_BODY]] ] ; CHECK-NEXT: [[INC:%.*]] = add nuw nsw i32 [[J_010]], 1 ; CHECK-NEXT: [[NITER_NSUB:%.*]] = sub i32 [[NITER]], 1 ; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC]] -; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4 -; CHECK-NEXT: [[TOBOOL_1:%.*]] = icmp eq i32 [[TMP6]], 0 +; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4 +; CHECK-NEXT: [[TOBOOL_1:%.*]] = icmp eq i32 [[TMP10]], 0 ; CHECK-NEXT: br i1 [[TOBOOL_1]], label [[FOR_INC_1:%.*]], label [[IF_THEN_1:%.*]] -; CHECK: for.body.epil.1: -; CHECK-NEXT: [[ARRAYIDX_EPIL_1:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC_EPIL]] -; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_1]], align 4 -; CHECK-NEXT: [[TOBOOL_EPIL_1:%.*]] = icmp eq i32 [[TMP7]], 0 -; CHECK-NEXT: br i1 [[TOBOOL_EPIL_1]], label [[FOR_INC_EPIL_1]], label [[IF_THEN_EPIL_1:%.*]] -; CHECK: if.then.epil.1: -; CHECK-NEXT: [[ARRAYIDX1_EPIL_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_EPIL]] -; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX1_EPIL_1]], align 4 -; CHECK-NEXT: [[ADD_EPIL_1:%.*]] = add i32 [[TMP8]], [[TEMP_1_EPIL]] -; CHECK-NEXT: br label [[FOR_INC_EPIL_1]] -; CHECK: for.inc.epil.1: -; CHECK-NEXT: [[TEMP_1_EPIL_1]] = phi i32 [ [[ADD_EPIL_1]], [[IF_THEN_EPIL_1]] ], [ [[TEMP_1_EPIL]], [[FOR_BODY_EPIL_1]] ] -; CHECK-NEXT: [[INC_EPIL_1:%.*]] = add nuw i32 [[INC_EPIL]], 1 -; CHECK-NEXT: [[EPIL_ITER_SUB_1:%.*]] = sub i32 [[EPIL_ITER_SUB]], 1 -; CHECK-NEXT: [[EPIL_ITER_CMP_1:%.*]] = icmp ne i32 [[EPIL_ITER_SUB_1]], 0 -; 
CHECK-NEXT: br i1 [[EPIL_ITER_CMP_1]], label [[FOR_BODY_EPIL_2:%.*]], label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] -; CHECK: for.body.epil.2: -; CHECK-NEXT: [[ARRAYIDX_EPIL_2:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC_EPIL_1]] -; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_2]], align 4 -; CHECK-NEXT: [[TOBOOL_EPIL_2:%.*]] = icmp eq i32 [[TMP9]], 0 -; CHECK-NEXT: br i1 [[TOBOOL_EPIL_2]], label [[FOR_INC_EPIL_2]], label [[IF_THEN_EPIL_2:%.*]] -; CHECK: if.then.epil.2: -; CHECK-NEXT: [[ARRAYIDX1_EPIL_2:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_EPIL_1]] -; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[ARRAYIDX1_EPIL_2]], align 4 -; CHECK-NEXT: [[ADD_EPIL_2:%.*]] = add i32 [[TMP10]], [[TEMP_1_EPIL_1]] -; CHECK-NEXT: br label [[FOR_INC_EPIL_2]] -; CHECK: for.inc.epil.2: -; CHECK-NEXT: [[TEMP_1_EPIL_2]] = phi i32 [ [[ADD_EPIL_2]], [[IF_THEN_EPIL_2]] ], [ [[TEMP_1_EPIL_1]], [[FOR_BODY_EPIL_2]] ] -; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] ; CHECK: if.then.1: ; CHECK-NEXT: [[ARRAYIDX1_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC]] ; CHECK-NEXT: [[TMP11:%.*]] = load i32, i32* [[ARRAYIDX1_1]], align 4 @@ -203,41 +203,34 @@ define void @test_two_exits(i32* nocapture %Output, ; CHECK-NEXT: [[INC:%.*]] = add nuw nsw i32 [[J_016]], 1 ; CHECK-NEXT: [[CMP:%.*]] = icmp ult i32 [[INC]], [[MAXJ]] ; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY_1:%.*]], label [[CLEANUP_LOOPEXIT]] -; CHECK: cleanup.loopexit: -; CHECK-NEXT: [[TEMP_0_LCSSA_PH:%.*]] = phi i32 [ [[TEMP_0_ADD]], [[IF_END]] ], [ [[TEMP_015]], [[FOR_BODY]] ], [ [[TEMP_0_ADD]], [[FOR_BODY_1]] ], [ [[TEMP_0_ADD_1:%.*]], [[IF_END_1:%.*]] ], [ [[TEMP_0_ADD_1]], [[FOR_BODY_2:%.*]] ], [ [[TEMP_0_ADD_2:%.*]], [[IF_END_2:%.*]] ], [ [[TEMP_0_ADD_2]], [[FOR_BODY_3:%.*]] ], [ [[TEMP_0_ADD_3]], [[IF_END_3]] ] -; CHECK-NEXT: br label [[CLEANUP]] -; CHECK: cleanup: -; CHECK-NEXT: [[TEMP_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ 
[[TEMP_0_LCSSA_PH]], [[CLEANUP_LOOPEXIT]] ] -; CHECK-NEXT: store i32 [[TEMP_0_LCSSA]], i32* [[OUTPUT:%.*]], align 4 -; CHECK-NEXT: ret void ; CHECK: for.body.1: ; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC]] ; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4 ; CHECK-NEXT: [[CMP1_1:%.*]] = icmp ugt i32 [[TMP2]], 65535 -; CHECK-NEXT: br i1 [[CMP1_1]], label [[CLEANUP_LOOPEXIT]], label [[IF_END_1]] +; CHECK-NEXT: br i1 [[CMP1_1]], label [[CLEANUP_LOOPEXIT]], label [[IF_END_1:%.*]] ; CHECK: if.end.1: ; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC]] ; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX2_1]], align 4 ; CHECK-NEXT: [[TOBOOL_1:%.*]] = icmp eq i32 [[TMP3]], 0 ; CHECK-NEXT: [[ADD_1:%.*]] = select i1 [[TOBOOL_1]], i32 0, i32 [[TMP2]] -; CHECK-NEXT: [[TEMP_0_ADD_1]] = add i32 [[ADD_1]], [[TEMP_0_ADD]] +; CHECK-NEXT: [[TEMP_0_ADD_1:%.*]] = add i32 [[ADD_1]], [[TEMP_0_ADD]] ; CHECK-NEXT: [[INC_1:%.*]] = add nuw nsw i32 [[INC]], 1 ; CHECK-NEXT: [[CMP_1:%.*]] = icmp ult i32 [[INC_1]], [[MAXJ]] -; CHECK-NEXT: br i1 [[CMP_1]], label [[FOR_BODY_2]], label [[CLEANUP_LOOPEXIT]] +; CHECK-NEXT: br i1 [[CMP_1]], label [[FOR_BODY_2:%.*]], label [[CLEANUP_LOOPEXIT]] ; CHECK: for.body.2: ; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_1]] ; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX_2]], align 4 ; CHECK-NEXT: [[CMP1_2:%.*]] = icmp ugt i32 [[TMP4]], 65535 -; CHECK-NEXT: br i1 [[CMP1_2]], label [[CLEANUP_LOOPEXIT]], label [[IF_END_2]] +; CHECK-NEXT: br i1 [[CMP1_2]], label [[CLEANUP_LOOPEXIT]], label [[IF_END_2:%.*]] ; CHECK: if.end.2: ; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC_1]] ; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX2_2]], align 4 ; CHECK-NEXT: [[TOBOOL_2:%.*]] = icmp eq i32 [[TMP5]], 0 ; CHECK-NEXT: [[ADD_2:%.*]] = select i1 [[TOBOOL_2]], i32 0, i32 
[[TMP4]] -; CHECK-NEXT: [[TEMP_0_ADD_2]] = add i32 [[ADD_2]], [[TEMP_0_ADD_1]] +; CHECK-NEXT: [[TEMP_0_ADD_2:%.*]] = add i32 [[ADD_2]], [[TEMP_0_ADD_1]] ; CHECK-NEXT: [[INC_2:%.*]] = add nuw nsw i32 [[INC_1]], 1 ; CHECK-NEXT: [[CMP_2:%.*]] = icmp ult i32 [[INC_2]], [[MAXJ]] -; CHECK-NEXT: br i1 [[CMP_2]], label [[FOR_BODY_3]], label [[CLEANUP_LOOPEXIT]] +; CHECK-NEXT: br i1 [[CMP_2]], label [[FOR_BODY_3:%.*]], label [[CLEANUP_LOOPEXIT]] ; CHECK: for.body.3: ; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_2]] ; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX_3]], align 4 @@ -252,6 +245,13 @@ define void @test_two_exits(i32* nocapture %Output, ; CHECK-NEXT: [[INC_3]] = add nuw i32 [[INC_2]], 1 ; CHECK-NEXT: [[CMP_3:%.*]] = icmp ult i32 [[INC_3]], [[MAXJ]] ; CHECK-NEXT: br i1 [[CMP_3]], label [[FOR_BODY]], label [[CLEANUP_LOOPEXIT]] +; CHECK: cleanup.loopexit: +; CHECK-NEXT: [[TEMP_0_LCSSA_PH:%.*]] = phi i32 [ [[TEMP_0_ADD]], [[IF_END]] ], [ [[TEMP_015]], [[FOR_BODY]] ], [ [[TEMP_0_ADD]], [[FOR_BODY_1]] ], [ [[TEMP_0_ADD_1]], [[IF_END_1]] ], [ [[TEMP_0_ADD_1]], [[FOR_BODY_2]] ], [ [[TEMP_0_ADD_2]], [[IF_END_2]] ], [ [[TEMP_0_ADD_2]], [[FOR_BODY_3]] ], [ [[TEMP_0_ADD_3]], [[IF_END_3]] ] +; CHECK-NEXT: br label [[CLEANUP]] +; CHECK: cleanup: +; CHECK-NEXT: [[TEMP_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TEMP_0_LCSSA_PH]], [[CLEANUP_LOOPEXIT]] ] +; CHECK-NEXT: store i32 [[TEMP_0_LCSSA]], i32* [[OUTPUT:%.*]], align 4 +; CHECK-NEXT: ret void ; i32* nocapture readonly %Condition, i32* nocapture readonly %Input, @@ -417,100 +417,100 @@ define void @test_four_blocks(i32* nocapture %Output, ; CHECK-NEXT: [[EPIL_ITER_SUB:%.*]] = sub i32 [[XTRAITER]], 1 ; CHECK-NEXT: [[EPIL_ITER_CMP:%.*]] = icmp ne i32 [[EPIL_ITER_SUB]], 0 ; CHECK-NEXT: br i1 [[EPIL_ITER_CMP]], label [[FOR_BODY_EPIL_1:%.*]], label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA:%.*]] -; CHECK: for.cond.cleanup.loopexit.epilog-lcssa: -; CHECK-NEXT: 
[[TEMP_1_LCSSA_PH1:%.*]] = phi i32 [ [[TEMP_1_EPIL]], [[FOR_INC_EPIL]] ], [ [[TEMP_1_EPIL_1:%.*]], [[FOR_INC_EPIL_1:%.*]] ], [ [[TEMP_1_EPIL_2:%.*]], [[FOR_INC_EPIL_2:%.*]] ] -; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT]] -; CHECK: for.cond.cleanup.loopexit: -; CHECK-NEXT: [[TEMP_1_LCSSA:%.*]] = phi i32 [ [[TEMP_1_LCSSA_PH]], [[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA]] ], [ [[TEMP_1_LCSSA_PH1]], [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] ] -; CHECK-NEXT: br label [[FOR_COND_CLEANUP]] -; CHECK: for.cond.cleanup: -; CHECK-NEXT: [[TEMP_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TEMP_1_LCSSA]], [[FOR_COND_CLEANUP_LOOPEXIT]] ] -; CHECK-NEXT: store i32 [[TEMP_0_LCSSA]], i32* [[OUTPUT:%.*]], align 4 -; CHECK-NEXT: ret void -; CHECK: for.body: -; CHECK-NEXT: [[TMP6:%.*]] = phi i32 [ [[DOTPRE]], [[FOR_BODY_LR_PH_NEW]] ], [ [[TMP23]], [[FOR_INC_3]] ] -; CHECK-NEXT: [[J_027:%.*]] = phi i32 [ 1, [[FOR_BODY_LR_PH_NEW]] ], [ [[INC_3]], [[FOR_INC_3]] ] -; CHECK-NEXT: [[TEMP_026:%.*]] = phi i32 [ 0, [[FOR_BODY_LR_PH_NEW]] ], [ [[TEMP_1_3]], [[FOR_INC_3]] ] -; CHECK-NEXT: [[NITER:%.*]] = phi i32 [ [[UNROLL_ITER]], [[FOR_BODY_LR_PH_NEW]] ], [ [[NITER_NSUB_3:%.*]], [[FOR_INC_3]] ] -; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[J_027]] -; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX]], align 4 -; CHECK-NEXT: [[CMP1:%.*]] = icmp ugt i32 [[TMP7]], 65535 -; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[J_027]] -; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX2]], align 4 -; CHECK-NEXT: [[CMP4:%.*]] = icmp ugt i32 [[TMP8]], [[TMP6]] -; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]] -; CHECK: if.then: -; CHECK-NEXT: [[COND:%.*]] = zext i1 [[CMP4]] to i32 -; CHECK-NEXT: [[ADD:%.*]] = add i32 [[TEMP_026]], [[COND]] -; CHECK-NEXT: br label [[FOR_INC:%.*]] -; CHECK: if.else: -; CHECK-NEXT: [[NOT_CMP4:%.*]] = xor i1 [[CMP4]], true -; CHECK-NEXT: [[SUB:%.*]] = sext i1 
[[NOT_CMP4]] to i32 -; CHECK-NEXT: [[SUB10_SINK:%.*]] = add i32 [[J_027]], [[SUB]] -; CHECK-NEXT: [[ARRAYIDX11:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[SUB10_SINK]] -; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[ARRAYIDX11]], align 4 -; CHECK-NEXT: [[SUB13:%.*]] = sub i32 [[TEMP_026]], [[TMP9]] -; CHECK-NEXT: br label [[FOR_INC]] -; CHECK: for.inc: -; CHECK-NEXT: [[TEMP_1:%.*]] = phi i32 [ [[ADD]], [[IF_THEN]] ], [ [[SUB13]], [[IF_ELSE]] ] -; CHECK-NEXT: [[INC:%.*]] = add nuw nsw i32 [[J_027]], 1 -; CHECK-NEXT: [[NITER_NSUB:%.*]] = sub i32 [[NITER]], 1 -; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC]] -; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4 -; CHECK-NEXT: [[CMP1_1:%.*]] = icmp ugt i32 [[TMP10]], 65535 -; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC]] -; CHECK-NEXT: [[TMP11:%.*]] = load i32, i32* [[ARRAYIDX2_1]], align 4 -; CHECK-NEXT: [[CMP4_1:%.*]] = icmp ugt i32 [[TMP11]], [[TMP8]] -; CHECK-NEXT: br i1 [[CMP1_1]], label [[IF_THEN_1:%.*]], label [[IF_ELSE_1:%.*]] ; CHECK: for.body.epil.1: ; CHECK-NEXT: [[ARRAYIDX_EPIL_1:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC_EPIL]] -; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_1]], align 4 -; CHECK-NEXT: [[CMP1_EPIL_1:%.*]] = icmp ugt i32 [[TMP12]], 65535 +; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_1]], align 4 +; CHECK-NEXT: [[CMP1_EPIL_1:%.*]] = icmp ugt i32 [[TMP6]], 65535 ; CHECK-NEXT: [[ARRAYIDX2_EPIL_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_EPIL]] -; CHECK-NEXT: [[TMP13:%.*]] = load i32, i32* [[ARRAYIDX2_EPIL_1]], align 4 -; CHECK-NEXT: [[CMP4_EPIL_1:%.*]] = icmp ugt i32 [[TMP13]], [[TMP4]] +; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX2_EPIL_1]], align 4 +; CHECK-NEXT: [[CMP4_EPIL_1:%.*]] = icmp ugt i32 [[TMP7]], [[TMP4]] ; CHECK-NEXT: br i1 [[CMP1_EPIL_1]], label [[IF_THEN_EPIL_1:%.*]], label 
[[IF_ELSE_EPIL_1:%.*]] ; CHECK: if.else.epil.1: ; CHECK-NEXT: [[NOT_CMP4_EPIL_1:%.*]] = xor i1 [[CMP4_EPIL_1]], true ; CHECK-NEXT: [[SUB_EPIL_1:%.*]] = sext i1 [[NOT_CMP4_EPIL_1]] to i32 ; CHECK-NEXT: [[SUB10_SINK_EPIL_1:%.*]] = add i32 [[INC_EPIL]], [[SUB_EPIL_1]] ; CHECK-NEXT: [[ARRAYIDX11_EPIL_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[SUB10_SINK_EPIL_1]] -; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX11_EPIL_1]], align 4 -; CHECK-NEXT: [[SUB13_EPIL_1:%.*]] = sub i32 [[TEMP_1_EPIL]], [[TMP14]] -; CHECK-NEXT: br label [[FOR_INC_EPIL_1]] +; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX11_EPIL_1]], align 4 +; CHECK-NEXT: [[SUB13_EPIL_1:%.*]] = sub i32 [[TEMP_1_EPIL]], [[TMP8]] +; CHECK-NEXT: br label [[FOR_INC_EPIL_1:%.*]] ; CHECK: if.then.epil.1: ; CHECK-NEXT: [[COND_EPIL_1:%.*]] = zext i1 [[CMP4_EPIL_1]] to i32 ; CHECK-NEXT: [[ADD_EPIL_1:%.*]] = add i32 [[TEMP_1_EPIL]], [[COND_EPIL_1]] ; CHECK-NEXT: br label [[FOR_INC_EPIL_1]] ; CHECK: for.inc.epil.1: -; CHECK-NEXT: [[TEMP_1_EPIL_1]] = phi i32 [ [[ADD_EPIL_1]], [[IF_THEN_EPIL_1]] ], [ [[SUB13_EPIL_1]], [[IF_ELSE_EPIL_1]] ] +; CHECK-NEXT: [[TEMP_1_EPIL_1:%.*]] = phi i32 [ [[ADD_EPIL_1]], [[IF_THEN_EPIL_1]] ], [ [[SUB13_EPIL_1]], [[IF_ELSE_EPIL_1]] ] ; CHECK-NEXT: [[INC_EPIL_1:%.*]] = add nuw i32 [[INC_EPIL]], 1 ; CHECK-NEXT: [[EPIL_ITER_SUB_1:%.*]] = sub i32 [[EPIL_ITER_SUB]], 1 ; CHECK-NEXT: [[EPIL_ITER_CMP_1:%.*]] = icmp ne i32 [[EPIL_ITER_SUB_1]], 0 ; CHECK-NEXT: br i1 [[EPIL_ITER_CMP_1]], label [[FOR_BODY_EPIL_2:%.*]], label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] ; CHECK: for.body.epil.2: ; CHECK-NEXT: [[ARRAYIDX_EPIL_2:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC_EPIL_1]] -; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_2]], align 4 -; CHECK-NEXT: [[CMP1_EPIL_2:%.*]] = icmp ugt i32 [[TMP15]], 65535 +; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[ARRAYIDX_EPIL_2]], align 4 +; CHECK-NEXT: [[CMP1_EPIL_2:%.*]] = icmp ugt i32 [[TMP9]], 65535 ; 
CHECK-NEXT: [[ARRAYIDX2_EPIL_2:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_EPIL_1]] -; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[ARRAYIDX2_EPIL_2]], align 4 -; CHECK-NEXT: [[CMP4_EPIL_2:%.*]] = icmp ugt i32 [[TMP16]], [[TMP13]] +; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[ARRAYIDX2_EPIL_2]], align 4 +; CHECK-NEXT: [[CMP4_EPIL_2:%.*]] = icmp ugt i32 [[TMP10]], [[TMP7]] ; CHECK-NEXT: br i1 [[CMP1_EPIL_2]], label [[IF_THEN_EPIL_2:%.*]], label [[IF_ELSE_EPIL_2:%.*]] ; CHECK: if.else.epil.2: ; CHECK-NEXT: [[NOT_CMP4_EPIL_2:%.*]] = xor i1 [[CMP4_EPIL_2]], true ; CHECK-NEXT: [[SUB_EPIL_2:%.*]] = sext i1 [[NOT_CMP4_EPIL_2]] to i32 ; CHECK-NEXT: [[SUB10_SINK_EPIL_2:%.*]] = add i32 [[INC_EPIL_1]], [[SUB_EPIL_2]] ; CHECK-NEXT: [[ARRAYIDX11_EPIL_2:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[SUB10_SINK_EPIL_2]] -; CHECK-NEXT: [[TMP17:%.*]] = load i32, i32* [[ARRAYIDX11_EPIL_2]], align 4 -; CHECK-NEXT: [[SUB13_EPIL_2:%.*]] = sub i32 [[TEMP_1_EPIL_1]], [[TMP17]] -; CHECK-NEXT: br label [[FOR_INC_EPIL_2]] +; CHECK-NEXT: [[TMP11:%.*]] = load i32, i32* [[ARRAYIDX11_EPIL_2]], align 4 +; CHECK-NEXT: [[SUB13_EPIL_2:%.*]] = sub i32 [[TEMP_1_EPIL_1]], [[TMP11]] +; CHECK-NEXT: br label [[FOR_INC_EPIL_2:%.*]] ; CHECK: if.then.epil.2: ; CHECK-NEXT: [[COND_EPIL_2:%.*]] = zext i1 [[CMP4_EPIL_2]] to i32 ; CHECK-NEXT: [[ADD_EPIL_2:%.*]] = add i32 [[TEMP_1_EPIL_1]], [[COND_EPIL_2]] ; CHECK-NEXT: br label [[FOR_INC_EPIL_2]] ; CHECK: for.inc.epil.2: -; CHECK-NEXT: [[TEMP_1_EPIL_2]] = phi i32 [ [[ADD_EPIL_2]], [[IF_THEN_EPIL_2]] ], [ [[SUB13_EPIL_2]], [[IF_ELSE_EPIL_2]] ] +; CHECK-NEXT: [[TEMP_1_EPIL_2:%.*]] = phi i32 [ [[ADD_EPIL_2]], [[IF_THEN_EPIL_2]] ], [ [[SUB13_EPIL_2]], [[IF_ELSE_EPIL_2]] ] ; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] +; CHECK: for.cond.cleanup.loopexit.epilog-lcssa: +; CHECK-NEXT: [[TEMP_1_LCSSA_PH1:%.*]] = phi i32 [ [[TEMP_1_EPIL]], [[FOR_INC_EPIL]] ], [ [[TEMP_1_EPIL_1]], [[FOR_INC_EPIL_1]] ], [ [[TEMP_1_EPIL_2]], 
[[FOR_INC_EPIL_2]] ] +; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT]] +; CHECK: for.cond.cleanup.loopexit: +; CHECK-NEXT: [[TEMP_1_LCSSA:%.*]] = phi i32 [ [[TEMP_1_LCSSA_PH]], [[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA]] ], [ [[TEMP_1_LCSSA_PH1]], [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]] ] +; CHECK-NEXT: br label [[FOR_COND_CLEANUP]] +; CHECK: for.cond.cleanup: +; CHECK-NEXT: [[TEMP_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TEMP_1_LCSSA]], [[FOR_COND_CLEANUP_LOOPEXIT]] ] +; CHECK-NEXT: store i32 [[TEMP_0_LCSSA]], i32* [[OUTPUT:%.*]], align 4 +; CHECK-NEXT: ret void +; CHECK: for.body: +; CHECK-NEXT: [[TMP12:%.*]] = phi i32 [ [[DOTPRE]], [[FOR_BODY_LR_PH_NEW]] ], [ [[TMP23]], [[FOR_INC_3]] ] +; CHECK-NEXT: [[J_027:%.*]] = phi i32 [ 1, [[FOR_BODY_LR_PH_NEW]] ], [ [[INC_3]], [[FOR_INC_3]] ] +; CHECK-NEXT: [[TEMP_026:%.*]] = phi i32 [ 0, [[FOR_BODY_LR_PH_NEW]] ], [ [[TEMP_1_3]], [[FOR_INC_3]] ] +; CHECK-NEXT: [[NITER:%.*]] = phi i32 [ [[UNROLL_ITER]], [[FOR_BODY_LR_PH_NEW]] ], [ [[NITER_NSUB_3:%.*]], [[FOR_INC_3]] ] +; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[J_027]] +; CHECK-NEXT: [[TMP13:%.*]] = load i32, i32* [[ARRAYIDX]], align 4 +; CHECK-NEXT: [[CMP1:%.*]] = icmp ugt i32 [[TMP13]], 65535 +; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[J_027]] +; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX2]], align 4 +; CHECK-NEXT: [[CMP4:%.*]] = icmp ugt i32 [[TMP14]], [[TMP12]] +; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]] +; CHECK: if.then: +; CHECK-NEXT: [[COND:%.*]] = zext i1 [[CMP4]] to i32 +; CHECK-NEXT: [[ADD:%.*]] = add i32 [[TEMP_026]], [[COND]] +; CHECK-NEXT: br label [[FOR_INC:%.*]] +; CHECK: if.else: +; CHECK-NEXT: [[NOT_CMP4:%.*]] = xor i1 [[CMP4]], true +; CHECK-NEXT: [[SUB:%.*]] = sext i1 [[NOT_CMP4]] to i32 +; CHECK-NEXT: [[SUB10_SINK:%.*]] = add i32 [[J_027]], [[SUB]] +; CHECK-NEXT: [[ARRAYIDX11:%.*]] = getelementptr inbounds i32, 
i32* [[INPUT]], i32 [[SUB10_SINK]] +; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* [[ARRAYIDX11]], align 4 +; CHECK-NEXT: [[SUB13:%.*]] = sub i32 [[TEMP_026]], [[TMP15]] +; CHECK-NEXT: br label [[FOR_INC]] +; CHECK: for.inc: +; CHECK-NEXT: [[TEMP_1:%.*]] = phi i32 [ [[ADD]], [[IF_THEN]] ], [ [[SUB13]], [[IF_ELSE]] ] +; CHECK-NEXT: [[INC:%.*]] = add nuw nsw i32 [[J_027]], 1 +; CHECK-NEXT: [[NITER_NSUB:%.*]] = sub i32 [[NITER]], 1 +; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[CONDITION]], i32 [[INC]] +; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4 +; CHECK-NEXT: [[CMP1_1:%.*]] = icmp ugt i32 [[TMP16]], 65535 +; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC]] +; CHECK-NEXT: [[TMP17:%.*]] = load i32, i32* [[ARRAYIDX2_1]], align 4 +; CHECK-NEXT: [[CMP4_1:%.*]] = icmp ugt i32 [[TMP17]], [[TMP14]] +; CHECK-NEXT: br i1 [[CMP1_1]], label [[IF_THEN_1:%.*]], label [[IF_ELSE_1:%.*]] ; CHECK: if.else.1: ; CHECK-NEXT: [[NOT_CMP4_1:%.*]] = xor i1 [[CMP4_1]], true ; CHECK-NEXT: [[SUB_1:%.*]] = sext i1 [[NOT_CMP4_1]] to i32 @@ -532,7 +532,7 @@ define void @test_four_blocks(i32* nocapture %Output, ; CHECK-NEXT: [[CMP1_2:%.*]] = icmp ugt i32 [[TMP19]], 65535 ; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i32, i32* [[INPUT]], i32 [[INC_1]] ; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX2_2]], align 4 -; CHECK-NEXT: [[CMP4_2:%.*]] = icmp ugt i32 [[TMP20]], [[TMP11]] +; CHECK-NEXT: [[CMP4_2:%.*]] = icmp ugt i32 [[TMP20]], [[TMP17]] ; CHECK-NEXT: br i1 [[CMP1_2]], label [[IF_THEN_2:%.*]], label [[IF_ELSE_2:%.*]] ; CHECK: if.else.2: ; CHECK-NEXT: [[NOT_CMP4_2:%.*]] = xor i1 [[CMP4_2]], true @@ -742,10 +742,6 @@ define void @iterate_inc(%struct.Node* %n, i32 %limit) { ; CHECK-NEXT: [[TMP2:%.*]] = load %struct.Node*, %struct.Node** [[TMP1]], align 4 ; CHECK-NEXT: [[TOBOOL:%.*]] = icmp eq %struct.Node* [[TMP2]], null ; CHECK-NEXT: br i1 [[TOBOOL]], label [[WHILE_END_LOOPEXIT]], label 
[[LAND_RHS_1:%.*]] -; CHECK: while.end.loopexit: -; CHECK-NEXT: br label [[WHILE_END]] -; CHECK: while.end: -; CHECK-NEXT: ret void ; CHECK: land.rhs.1: ; CHECK-NEXT: [[VAL_1:%.*]] = getelementptr inbounds [[STRUCT_NODE]], %struct.Node* [[TMP2]], i32 0, i32 1 ; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* [[VAL_1]], align 4 @@ -782,6 +778,10 @@ define void @iterate_inc(%struct.Node* %n, i32 %limit) { ; CHECK-NEXT: [[TMP11]] = load %struct.Node*, %struct.Node** [[TMP10]], align 4 ; CHECK-NEXT: [[TOBOOL_3:%.*]] = icmp eq %struct.Node* [[TMP11]], null ; CHECK-NEXT: br i1 [[TOBOOL_3]], label [[WHILE_END_LOOPEXIT]], label [[LAND_RHS]] +; CHECK: while.end.loopexit: +; CHECK-NEXT: br label [[WHILE_END]] +; CHECK: while.end: +; CHECK-NEXT: ret void ; entry: %tobool5 = icmp eq %struct.Node* %n, null diff --git a/llvm/test/Transforms/LoopUnroll/ARM/upperbound.ll b/llvm/test/Transforms/LoopUnroll/ARM/upperbound.ll index ea18d3aa1054..33151c68b319 100644 --- a/llvm/test/Transforms/LoopUnroll/ARM/upperbound.ll +++ b/llvm/test/Transforms/LoopUnroll/ARM/upperbound.ll @@ -20,8 +20,6 @@ define void @test(i32* %x, i32 %n) { ; CHECK-NEXT: [[INCDEC_PTR:%.*]] = getelementptr inbounds i32, i32* [[X]], i64 1 ; CHECK-NEXT: [[CMP:%.*]] = icmp sgt i32 [[REM]], 1 ; CHECK-NEXT: br i1 [[CMP]], label [[WHILE_BODY_1:%.*]], label [[WHILE_END]] -; CHECK: while.end: -; CHECK-NEXT: ret void ; CHECK: while.body.1: ; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[INCDEC_PTR]], align 4 ; CHECK-NEXT: [[CMP1_1:%.*]] = icmp slt i32 [[TMP1]], 10 @@ -40,6 +38,8 @@ define void @test(i32* %x, i32 %n) { ; CHECK: if.then.2: ; CHECK-NEXT: store i32 0, i32* [[INCDEC_PTR_1]], align 4 ; CHECK-NEXT: br label [[WHILE_END]] +; CHECK: while.end: +; CHECK-NEXT: ret void ; entry: %sub = add nsw i32 %n, -1 @@ -76,9 +76,9 @@ define i32 @test2(i32 %l86) { ; CHECK-NEXT: [[L86_OFF:%.*]] = add i32 [[L86:%.*]], -1 ; CHECK-NEXT: [[SWITCH:%.*]] = icmp ult i32 [[L86_OFF]], 24 ; CHECK-NEXT: [[DOTNOT30:%.*]] = icmp ne i32 [[L86]], 25 -; 
CHECK-NEXT: [[SPEC_SELECT24:%.*]] = zext i1 [[DOTNOT30]] to i32 -; CHECK-NEXT: [[COMMON_RET31_OP:%.*]] = select i1 [[SWITCH]], i32 0, i32 [[SPEC_SELECT24]] -; CHECK-NEXT: ret i32 [[COMMON_RET31_OP]] +; CHECK-NEXT: [[SPEC_SELECT:%.*]] = zext i1 [[DOTNOT30]] to i32 +; CHECK-NEXT: [[COMMON_RET_OP:%.*]] = select i1 [[SWITCH]], i32 0, i32 [[SPEC_SELECT]] +; CHECK-NEXT: ret i32 [[COMMON_RET_OP]] ; entry: br label %for.body.i.i diff --git a/llvm/test/Transforms/LoopUnroll/full-unroll-keep-first-exit.ll b/llvm/test/Transforms/LoopUnroll/full-unroll-keep-first-exit.ll index 316051715584..cdc8e944715e 100644 --- a/llvm/test/Transforms/LoopUnroll/full-unroll-keep-first-exit.ll +++ b/llvm/test/Transforms/LoopUnroll/full-unroll-keep-first-exit.ll @@ -15,12 +15,12 @@ define void @s32_max1(i32 %n, i32* %p) { ; CHECK-NEXT: [[INC:%.*]] = add i32 [[N]], 1 ; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[N]], [[ADD]] ; CHECK-NEXT: br i1 [[CMP]], label [[DO_BODY_1:%.*]], label [[DO_END:%.*]] -; CHECK: do.end: -; CHECK-NEXT: ret void ; CHECK: do.body.1: ; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr i32, i32* [[P]], i32 [[INC]] ; CHECK-NEXT: store i32 [[INC]], i32* [[ARRAYIDX_1]], align 4 ; CHECK-NEXT: br label [[DO_END]] +; CHECK: do.end: +; CHECK-NEXT: ret void ; entry: %add = add i32 %n, 1 @@ -51,8 +51,6 @@ define void @s32_max2(i32 %n, i32* %p) { ; CHECK-NEXT: [[INC:%.*]] = add i32 [[N]], 1 ; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[N]], [[ADD]] ; CHECK-NEXT: br i1 [[CMP]], label [[DO_BODY_1:%.*]], label [[DO_END:%.*]] -; CHECK: do.end: -; CHECK-NEXT: ret void ; CHECK: do.body.1: ; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr i32, i32* [[P]], i32 [[INC]] ; CHECK-NEXT: store i32 [[INC]], i32* [[ARRAYIDX_1]], align 4 @@ -60,6 +58,8 @@ define void @s32_max2(i32 %n, i32* %p) { ; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr i32, i32* [[P]], i32 [[INC_1]] ; CHECK-NEXT: store i32 [[INC_1]], i32* [[ARRAYIDX_2]], align 4 ; CHECK-NEXT: br label [[DO_END]] +; CHECK: do.end: +; CHECK-NEXT: ret 
void ; entry: %add = add i32 %n, 2 @@ -163,12 +163,12 @@ define void @u32_max1(i32 %n, i32* %p) { ; CHECK-NEXT: [[INC:%.*]] = add i32 [[N]], 1 ; CHECK-NEXT: [[CMP:%.*]] = icmp ult i32 [[N]], [[ADD]] ; CHECK-NEXT: br i1 [[CMP]], label [[DO_BODY_1:%.*]], label [[DO_END:%.*]] -; CHECK: do.end: -; CHECK-NEXT: ret void ; CHECK: do.body.1: ; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr i32, i32* [[P]], i32 [[INC]] ; CHECK-NEXT: store i32 [[INC]], i32* [[ARRAYIDX_1]], align 4 ; CHECK-NEXT: br label [[DO_END]] +; CHECK: do.end: +; CHECK-NEXT: ret void ; entry: %add = add i32 %n, 1 @@ -199,8 +199,6 @@ define void @u32_max2(i32 %n, i32* %p) { ; CHECK-NEXT: [[INC:%.*]] = add i32 [[N]], 1 ; CHECK-NEXT: [[CMP:%.*]] = icmp ult i32 [[N]], [[ADD]] ; CHECK-NEXT: br i1 [[CMP]], label [[DO_BODY_1:%.*]], label [[DO_END:%.*]] -; CHECK: do.end: -; CHECK-NEXT: ret void ; CHECK: do.body.1: ; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr i32, i32* [[P]], i32 [[INC]] ; CHECK-NEXT: store i32 [[INC]], i32* [[ARRAYIDX_1]], align 4 @@ -208,6 +206,8 @@ define void @u32_max2(i32 %n, i32* %p) { ; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr i32, i32* [[P]], i32 [[INC_1]] ; CHECK-NEXT: store i32 [[INC_1]], i32* [[ARRAYIDX_2]], align 4 ; CHECK-NEXT: br label [[DO_END]] +; CHECK: do.end: +; CHECK-NEXT: ret void ; entry: %add = add i32 %n, 2 diff --git a/llvm/test/Transforms/LoopUnroll/full-unroll-one-unpredictable-exit.ll b/llvm/test/Transforms/LoopUnroll/full-unroll-one-unpredictable-exit.ll index 095a7c1e1dd1..b7d7e00fa0c9 100644 --- a/llvm/test/Transforms/LoopUnroll/full-unroll-one-unpredictable-exit.ll +++ b/llvm/test/Transforms/LoopUnroll/full-unroll-one-unpredictable-exit.ll @@ -34,11 +34,11 @@ define i1 @test_latch() { ; CHECK-NEXT: [[LOAD2_1:%.*]] = load i64, i64* [[GEP2_1]], align 8 ; CHECK-NEXT: [[EXITCOND2_1:%.*]] = icmp eq i64 [[LOAD1_1]], [[LOAD2_1]] ; CHECK-NEXT: br i1 [[EXITCOND2_1]], label [[LATCH_1:%.*]], label [[EXIT]] +; CHECK: latch.1: +; CHECK-NEXT: br label [[EXIT]] ; 
CHECK: exit: ; CHECK-NEXT: [[EXIT_VAL:%.*]] = phi i1 [ false, [[LOOP]] ], [ false, [[LATCH]] ], [ true, [[LATCH_1]] ] ; CHECK-NEXT: ret i1 [[EXIT_VAL]] -; CHECK: latch.1: -; CHECK-NEXT: br label [[EXIT]] ; start: %a1 = alloca [2 x i64], align 8 @@ -95,22 +95,22 @@ define i1 @test_non_latch() { ; CHECK-NEXT: [[LOAD2:%.*]] = load i64, i64* [[GEP2]], align 8 ; CHECK-NEXT: [[EXITCOND2:%.*]] = icmp eq i64 [[LOAD1]], [[LOAD2]] ; CHECK-NEXT: br i1 [[EXITCOND2]], label [[LOOP_1:%.*]], label [[EXIT:%.*]] -; CHECK: exit: -; CHECK-NEXT: [[EXIT_VAL:%.*]] = phi i1 [ false, [[LATCH]] ], [ false, [[LATCH_1:%.*]] ], [ true, [[LOOP_2:%.*]] ], [ false, [[LATCH_2:%.*]] ] -; CHECK-NEXT: ret i1 [[EXIT_VAL]] ; CHECK: loop.1: -; CHECK-NEXT: br label [[LATCH_1]] +; CHECK-NEXT: br label [[LATCH_1:%.*]] ; CHECK: latch.1: ; CHECK-NEXT: [[GEP1_1:%.*]] = getelementptr inbounds [2 x i64], [2 x i64]* [[A1]], i64 0, i64 1 ; CHECK-NEXT: [[GEP2_1:%.*]] = getelementptr inbounds [2 x i64], [2 x i64]* [[A2]], i64 0, i64 1 ; CHECK-NEXT: [[LOAD1_1:%.*]] = load i64, i64* [[GEP1_1]], align 8 ; CHECK-NEXT: [[LOAD2_1:%.*]] = load i64, i64* [[GEP2_1]], align 8 ; CHECK-NEXT: [[EXITCOND2_1:%.*]] = icmp eq i64 [[LOAD1_1]], [[LOAD2_1]] -; CHECK-NEXT: br i1 [[EXITCOND2_1]], label [[LOOP_2]], label [[EXIT]] +; CHECK-NEXT: br i1 [[EXITCOND2_1]], label [[LOOP_2:%.*]], label [[EXIT]] ; CHECK: loop.2: -; CHECK-NEXT: br i1 true, label [[EXIT]], label [[LATCH_2]] +; CHECK-NEXT: br i1 true, label [[EXIT]], label [[LATCH_2:%.*]] ; CHECK: latch.2: ; CHECK-NEXT: br label [[EXIT]] +; CHECK: exit: +; CHECK-NEXT: [[EXIT_VAL:%.*]] = phi i1 [ false, [[LATCH]] ], [ false, [[LATCH_1]] ], [ true, [[LOOP_2]] ], [ false, [[LATCH_2]] ] +; CHECK-NEXT: ret i1 [[EXIT_VAL]] ; start: %a1 = alloca [2 x i64], align 8 diff --git a/llvm/test/Transforms/LoopUnroll/multiple-exits.ll b/llvm/test/Transforms/LoopUnroll/multiple-exits.ll index 0bea86350b99..9f40f51c10e6 100644 --- a/llvm/test/Transforms/LoopUnroll/multiple-exits.ll +++ 
b/llvm/test/Transforms/LoopUnroll/multiple-exits.ll
@@ -14,8 +14,6 @@ define void @test1() {
 ; CHECK-NEXT: call void @bar()
 ; CHECK-NEXT: call void @bar()
 ; CHECK-NEXT: br label [[LATCH_1:%.*]]
-; CHECK: exit:
-; CHECK-NEXT: ret void
 ; CHECK: latch.1:
</cut>

4 years, 5 months


[TCWG CI] 482.sphinx3 slowed down by 6% after llvm: Making the code compliant to the documentation about Floating Point support default values for C/C++. FPP-MODEL=PRECISE enables FFP-CONTRACT(FMA is enabled).

by ci_notify＠linaro.org

After llvm commit f04e387055e495e3e14570087d68e93593cf2918
Author: Zahira Ammarguellat <zahira.ammarguellat(a)intel.com>

    Making the code compliant to the documentation about Floating Point support default values for C/C++. FPP-MODEL=PRECISE enables FFP-CONTRACT(FMA is enabled).

the following benchmarks slowed down by more than 2%:
- 482.sphinx3 slowed down by 6% from 25875 to 27484 perf samples
- 482.sphinx3:[.] mgau_eval slowed down by 12% from 9996 to 11165 perf samples

The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-f04e387055e495e3e14570087d68e93593cf2918
cd investigate-llvm-f04e387055e495e3e14570087d68e93593cf2918

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach f04e387055e495e3e14570087d68e93593cf2918
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 491beae71d69960a3bb0298b17d4ef1f3119b767
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit f04e387055e495e3e14570087d68e93593cf2918
Author: Zahira Ammarguellat <zahira.ammarguellat(a)intel.com>
Date:   Tue Nov 9 09:35:25 2021 -0500

    Making the code compliant to the documentation about Floating Point support default values for C/C++. FPP-MODEL=PRECISE enables FFP-CONTRACT(FMA is enabled).

    Fix for https://bugs.llvm.org/show_bug.cgi?id=50222
---
 clang/docs/ReleaseNotes.rst              |  10 +++
 clang/docs/UsersManual.rst               |  47 +++++++++++-
 clang/lib/Driver/ToolChains/Clang.cpp    |  48 +++++++-----
 clang/test/CodeGen/ffp-contract-option.c | 127 +++++++++++++++++++++++++++++--
 clang/test/CodeGen/ffp-model.c           |  48 ++++++++++++
 clang/test/CodeGen/ppc-emmintrin.c       |   4 +-
 clang/test/CodeGen/ppc-xmmintrin.c       |   4 +-
 clang/test/Driver/fp-model.c             |   2 +-
 clang/test/Misc/ffp-contract.c           |  10 +++
 9 files changed, 263 insertions(+), 37 deletions(-)

diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 57c5150becae..00582b689862 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -202,6 +202,16 @@ Arm and AArch64 Support in Clang
   architecture features, but will enable certain optimizations specific to
   Cortex-A57 CPUs and enable the use of a more accurate scheduling model.
 
+
+Floating Point Support in Clang
+-------------------------------
+- The -ffp-model=precise now implies -ffp-contract=on rather than
+  -ffp-contract=fast, and the documentation of these features has been
+  clarified. Previously, the documentation claimed that -ffp-model=precise was
+  the default, but this was incorrect because the precise model implied
+  -ffp-contract=fast, whereas the default behavior is -ffp-contract=on.
+  -ffp-model=precise is now exactly the default mode of the compiler.
+
 Internal API Changes
 --------------------
 
diff --git a/clang/docs/UsersManual.rst b/clang/docs/UsersManual.rst
index 8c6922db6b37..406efb093d55 100644
--- a/clang/docs/UsersManual.rst
+++ b/clang/docs/UsersManual.rst
@@ -1260,8 +1260,49 @@ installed.
 Controlling Floating Point Behavior
 -----------------------------------
 
-Clang provides a number of ways to control floating point behavior. The options
-are listed below.
+Clang provides a number of ways to control floating point behavior, including
+with command line options and source pragmas. This section
+describes the various floating point semantic modes and the corresponding options.
+
+.. csv-table:: Floating Point Semantic Modes
+  :header: "Mode", "Values"
+  :widths: 15, 30, 30
+
+  "ffp-exception-behavior", "{ignore, strict, may_trap}",
+  "fenv_access", "{off, on}", "(none)"
+  "frounding-math", "{dynamic, tonearest, downward, upward, towardzero}"
+  "ffp-contract", "{on, off, fast, fast-honor-pragmas}"
+  "fdenormal-fp-math", "{IEEE, PreserveSign, PositiveZero}"
+  "fdenormal-fp-math-fp32", "{IEEE, PreserveSign, PositiveZero}"
+  "fmath-errno", "{on, off}"
+  "fhonor-nans", "{on, off}"
+  "fhonor-infinities", "{on, off}"
+  "fsigned-zeros", "{on, off}"
+  "freciprocal-math", "{on, off}"
+  "allow_approximate_fns", "{on, off}"
+  "fassociative-math", "{on, off}"
+
+This table describes the option settings that correspond to the three
+floating point semantic models: precise (the default), strict, and fast.
+
+
+.. csv-table:: Floating Point Models
+  :header: "Mode", "Precise", "Strict", "Fast"
+  :widths: 25, 15, 15, 15
+
+  "except_behavior", "ignore", "strict", "ignore"
+  "fenv_access", "off", "on", "off"
+  "rounding_mode", "tonearest", "dynamic", "tonearest"
+  "contract", "on", "off", "fast"
+  "denormal_fp_math", "IEEE", "IEEE", "PreserveSign"
+  "denormal_fp32_math", "IEEE","IEEE", "PreserveSign"
+  "support_math_errno", "on", "on", "off"
+  "no_honor_nans", "off", "off", "on"
+  "no_honor_infinities", "off", "off", "on"
+  "no_signed_zeros", "off", "off", "on"
+  "allow_reciprocal", "off", "off", "on"
+  "allow_approximate_fns", "off", "off", "on"
+  "allow_reassociation", "off", "off", "on"
 
 .. option:: -ffast-math
 
@@ -1467,7 +1508,7 @@ Note that floating-point operations performed as part of constant initialization
    and ``fast``.
    Details:
 
-   * ``precise`` Disables optimizations that are not value-safe on floating-point data, although FP contraction (FMA) is enabled (``-ffp-contract=fast``). This is the default behavior.
+   * ``precise`` Disables optimizations that are not value-safe on floating-point data, although FP contraction (FMA) is enabled (``-ffp-contract=on``). This is the default behavior.
    * ``strict`` Enables ``-frounding-math`` and ``-ffp-exception-behavior=strict``, and disables contractions (FMA). All of the ``-ffast-math`` enablements are disabled. Enables ``STDC FENV_ACCESS``: by default ``FENV_ACCESS`` is disabled. This option setting behaves as though ``#pragma STDC FENV_ACESS ON`` appeared at the top of the source file.
    * ``fast`` Behaves identically to specifying both ``-ffast-math`` and ``ffp-contract=fast``
 
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index e8ad105a7829..5d6f8e9fba0e 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -2666,10 +2666,14 @@ static void RenderFloatingPointOptions(const ToolChain &TC, const Driver &D,
 
   llvm::DenormalMode DenormalFPMath = DefaultDenormalFPMath;
   llvm::DenormalMode DenormalFP32Math = DefaultDenormalFP32Math;
-  StringRef FPContract = "";
+  // CUDA and HIP don't rely on the frontend to pass an ffp-contract option.
+  // If one wasn't given by the user, don't pass it here.
+  StringRef FPContract;
+  if (!JA.isDeviceOffloading(Action::OFK_Cuda) &&
+      !JA.isOffloading(Action::OFK_HIP))
+    FPContract = "on";
   bool StrictFPModel = false;
 
-
   if (const Arg *A = Args.getLastArg(options::OPT_flimited_precision_EQ)) {
     CmdArgs.push_back("-mlimit-float-precision");
     CmdArgs.push_back(A->getValue());
@@ -2691,7 +2695,7 @@ static void RenderFloatingPointOptions(const ToolChain &TC, const Driver &D,
       ReciprocalMath = false;
       SignedZeros = true;
       // -fno_fast_math restores default denormal and fpcontract handling
-      FPContract = "";
+      FPContract = "on";
       DenormalFPMath = llvm::DenormalMode::getIEEE();
 
       // FIXME: The target may have picked a non-IEEE default mode here based on
@@ -2711,12 +2715,10 @@ static void RenderFloatingPointOptions(const ToolChain &TC, const Driver &D,
       // ffp-model= is a Driver option, it is entirely rewritten into more
       // granular options before being passed into cc1.
       // Use the gcc option in the switch below.
-      if (!FPModel.empty() && !FPModel.equals(Val)) {
+      if (!FPModel.empty() && !FPModel.equals(Val))
         D.Diag(clang::diag::warn_drv_overriding_flag_option)
-            << Args.MakeArgString("-ffp-model=" + FPModel)
-            << Args.MakeArgString("-ffp-model=" + Val);
-        FPContract = "";
-      }
+          << Args.MakeArgString("-ffp-model=" + FPModel)
+          << Args.MakeArgString("-ffp-model=" + Val);
       if (Val.equals("fast")) {
         optID = options::OPT_ffast_math;
         FPModel = Val;
@@ -2724,7 +2726,7 @@ static void RenderFloatingPointOptions(const ToolChain &TC, const Driver &D,
       } else if (Val.equals("precise")) {
         optID = options::OPT_ffp_contract;
         FPModel = Val;
-        FPContract = "fast";
+        FPContract = "on";
         PreciseFPModel = true;
       } else if (Val.equals("strict")) {
         StrictFPModel = true;
@@ -2812,9 +2814,9 @@ static void RenderFloatingPointOptions(const ToolChain &TC, const Driver &D,
     case options::OPT_ffp_contract: {
       StringRef Val = A->getValue();
       if (PreciseFPModel) {
-        // -ffp-model=precise enables ffp-contract=fast as a side effect
-        // the FPContract value has already been set to a string literal
-        // and the Val string isn't a pertinent value.
+        // -ffp-model=precise enables ffp-contract=on.
+        // -ffp-model=precise sets PreciseFPModel to on and Val to
+        // "precise". FPContract is set.
; } else if (Val.equals("fast") || Val.equals("on") || Val.equals("off")) FPContract = Val; @@ -2907,23 +2909,27 @@ static void RenderFloatingPointOptions(const ToolChain &TC, const Driver &D, AssociativeMath = false; ReciprocalMath = false; SignedZeros = true; - TrappingMath = false; - RoundingFPMath = false; // -fno_fast_math restores default denormal and fpcontract handling DenormalFPMath = DefaultDenormalFPMath; DenormalFP32Math = llvm::DenormalMode::getIEEE(); - FPContract = ""; + if (!JA.isDeviceOffloading(Action::OFK_Cuda) && + !JA.isOffloading(Action::OFK_HIP)) + if (FPContract == "fast") { + FPContract = "on"; + D.Diag(clang::diag::warn_drv_overriding_flag_option) + << "-ffp-contract=fast" + << "-ffp-contract=on"; + } break; } if (StrictFPModel) { // If -ffp-model=strict has been specified on command line but // subsequent options conflict then emit warning diagnostic. - if (HonorINFs && HonorNaNs && - !AssociativeMath && !ReciprocalMath && - SignedZeros && TrappingMath && RoundingFPMath && - (FPContract.equals("off") || FPContract.empty()) && - DenormalFPMath == llvm::DenormalMode::getIEEE() && - DenormalFP32Math == llvm::DenormalMode::getIEEE()) + if (HonorINFs && HonorNaNs && !AssociativeMath && !ReciprocalMath && + SignedZeros && TrappingMath && RoundingFPMath && + DenormalFPMath == llvm::DenormalMode::getIEEE() && + DenormalFP32Math == llvm::DenormalMode::getIEEE() && + FPContract.equals("off")) // OK: Current Arg doesn't conflict with -ffp-model=strict ; else { diff --git a/clang/test/CodeGen/ffp-contract-option.c b/clang/test/CodeGen/ffp-contract-option.c index 52b750795940..857a9c2369ef 100644 --- a/clang/test/CodeGen/ffp-contract-option.c +++ b/clang/test/CodeGen/ffp-contract-option.c @@ -1,9 +1,120 @@ -// RUN: %clang_cc1 -O3 -ffp-contract=fast -triple=aarch64-apple-darwin -S -o - %s | FileCheck %s -// REQUIRES: aarch64-registered-target - -float fma_test1(float a, float b, float c) { -// CHECK: fmadd - float x = a * b; - float y = x + c; - return 
y; +// REQUIRES: x86-registered-target +// RUN: %clang_cc1 -triple=x86_64 %s -emit-llvm -o - \ +// RUN:| FileCheck --check-prefixes CHECK,CHECK-DEFAULT %s + +// RUN: %clang_cc1 -triple=x86_64 -ffp-contract=off %s -emit-llvm -o - \ +// RUN:| FileCheck --check-prefixes CHECK,CHECK-DEFAULT %s + +// RUN: %clang_cc1 -triple=x86_64 -ffp-contract=on %s -emit-llvm -o - \ +// RUN:| FileCheck --check-prefixes CHECK,CHECK-ON %s + +// RUN: %clang_cc1 -triple=x86_64 -ffp-contract=fast %s -emit-llvm -o - \ +// RUN:| FileCheck --check-prefixes CHECK,CHECK-CONTRACTFAST %s + +// RUN: %clang_cc1 -triple=x86_64 -ffast-math %s -emit-llvm -o - \ +// RUN:| FileCheck --check-prefixes CHECK,CHECK-CONTRACTOFF %s + +// RUN: %clang_cc1 -triple=x86_64 -ffast-math -ffp-contract=off %s -emit-llvm \ +// RUN: -o - | FileCheck --check-prefixes CHECK,CHECK-CONTRACTOFF %s + +// RUN: %clang_cc1 -triple=x86_64 -ffast-math -ffp-contract=on %s -emit-llvm \ +// RUN: -o - | FileCheck --check-prefixes CHECK,CHECK-ONFAST %s + +// RUN: %clang_cc1 -triple=x86_64 -ffast-math -ffp-contract=fast %s -emit-llvm \ +// RUN: -o - | FileCheck --check-prefixes CHECK,CHECK-FASTFAST %s + +// RUN: %clang_cc1 -triple=x86_64 -ffp-contract=fast -ffast-math %s \ +// RUN: -emit-llvm \ +// RUN: -o - | FileCheck --check-prefixes CHECK,CHECK-FASTFAST %s + +// RUN: %clang_cc1 -triple=x86_64 -ffp-contract=off -fmath-errno \ +// RUN: -fno-rounding-math %s -emit-llvm -o - \ +// RUN: -o - | FileCheck --check-prefixes CHECK,CHECK-NOFAST %s + +// RUN: %clang -S -emit-llvm -fno-fast-math %s -o - \ +// RUN: | FileCheck %s --check-prefixes=CHECK,CHECK-FPC-ON + +// RUN: %clang -S -emit-llvm -ffp-contract=fast -fno-fast-math \ +// RUN: %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-FPC-ON + +// RUN: %clang -S -emit-llvm -ffp-contract=on -fno-fast-math \ +// RUN: %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-FPC-ON + +// RUN: %clang -S -emit-llvm -ffp-contract=off -fno-fast-math \ +// RUN: %s -o - | FileCheck %s 
--check-prefixes=CHECK,CHECK-FPC-OFF + +// RUN: %clang -S -emit-llvm -ffp-model=fast -fno-fast-math \ +// RUN: %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-FPC-ON + +// RUN: %clang -S -emit-llvm -ffp-model=precise -fno-fast-math \ +// RUN: %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-FPC-ON + +// RUN: %clang -S -emit-llvm -ffp-model=strict -fno-fast-math \ +// RUN: -target x86_64 %s -o - | FileCheck %s \ +// RUN: --check-prefixes=CHECK,CHECK-FPSC-OFF + +// RUN: %clang -S -emit-llvm -ffast-math -fno-fast-math \ +// RUN: %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-FPC-ON + +float mymuladd(float x, float y, float z) { + // CHECK: define{{.*}} float @mymuladd + return x * y + z; + // expected-warning{{overriding '-ffp-contract=fast' option with '-ffp-contract=on'}} + + // CHECK-DEFAULT: load float, float* + // CHECK-DEFAULT: fmul float + // CHECK-DEFAULT: load float, float* + // CHECK-DEFAULT: fadd float + + // CHECK-ON: load float, float* + // CHECK-ON: load float, float* + // CHECK-ON: load float, float* + // CHECK-ON: call float @llvm.fmuladd.f32(float {{.*}}, float {{.*}}, float {{.*}}) + + // CHECK-CONTRACTFAST: load float, float* + // CHECK-CONTRACTFAST: load float, float* + // CHECK-CONTRACTFAST: fmul contract float + // CHECK-CONTRACTFAST: load float, float* + // CHECK-CONTRACTFAST: fadd contract float + + // CHECK-CONTRACTOFF: load float, float* + // CHECK-CONTRACTOFF: load float, float* + // CHECK-CONTRACTOFF: fmul reassoc nnan ninf nsz arcp afn float + // CHECK-CONTRACTOFF: load float, float* + // CHECK-CONTRACTOFF: fadd reassoc nnan ninf nsz arcp afn float {{.*}}, {{.*}} + + // CHECK-ONFAST: load float, float* + // CHECK-ONFAST: load float, float* + // CHECK-ONFAST: load float, float* + // CHECK-ONFAST: call reassoc nnan ninf nsz arcp afn float @llvm.fmuladd.f32(float {{.*}}, float {{.*}}, float {{.*}}) + + // CHECK-FASTFAST: load float, float* + // CHECK-FASTFAST: load float, float* + // CHECK-FASTFAST: fmul fast float + // 
CHECK-FASTFAST: load float, float* + // CHECK-FASTFAST: fadd fast float {{.*}}, {{.*}} + + // CHECK-NOFAST: load float, float* + // CHECK-NOFAST: load float, float* + // CHECK-NOFAST: fmul float {{.*}}, {{.*}} + // CHECK-NOFAST: load float, float* + // CHECK-NOFAST: fadd float {{.*}}, {{.*}} + + // CHECK-FPC-ON: load float, float* + // CHECK-FPC-ON: load float, float* + // CHECK-FPC-ON: load float, float* + // CHECK-FPC-ON: call float @llvm.fmuladd.f32(float {{.*}}, float {{.*}}, float {{.*}}) + + // CHECK-FPC-OFF: load float, float* + // CHECK-FPC-OFF: load float, float* + // CHECK-FPC-OFF: fmul float + // CHECK-FPC-OFF: load float, float* + // CHECK-FPC-OFF: fadd float {{.*}}, {{.*}} + + // CHECK-FFPC-OFF: load float, float* + // CHECK-FFPC-OFF: load float, float* + // CHECK-FPSC-OFF: call float @llvm.experimental.constrained.fmul.f32(float {{.*}}, float {{.*}}, {{.*}}) + // CHECK-FPSC-OFF: load float, float* + // CHECK-FPSC-OFF: [[RES:%.*]] = call float @llvm.experimental.constrained.fadd.f32(float {{.*}}, float {{.*}}, {{.*}}) + } diff --git a/clang/test/CodeGen/ffp-model.c b/clang/test/CodeGen/ffp-model.c new file mode 100644 index 000000000000..21dbc8de99aa --- /dev/null +++ b/clang/test/CodeGen/ffp-model.c @@ -0,0 +1,48 @@ +// REQUIRES: x86-registered-target +// RUN: %clang -S -emit-llvm -ffp-model=fast -emit-llvm %s -o - \ +// RUN: | FileCheck %s --check-prefixes=CHECK,CHECK-FAST + +// RUN: %clang -S -emit-llvm -ffp-model=precise %s -o - \ +// RUN: | FileCheck %s --check-prefixes=CHECK,CHECK-PRECISE + +// RUN: %clang -S -emit-llvm -ffp-model=strict %s -o - \ +// RUN: -target x86_64 | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT + +// RUN: %clang -S -emit-llvm -ffp-model=strict -ffast-math \ +// RUN: -target x86_64 %s -o - | FileCheck %s \ +// RUN: --check-prefixes CHECK,CHECK-STRICT-FAST + +// RUN: %clang -S -emit-llvm -ffp-model=precise -ffast-math \ +// RUN: %s -o - | FileCheck %s --check-prefixes CHECK,CHECK-FAST1 + +float mymuladd(float x, float y, 
float z) { + // CHECK: define{{.*}} float @mymuladd + return x * y + z; + + // CHECK-FAST: fmul fast float + // CHECK-FAST: load float, float* + // CHECK-FAST: fadd fast float + + // CHECK-PRECISE: load float, float* + // CHECK-PRECISE: load float, float* + // CHECK-PRECISE: load float, float* + // CHECK-PRECISE: call float @llvm.fmuladd.f32(float {{.*}}, float {{.*}}, float {{.*}}) + + // CHECK-STRICT: load float, float* + // CHECK-STRICT: load float, float* + // CHECK-STRICT: call float @llvm.experimental.constrained.fmul.f32(float {{.*}}, float {{.*}}, {{.*}}) + // CHECK-STRICT: load float, float* + // CHECK-STRICT: call float @llvm.experimental.constrained.fadd.f32(float {{.*}}, float {{.*}}, {{.*}}) + + // CHECK-STRICT-FAST: load float, float* + // CHECK-STRICT-FAST: load float, float* + // CHECK-STRICT-FAST: call fast float @llvm.experimental.constrained.fmul.f32(float {{.*}}, float {{.*}}, {{.*}}) + // CHECK-STRICT-FAST: load float, float* + // CHECK-STRICT-FAST: call fast float @llvm.experimental.constrained.fadd.f32(float {{.*}}, float {{.*}}, {{.*}} + + // CHECK-FAST1: load float, float* + // CHECK-FAST1: load float, float* + // CHECK-FAST1: fmul fast float {{.*}}, {{.*}} + // CHECK-FAST1: load float, float* {{.*}} + // CHECK-FAST1: fadd fast float {{.*}}, {{.*}} +} diff --git a/clang/test/CodeGen/ppc-emmintrin.c b/clang/test/CodeGen/ppc-emmintrin.c index fa3801f50a01..4a246ff92d76 100644 --- a/clang/test/CodeGen/ppc-emmintrin.c +++ b/clang/test/CodeGen/ppc-emmintrin.c @@ -2,9 +2,9 @@ // REQUIRES: powerpc-registered-target // RUN: %clang -S -emit-llvm -target powerpc64-unknown-linux-gnu -mcpu=pwr8 -ffreestanding -DNO_WARN_X86_INTRINSICS %s \ -// RUN: -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-BE +// RUN: -ffp-contract=off -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-BE // RUN: %clang -S -emit-llvm -target 
powerpc64le-unknown-linux-gnu -mcpu=pwr8 -ffreestanding -DNO_WARN_X86_INTRINSICS %s \ -// RUN: -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-LE +// RUN: -ffp-contract=off -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-LE // CHECK-BE-DAG: @_mm_movemask_pd.perm_mask = internal constant <4 x i32> <i32 -2139062144, i32 -2139062144, i32 -2139062144, i32 -2139078656>, align 16 // CHECK-BE-DAG: @_mm_shuffle_epi32.permute_selectors = internal constant [4 x i32] [i32 66051, i32 67438087, i32 134810123, i32 202182159], align 4 diff --git a/clang/test/CodeGen/ppc-xmmintrin.c b/clang/test/CodeGen/ppc-xmmintrin.c index d3f18bfbb1e5..4aff7a7c3dda 100644 --- a/clang/test/CodeGen/ppc-xmmintrin.c +++ b/clang/test/CodeGen/ppc-xmmintrin.c @@ -2,11 +2,11 @@ // REQUIRES: powerpc-registered-target // RUN: %clang -S -emit-llvm -target powerpc64-unknown-linux-gnu -mcpu=pwr8 -ffreestanding -DNO_WARN_X86_INTRINSICS %s \ -// RUN: -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-BE +// RUN: -ffp-contract=off -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-BE // RUN: %clang -x c++ -fsyntax-only -target powerpc64-unknown-linux-gnu -mcpu=pwr8 -ffreestanding -DNO_WARN_X86_INTRINSICS %s \ // RUN: -fno-discard-value-names -mllvm -disable-llvm-optzns // RUN: %clang -S -emit-llvm -target powerpc64le-unknown-linux-gnu -mcpu=pwr8 -ffreestanding -DNO_WARN_X86_INTRINSICS %s \ -// RUN: -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-LE +// RUN: -ffp-contract=off -fno-discard-value-names -mllvm -disable-llvm-optzns -o - | llvm-cxxfilt -n | FileCheck %s --check-prefixes=CHECK,CHECK-LE // RUN: %clang -x c++ -fsyntax-only -target 
powerpc64le-unknown-linux-gnu -mcpu=pwr8 -ffreestanding -DNO_WARN_X86_INTRINSICS %s \ // RUN: -fno-discard-value-names -mllvm -disable-llvm-optzns diff --git a/clang/test/Driver/fp-model.c b/clang/test/Driver/fp-model.c index 5fa9d110dd83..0824b3e2c596 100644 --- a/clang/test/Driver/fp-model.c +++ b/clang/test/Driver/fp-model.c @@ -99,7 +99,7 @@ // RUN: %clang -### -nostdinc -ffp-model=precise -c %s 2>&1 \ // RUN: | FileCheck --check-prefix=CHECK-FPM-PRECISE %s // CHECK-FPM-PRECISE: "-cc1" -// CHECK-FPM-PRECISE: "-ffp-contract=fast" +// CHECK-FPM-PRECISE: "-ffp-contract=on" // CHECK-FPM-PRECISE: "-fno-rounding-math" // RUN: %clang -### -nostdinc -ffp-model=strict -c %s 2>&1 \ diff --git a/clang/test/Misc/ffp-contract.c b/clang/test/Misc/ffp-contract.c new file mode 100644 index 000000000000..0d26905d4ef2 --- /dev/null +++ b/clang/test/Misc/ffp-contract.c @@ -0,0 +1,10 @@ +// RUN: %clang_cc1 -O3 -ffp-contract=fast -triple=aarch64-apple-darwin \ +// RUN: -S -o - %s | FileCheck --check-prefix=CHECK-FMADD %s +// REQUIRES: aarch64-registered-target + +float fma_test1(float a, float b, float c) { + // CHECK-FMADD: fmadd + float x = a * b; + float y = x + c; + return y; +} </cut>
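
The tests in the patch above check which IR is emitted for `x * y + z` under each `-ffp-contract` setting. As a rough illustration of why that choice can change a program's numeric result, here is a minimal C sketch. It is not part of the patch. The function names (`muladd_unfused`, `muladd_fused_emulated`, `show_contraction_effect`) are made up for this example, and the fused path is only emulated by doing the arithmetic in double, which is an assumption that works for these inputs but is not bit-identical to a hardware FMA in general. The inputs are chosen so that x*y = 1 + 2^-11 + 2^-24 is not representable in float.

```c
#include <assert.h>
#include <stdio.h>

/* Unfused: the product is rounded to float before the add.
   The volatile store keeps the compiler from contracting the two
   operations, whatever -ffp-contract setting is in effect. */
float muladd_unfused(float x, float y, float z) {
    volatile float product = x * y;   /* first rounding */
    return product + z;               /* second rounding */
}

/* Fused (emulated, hypothetical helper): the product of two floats is
   exact in double, so for the inputs below only the final conversion
   to float rounds. This stands in for a hardware FMA; it is not
   guaranteed to match fmaf() for every input. */
float muladd_fused_emulated(float x, float y, float z) {
    double exact_product = (double)x * (double)y;
    return (float)(exact_product + (double)z);
}

/* x*x = 1 + 2^-11 + 2^-24, which rounds to 1 + 2^-11 in float. */
void show_contraction_effect(void) {
    float x = 1.0f + 0x1p-12f;
    float z = -(1.0f + 0x1p-11f);
    float unfused = muladd_unfused(x, x, z);
    float fused = muladd_fused_emulated(x, x, z);
    assert(unfused != fused);
    printf("unfused: %g\nfused:   %g\n", unfused, fused);
}
```

Calling `show_contraction_effect()` from a `main` should report 0 for the unfused result and 2^-24 (about 5.96e-08) for the fused one: the unfused path loses the low bit of the product when it is rounded to float, while the fused path keeps it until the final rounding.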

4 years, 6 months

[ACTIVITY] week ending Nov. 14 2021

by Alex Bennée

VirtIO Initiative ([STR-9])
===========================

- project admin

[STR-9] <https://linaro.atlassian.net/browse/STR-9>
[upstream rust-vmm sync meeting] <https://etherpad.opendev.org/p/rust-vmm-sync-2021&sa=D&source=calendar&ust=…>
[proposal] <https://github.com/rust-vmm/vhost-device/pull/57>

vhost-device maintainer effort ([UM-196])

- did a bunch of review on [vhost-device crate]

[UM-196] <https://linaro.atlassian.net/browse/UM-196>
[vhost-device crate] <https://github.com/rust-vmm/vhost-device>

QEMU Upstream Work ([UM-2])
===========================

[UM-2] <https://linaro.atlassian.net/browse/UM-2>

Upstream MTTCG tests ([QEMU-52])

- posted [kvm-unit-tests PATCH v3 0/3] GIC ITS tests
  Message-Id: <20211112114734.3058678-1-alex.bennee(a)linaro.org>
- might as well flush the tree state as I left it
- posted [RFC PATCH] hw/intc: clean-up error reporting for failed ITS cmd
  Message-Id: <20211112170454.3158925-1-alex.bennee(a)linaro.org>
- re-based [mttcg tests to current state and fixed up]

[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
[mttcg tests to current state and fixed up] <https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>

Completed Reviews [2/2]
=======================

[PATCH v2 0/3] Some watchpoint-related patches
Message-Id: <163662450348.125458.5494710452733592356.stgit@pasha-ThinkPad-X280>

[PATCH 0/5] Update linux-headers + NOIRQ support for KVM gdbstub
Message-Id: <20211111110604.207376-1-pbonzini(a)redhat.com>

Absences
========

- none

Current Review Queue
====================

TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>

TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>

TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>

TODO [PATCH] softmmu: fix watchpoint processing in icount mode
Message-Id: <163101424137.678744.18360776310711795413.stgit@pasha-ThinkPad-X280>

--
Alex Bennée

4 years, 6 months

[ACTIVITY] report week ending 11 Nov

by Peter Maydell

Progress (short week, 3 days)

* UM-2 [QEMU upstream maintainership]
  - recent changes to QEMU's PSCI emulation broke booting of guest code at EL3 on the imx7 board, which was previously accidentally relying on PSCI-emulation-via-SMC not getting in its way despite being enabled. We need to make this board disable PSCI when the guest code is booting to EL3, as the virt board does, but it's trickier here because the CPU-creation code is hidden inside a model of an SoC object. After some on-list discussion I have a plan for how to restructure this, and need to write some code...
* QEMU-420 [GICv4 emulation]
  - re-read the GIC architecture specification, acquired a better understanding of the required work, and broke this epic down into stories
  - discussed with Leif how the ITS support should be landed in the sbsa-ref board

Misc:
* higher-than-usual amount of meetings and meeting-prep this week

-- PMM

4 years, 6 months

[TCWG CI] 456.hmmer slowed down by 3% after llvm: [flang] Fix crash in semantic error recovery situation

by ci_notify＠linaro.org

After llvm commit f411c1dd95092139c8b992260705ac0b75c8583f
Author: Peter Klausler <pklausler(a)nvidia.com>

    [flang] Fix crash in semantic error recovery situation

the following benchmarks slowed down by more than 2%:
- 456.hmmer slowed down by 3% from 7600 to 7806 perf samples

The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…

Reproduce builds:
<cut>
mkdir investigate-llvm-f411c1dd95092139c8b992260705ac0b75c8583f
cd investigate-llvm-f411c1dd95092139c8b992260705ac0b75c8583f

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach f411c1dd95092139c8b992260705ac0b75c8583f
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach c0b298fc213c1b33e97ca72fba58597365375875
../artifacts/test.sh

cd ..
</cut>

Full commit (up to 1000 lines):
<cut>
commit f411c1dd95092139c8b992260705ac0b75c8583f
Author: Peter Klausler <pklausler(a)nvidia.com>
Date: Tue Nov 2 16:41:15 2021 -0700

    [flang] Fix crash in semantic error recovery situation

    A CHECK() in semantics is triggering when analyzing a program
    with an undefined derived type pointer because the CHECK is
    expecting a new error message to have been issued in a function
    but not allowing for the case that a diagnostic could have been
    produced earlier. Adjust the predicate.

    Differential Revision: https://reviews.llvm.org/D113307
---
 flang/lib/Semantics/expression.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/flang/lib/Semantics/expression.cpp b/flang/lib/Semantics/expression.cpp
index 331b9b2cf5bc..8ee8c9a9c9ce 100644
--- a/flang/lib/Semantics/expression.cpp
+++ b/flang/lib/Semantics/expression.cpp
@@ -1916,7 +1916,7 @@ auto ExpressionAnalyzer::AnalyzeProcedureComponentRef(
             "Base of procedure component reference is not a derived-type object"_err_en_US);
       }
     }
-    CHECK(!GetContextualMessages().empty());
+    CHECK(context_.AnyFatalError());
     return std::nullopt;
   }

</cut>
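
The 3% figure in the report above can be checked from the two sample counts it quotes (7600 and 7806). The following C sketch shows that arithmetic. The function names and the way the 2% threshold is applied are assumptions made for illustration; the actual TCWG CI scripts may compute and round the slowdown differently.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change in perf samples, in percent (positive means slower). */
double slowdown_percent(long baseline_samples, long new_samples) {
    assert(baseline_samples > 0);
    return 100.0 * (double)(new_samples - baseline_samples)
                 / (double)baseline_samples;
}

/* Hypothetical helper: is the slowdown above the reporting threshold? */
int exceeds_threshold(long baseline_samples, long new_samples,
                      double threshold_percent) {
    return slowdown_percent(baseline_samples, new_samples) > threshold_percent;
}

/* Prints the slowdown for the 456.hmmer numbers quoted in the report. */
void print_hmmer_example(void) {
    printf("456.hmmer: %.2f%% slower\n", slowdown_percent(7600, 7806));
}
```

With the quoted counts the change is 206 samples out of 7600, or about 2.71%, which is above the 2% reporting threshold and rounds to the 3% shown in the subject line.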

4 years, 6 months

ARM Cortex A55 support

by Stefan Johansson A

Hello, We have been using Linaro GCC 7.5-2019.12 for the A53. As we move on to new tech, there seems to be no support for "-mcpu=cortex-a55". Today, we use the aarch64-elf- toolchain. Which GCC do you suggest we start using for the A55? Thanks, Stefan

4 years, 6 months
